nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0
17 stars 11 forks source link

Hardcoded extension ".root" in _get_gbasf2_dataset_query in b2luigi/batch/processes/gbasf2.py #180

Closed schmitca closed 1 year ago

schmitca commented 1 year ago

When using b2luigi with gbasf2 jobs with udst outputs, these will have the ending ".udst.root". Then the download gbasf2_data_query will be faulty:

f"sub*/{output_file_stem}_*{output_file_ext}"= sub*/filename.udst_*.root

instead of "sub*/filename_*.udst.root". This will cause the download to fail.

meliache commented 1 year ago

I'm not 100% sure, can you clarify

gbasf2_data_query will be faulty: f"sub*/{output_file_stem}_*{output_file_ext}"= sub*/filename.udst_*.root instead of "sub*/filename_*.udst.root". This will cause the download to fail.

Which is correct, the first glob-expression or the second glob-expression?

I looked the b2luigi code for _get_gbasf2_dataset_query to see what's actually happening, the important part is:

   output_file_stem, output_file_ext = os.path.splitext(output_file_name)
   # [...]
   dataset_query_string = \
            f"{output_lpn_dir}/{self.gbasf2_project_name}/sub*/{output_file_stem}_*{output_file_ext}"

We didn't hardcode .root as you claim in the issue name. Instead, os.path.splitext just gives the top-level filename-extension , i.e. the part after the last dot. I tested it in IPython:

os.path.splitext("/path/to/filename.udst.root")
Out[11]: ('/path/to/filename.udst', '.root')

which results in the glob pattern "sub*/filename.udst*.root". Does gbasf2 always insert the job number after the first dot in the filename instead? If so, I'm wondering what happens if you add arbitrary dots to your filename, e.g. a.b.c.root. Will it create outputs like a_*.b.c.root? If I know that this is generally the case, and they don't explicityly check for udst in the filename, then it will be easy to implement.

Btw, from reading the release notes I thought that gbasf2 puts udst and mdst files into separate subdirectories, which is why I had created issue #58 in the past, but it seems that this is not the case anymore. I never worked with udst/mdst files and currently am not even using gbasf2 actively anymore. Actually I don't have time to work on b2luigi due to my thesis but this seems like it should be easy to fix. Thanks for reporting :pray: .

meliache commented 1 year ago

I created a hotfix in pull request #181, but I didn't test that at all and don't have time for that right now, so I would wait for your feedback on that and if it works merge. I'd also be happy if you code check my code if it makes sense.