nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0
17 stars 11 forks source link

gbasf2 check `.root` extension #184

Closed 0ctagon closed 1 year ago

0ctagon commented 1 year ago

When downloading gbasf2 output files, the script doesn't work if the name doesn't contain multiple ..

I propose replacing this line: https://github.com/nils-braun/b2luigi/blob/6844963ff7bd23a0e781e87bed559f62c5e862f0/b2luigi/batch/processes/gbasf2.py#L593

by:

if not output_file_extensions.endswith("root"):

Tested locally with simple output file like output.root.

meliache commented 1 year ago

Thanks for reporting!

Ah okay, I the problem is that os.path.splitext expects a string with a dot . to split on, which is how we originally extracted extensions. The problem is that it only gives the last extension, therefore in the line above we did a normal string split on "." first https://github.com/nils-braun/b2luigi/blob/6844963ff7bd23a0e781e87bed559f62c5e862f0/b2luigi/batch/processes/gbasf2.py#L589-L591 This gives us a string with all filename extensions, e.g. I tested the following in IPython:

In [31]: "output.mdst.root".split(".", maxsplit=1)
Out[31]: ['output', 'mdst.root']

And then to check if it's a rootfile I used path.splitext on the last element. But with output.root the last element is just root and path.splitext will not find an extension in that.

In [32]: "output.root".split(".", maxsplit=1)
Out[32]: ['output', 'root']

In [33]: os.path.splitext("output.root".split(".", maxsplit=1)[-1])
Out[33]: ('root', '')

In [36]: os.path.splitext("output.root".split(".", maxsplit=1)[-1])[-1]
Out[36]: ''

Your solution is find, but it will not guard against filenames like output.groot for example, where the last extension ends with root but is not root. Such files are unlikely to appear in grid jobs, but I would prefer to program defensively, assuming the worst.

A simple solution would be to use another string split, which seems to work for one or multiple dots in the extension:

 In [28]: "output.root".split(".", maxsplit=1)[-1].split(".")[-1]
Out[28]: 'root'

In [29]: "output.mdst.root".split(".", maxsplit=1)[-1].split(".")[-1]
Out[29]: 'root'

Actually, when we do a PR, I'd like to outsource this gbasf2-specific splitting+check functionality into its own function and add some unittests for that, just simply check that it does what's expected for output.root and output.mdst.root for example.