nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0
17 stars 11 forks source link

Fix validation of downloaded grid dataset for duplicate LFNs #68

Closed meliache closed 3 years ago

meliache commented 3 years ago

For checking if a grid dataset has been completely downloaded, I compare the output of gb2_ds_list (lists the remote dataset) with the list of downloaded files. However, since gbasf2 release 5, gb2_ds_list also lists duplicate outputs from jobs that had been rescheduled. The gb2_ds_get download script filters those out, but with this, I can't simply use the comparison anymore.

Thus, via this PR I manually filter out the duplicated remote LFN's and only show those with the highest number. For the logic, I adapted some of the code of gbasf2/BelleDirac, in particular the findDuplicatedJobID() function:

Sadly, it also just uses string parsing. I was a little worried that it might brake for old datasets produced with old gbasf2 versions, but I try to recognize if the job LFN is the new version and if it's not I don't remove anything.

There is a gbasf2 issue to do this filtering automatically, see BIIDCD-1240, so after that's merged this code might be simplified.

My wish is that it will be possible to just rely on the exit code of gb2_ds_get to know if a dataset has been successfully and completely downloaded. This whish has been also formulated into an issue, see BIIDCD-1238.

As far as I know @bilokin already had a workaround by disabling the comparison of datasets altogether. Maybe he or some other people could test this. Before doing this fix, I had merged the PR #67 which refactored the download function and made the partial download directories persistent, which made debugging of the download and the implementation of changes much simpler. However, for that PR I hadn't waited for reviews of other people. Maybe it would be good if others could check if their b2luigi works for the head of this branch. Remember, you can just install it for testing, e.g. with

python3 -m pip install --upgrade --user "git+https://github.com/nils-braun/b2luigi@bugfix/workaround-duplicate-files-in-gbasf2v5"
Bilokin commented 3 years ago

Sure, I can test it.

Bilokin commented 3 years ago

Unfortunately, it doesn't work. I think we need a unit test for the get_unique_lfns function, as it is not so trivial. I have these real-life examples:

grid_list = ['Upsilon4SBpcandee_00000_job181817516_00.root',
'Upsilon4SBpcandee_00001_job181817517_00.root',
'Upsilon4SBpcandee_00002_job181817518_00.root',
'Upsilon4SBpcandee_00003_job181817519_00.root',
'Upsilon4SBpcandee_00004_job181817520_00.root',
'Upsilon4SBpcandee_00005_job181817521_00.root',
'Upsilon4SBpcandee_00006_job181817522_00.root',
'Upsilon4SBpcandee_00007_job181817523_00.root',
'Upsilon4SBpcandee_00008_job181817524_00.root',
'Upsilon4SBpcandee_00009_job181817525_00.root',
'Upsilon4SBpcandee_00010_job181817526_00.root',
'Upsilon4SBpcandee_00011_job181817527_00.root',
'Upsilon4SBpcandee_00012_job181817528_00.root',
'Upsilon4SBpcandee_00012_job181817528_02.root',
'Upsilon4SBpcandee_00013_job181817529_00.root',
'Upsilon4SBpcandee_00014_job181817530_00.root',
'Upsilon4SBpcandee_00015_job181817531_00.root',
'Upsilon4SBpcandee_00016_job181817532_00.root',
'Upsilon4SBpcandee_00017_job181817533_02.root',
'Upsilon4SBpcandee_00018_job181817534_01.root',
'Upsilon4SBpcandee_00019_job181817535_00.root',
]
meliache commented 3 years ago

Oh, thanks for noticing that. I thought I had tested it on the command line with some example output but I now realized that my output name contains underscores, e.g. B_ntuple.root. My _get_lfn_upto_reschedule_number() worked for that, not for your case, as I kind of counted the underscores from the beginning:

"_".join(lfn.split("_")[0:4])

I also do dumb mistakes. Now I changed that to to the following:

"_".join(lfn.split("_")[:-1])

which works in both cases:

In [69]: "_".join('u4s_B_candee_00019_job181817535_00.root'.split("_")[:-1])
Out[69]: 'u4s_B_candee_00019_job181817535'

In [70]: "_".join('Upsilon4SBpcandee_00019_job181817535_00.root'.split("_")[:-1])
Out[70]: 'Upsilon4SBpcandee_00019_job181817535'

Definetly should make a unittest for that, because this is a case that is easily testable. Thanks for notifying me.

Bilokin commented 3 years ago

Thanks, after this change I can confirm that my downloads are finishing successfully

meliache commented 3 years ago

I added some unit tests for my LFN helper functions introduced here and reverted a commit from another feature branch that accidentally made it into this.