nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0

b2luigi/batch/processes/gbasf2.py: improve printout info for download of datasets #98

Closed ArturAkh closed 3 years ago

ArturAkh commented 3 years ago

Also include checks for files that are missing locally, if such entries are present, and for cases where some unneeded files were downloaded.
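
For illustration, a minimal sketch of such a completeness report, assuming the expected file names for the dataset are already known; the helper name and arguments here are hypothetical and not the actual code in `gbasf2.py`:

```python
from pathlib import Path


def report_dataset_mismatch(expected_file_names, download_dir):
    """Print which expected files are missing locally and which local files are unexpected."""
    local_file_names = {f.name for f in Path(download_dir).iterdir() if f.is_file()}
    expected = set(expected_file_names)

    missing_files = expected - local_file_names
    additional_files = local_file_names - expected

    if missing_files:
        print(f"{len(missing_files)} files of the dataset are missing locally:")
        for name in sorted(missing_files):
            print(f"  {name}")
    if additional_files:
        print(f"{len(additional_files)} downloaded files are not part of the dataset:")
        for name in sorted(additional_files):
            print(f"  {name}")
    return not missing_files
```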

meliache commented 3 years ago

Out of curiosity, did you see any cases where additional_files was not empty? Ideally I thought this should not happen, but it's not out of the question, since we use wildcard downloads and gbasf2 often has bugs with them, so this would be useful for debugging.

But I would disable printing all downloaded files by default, at least in the case of successful downloads.

Btw, I'm happy that you picked up my remark that instead of `len(list) > 0` you can just use `if list`, it looks very readable!
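
For reference, the idiom just relies on an empty list being falsy in Python:

```python
additional_files = ["file_1.root", "file_2.root"]

# Explicit length check:
if len(additional_files) > 0:
    print("Found additional files")

# Equivalent, more idiomatic truthiness check (an empty list is falsy):
if additional_files:
    print("Found additional files")
```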

FYI, a note not directly related to this PR:

One major task that I would like to do in the future (not this PR) is to introduce proper logging to the gbasf2 batch. When I first wrote it, my idea was that the log_file_dir should only contain logs of the actual job on the batch, which is why I copy the grid output sandbox there by default. All the outputs of the gbasf2 wrapper itself are printed to stdout and stderr, same for the output of gbasf2 commands.

But in hindsight I think it would be best to also get all the gbasf2 output into log files, both the output of the gbasf2 subprocesses and all the warnings and printouts in the wrapper. The exception is gbasf2 commands that require user input, which at the moment is only the gb2_proxy_init command, where the user has to enter the password for their grid certificate. Also, I think the submission and download commands require user input if --force is disabled via settings. But it should be possible to both print the output and log it. Maybe I should create an issue for that.
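
As a minimal sketch of what that could look like, assuming a logger that writes into the task's log_file_dir while still echoing to the terminal; the function names and the interactive flag are assumptions, not existing b2luigi API:

```python
import logging
import subprocess


def get_gbasf2_logger(log_file_dir):
    """Return a logger that writes to a file in log_file_dir and to stdout."""
    logger = logging.getLogger("gbasf2_wrapper")
    logger.setLevel(logging.INFO)
    file_handler = logging.FileHandler(f"{log_file_dir}/gbasf2_wrapper.log")
    stream_handler = logging.StreamHandler()
    for handler in (file_handler, stream_handler):
        logger.addHandler(handler)
    return logger


def run_and_log_gbasf2_command(command, logger, interactive=False):
    """Run a gbasf2 command; capture and log its output unless it needs user input."""
    if interactive:
        # e.g. gb2_proxy_init asking for the certificate password: don't capture
        # the output, let the user interact with the terminal directly
        return subprocess.run(command, check=True)
    result = subprocess.run(command, check=True, capture_output=True, encoding="utf-8")
    logger.info(result.stdout)
    if result.stderr:
        logger.warning(result.stderr)
    return result
```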

One issue which I've been eyeing for a long time but haven't yet taken the time to implement is #5 ("Include proper task overview while running"), for a gridcontrol-like task view. But then there would be no screen real estate for very verbose printouts, which would go to the log files, or only to stdout if the task is e.g. run with --test or with the task overview disabled via some setting.

meliache commented 3 years ago

Out of curiosity, did you see any cases where additional_files was not empty?

Oh, I saw you made another PR, #99, and it looks like that answers my question :sweat_smile:

meliache commented 3 years ago

But I would disable printing all downloaded files by default, at least in the case of successful downloads.

What I mean is this line, which I just realized was already here before the PR and was introduced by me :see_no_evil:

```python
if not self._local_gb2_dataset_is_complete(output_file_name, check_temp_dir=True, verbose=True):
```

I think what I wanted was to only print the verbose output if the local dataset is not complete, but I just realized that the verbose output is printed in any case, whether the download was successful or not :facepalm:. Not your fault, so I could fix this separately from this PR.
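
A possible fix could be to run the check quietly first and repeat it verbosely only when it fails; just a sketch, under the assumption that the check is cheap enough to run twice:

```python
# First check quietly, and only repeat with verbose=True when the dataset is
# incomplete, so the file listing is printed just for failed downloads.
dataset_is_complete = self._local_gb2_dataset_is_complete(output_file_name, check_temp_dir=True)
if not dataset_is_complete:
    self._local_gb2_dataset_is_complete(output_file_name, check_temp_dir=True, verbose=True)
    # ... handle the incomplete download as before
```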

ArturAkh commented 3 years ago

Out of curiosity, did you see any cases where additional_files was not empty?

Actually, in a very specific case this could happen ^^.

An example of this is when a gbasf2 task is resubmitted with exactly the same name (which can be done if you delete the jobs of the task with that name beforehand).

If you have forgotten to remove the outputs of the previous task with the same name from the storage, you will most probably get output names that are identical except for the job ID. Such a file is not recognized as a duplicate (since the duplicate counter can be the same), but it still is one, because of the different job numbers.

In that case you would download too many files.
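
To illustrate the corner case (with a made-up naming scheme, the real gbasf2 output names may look different):

```python
# Outputs of the old and the new submission share the task name and the
# duplicate counter, but carry different job numbers:
leftover_file = "my_task_ntuple_job_00123_0.root"  # from the deleted first submission
current_file = "my_task_ntuple_job_00456_0.root"   # from the resubmission

# A wildcard download for "my_task_ntuple_*" matches both names, and a simple
# name comparison does not identify them as duplicates of each other:
print(leftover_file == current_file)  # False -> both end up in the download
```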