nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0
17 stars 11 forks source link

Tasks fail if part of the outputs (but not all) already exist. #105

Open meliache opened 3 years ago

meliache commented 3 years ago

When task (with b2luigi.on_temporary_files has outputs A and B, and A is missing but B exists, the moving of the temporary files after the task had been successful fails if one of the outputs already exists.

Due to the nature of on_temporary_files, this doesn't happen naturally when running tasks because the final outputs can only exist if the task had been successful and thus all outputs should be there. However, this can happen when changing output definitions or when deleting some of the outputs but not others manually.

I solved this in my tasks by extending the process method with a clean-up step, but maybe this is something we want to fix in the moving of temporary files.

Or is this a behaviour a feature and we want the users to be explicitwhat to do with existing data to prevent the loss of it? After all, luigi itself does not clean-up of outputs either. Anyway the user uses a temporary wrapper, if they force-writes the output in the process-method, this will not help because he will only force-write the temporary file location, so I think we should fix that...

meliache commented 3 years ago

Okay I just checked and the copying of outputs after success is not done by b2luigi but by the luigi temporary_path function, in particular the exception that the target exists is called in the function rename_dont_move.

So this seems to be an idea or conscious design-design from luigi. Does anyone experienced with luigi have an idea why they implemented it that way? Just wanted to ask here before opening a luigi issue/PR.

nils-braun commented 3 years ago

Luigi Tasks should not have multiple output files in theory. That is against the atomicity and idempotency of tasks (two properties you normally aim for in workflow managers). So yes, it is a design decision.

Typically, you would use a _SUCCESS file if this is you use-case after all files have been written (and only depend on it).

However, in practice these things do occur and fixing it seems like a plausible solution. Just wanted to give some background.