vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License
269 stars 45 forks source link

MD5Sum for output, as an action option #1496

Closed gaow closed 2 years ago

gaow commented 2 years ago

@BoPeng I'm developing pipelines for a consortium where we want to distribute the otuputs from SoS runs. We believe it is best to also distribution the MD5SUM associated with outputs. I wonder if we can make SoS do it. The proposed interface is on the action level and not as a parameter. For example:

bash: ... , md5sum = True

That'd be very helpful for our project and probably for others too. What do you think?

BoPeng commented 2 years ago

hold on, trying really hard to make CI works again now.

BoPeng commented 2 years ago

Tests are passing..

Why cannot you do

[step]
output: whatever
sh:
   processing

sh: expand=True
  md5sum {_output} > {_output.with_suffix('md5')}

The only case that you cannot do this is when sh is in task, I think.

gaow commented 2 years ago

Sure. It's just that we will have to do it for every workflow step. I think it warrants a shortcut for this feature. For example the decompress option in action download is such a shortcut

BoPeng commented 2 years ago

But bash is an action, which in theory does not know _output. The only valid place is

[step]
output: whatever

and maybe something like

[steo]
output: whatever, md5=True

Since we are essentially adding an option to file_target.target_signature, or calling something like file_target.save_md5sum(), it could be something like -S md5 (enhanced version of -s that saves md5 for all outputs, and potentially checks md5 for all inputs.

gaow commented 2 years ago

I see ... Are you saying we can do

...
_output.save_md5sum()

at the end of a step? This is good enough I think. We have something like _input.zap() that has a similar interface

BoPeng commented 2 years ago

I tend to add -S md5 to specify the "type" of signatures. It is ill-defined since we cannot completely rely on md5 since we have other types of targets but in the future we could have something like -S redis://server to use a dedicated signature service.

The advantage of this approach is that we can relatively easily do

sos run workflow -s build -S md5

to generate md5 after the workflow has been completed, and

sos run workflow -s assert -S md5

to check md5.

However, sos is currently bloated all such features (_input.zap() is one of them) and I am hesitating if we should add additional features like this.

BoPeng commented 2 years ago

The patch is actually quite simple. When -S md5 is specified, file_target will use complete md5 signature for signature checking, and save .md5 file as an artifact. This applies to all file_target objects and the usual -s ignore, -s build should apply.

gaow commented 2 years ago

Thank you @BoPeng I tested it. It works well!

gw@cordata:~/tmp/05-May-2022$ sos run test.sos  -S md5
INFO: Running test: 
INFO: test is completed.
INFO: test output:   2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is executed successfully with 1 completed step.
gw@cordata:~/tmp/05-May-2022$ sos run test.sos  -S md5
INFO: Running test: 
INFO: test (index=0) is ignored due to saved signature
INFO: test output:   2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is ignored with 1 ignored step.
gw@cordata:~/tmp/05-May-2022$ sos run test.sos  
INFO: Running test: 
INFO: test (index=0) is ignored due to saved signature
INFO: test output:   2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is ignored with 1 ignored step.

I like the behavior when -S md5 is dropped during a rerun the partial signature is still recognized and execution skipped.

BoPeng commented 2 years ago

I like the behavior when -S md5 is dropped during a rerun the partial signature is still recognized and execution skipped.

This could be a bug if your signatures were built without -S md5. With -S md5, the signatures are built using complete md5 instead of partial md5 (using the first and last 500M data), then validation without -S md5 should fail. I could potentially use partial md5 for old signatures but then I will need to calculate md5 twice.

This is a proof-of-concept implementation. We can fine-tune its behavior without future development in mind.

gaow commented 2 years ago

Hmm I see. I was under the impression that when we put in -S md5 we indeed calculate md5 twice, which is the safest and reasonable behavior -- if you agree then we can finalize on that and close this issue without future development in mind? I tried to test it and thought that was the case but now I realize my test above is incomplete!

BoPeng commented 2 years ago

Making -S md5 the "add-on" feature that does not affect existing signature is indeed the safest option. I suppose calculating md5 twice is ok as long as we do not read the file twice.

You appeared to want to validate input file (and output file?) if the corresponding .md5 exists (#1454), are you still interested?

gaow commented 2 years ago

@BoPeng your proposal in #1454 is hacky but efficient. I don't find it scary. Also, when -S md5 is on there is no need to validate output files because absence of md5sum file indicate problem. I'm not sure about input files to the start of a pipeline and I dont think we need to do anything about it. So with your proposal there plus this patch we should have 2 ways to solved #1454 and that's good enough to me, unless you have other proposals?

BoPeng commented 2 years ago

Currently the patch generates md5 but missing .md5 does not indicate a failed step (at least not intentionally), which is what #1454 wants. The complexity comes with the two sets of signatures.

With -s default -S md5

Now, #1454 talks about a case when SoS crashes and partial output. Regularly sos would proceed (since there is no signature) but you would like to

gaow commented 2 years ago

but missing .md5 does not indicate a failed step

I would argue that this is at least a very good proxy to a failed step and is enough to warrant concerns when it happens?

If there is no signature and existing md5

Good point. What's the behavior here? I suppose partial md5 should be built as a result of full md5 check and rerun otherwise?

If the output has no signature and no md5. remove it before starting.

Sorry i thought that is already the behavior? For example, if some steps append to an output using >> then it should start afresh anyways if it is not skipped. I think it is desired and not dangerous -- would you provide a user case when this is not a good idea?

BoPeng commented 2 years ago

If the output has no signature and no md5. remove it before starting. Sorry i thought that is already the behavior?

You are right, your "sos crash" scenario confused me. I suppose when that happens, sos knows the file is corrupted and will remove it before starting, but users do not, so having .md5 is an indication only to users?

Then let us do this (with -S md5)

  1. When sos signature is read, if there is an md5 file , it will be used to validate the content when the file is used, regardless of existence of sos signature.
  2. When sos signature is written, a new .md5 file will be written.
gaow commented 2 years ago

so having .md5 is an indication only to users?

Yes although as you proposed, SoS can also leverage that information when available.

I suppose when that happens, sos knows the file is corrupted and will remove it before starting, but users do not

Right, SoS would know this only because it does not have a signature under ~/.sos

Then let us do this (with -S md5)

Perfect!

BoPeng commented 2 years ago

Two problems:

  1. When implementing getting both partial and full md5 using one-read, I identified a bug from the previous partial MD5 calculation. I fixed a bug in this PR but then all old signatures will be wrong (sos run will rerun, unless you use -s build).

  2. I cannot validate existing md5 all the time because that will cause md5 to be calculated all the time when files are accessed. I changed the behavior to only generate md5 if md5 files does not exist. However, existing md5 files will also be used for validating (e.g. when bypassing a step).

gaow commented 2 years ago

I fixed a bug in this PR but then all old signatures will be wrong

Oh that's a bit unfortunate but it's good you caught that! I cannot speak for others, but our consortium is starting to run my pipeline this month but I think peopel are just starting. If we can make a release some time soon I'll ask people to update. My own group is small / flexible -- i'll just ask them to run with -s build.

However, existing md5 files will also be used for validating (e.g. when bypassing a step).

and when that fails the output and existing md5 will be removed anyways priorto generating the new output and md5 right? Then I dont see a problem.

BoPeng commented 2 years ago

Oh that's a bit unfortunate but it's good you caught that!

The bug caused the last 1M of data not counted towards the hash (so 15 M was used instead of 16M). It was an aribitrarily defined partial MD5 so it was never a problem. I can match the new implementation with the old one but it is a bit weird that way.

when that fails the output and existing md5 will be removed anyways priorto generating the new output and md5 right?

Currently

  1. we generate output, then create md5 when signature is saved.
  2. if we generate new output, since the new output is newer, the md5 will be updated.

This is technically not a problem but if sos crash in between, we may have corrupted or missing output file with old md5 file, so it is better to clear .md5 when we start a step.

gaow commented 2 years ago

I can match the new implementation with the old one but it is a bit weird that way.

Again I cannot speak for the others but the new change is fine by me and I can inform SoS/DSC users i know of. I'll leave that for you to decide ...

so it is better to clear .md5 when we start a step.

Great -- with that i dont see other potential problems.

BoPeng commented 2 years ago

Great -- with that i dont see other potential problems.

Removing existing output files can be dangerous, can we change it to renaming to FILENAME.random_hash? If someone rerun some command by accident, at least the old output files are still there and can be restored. It can be a mess to have these .fa89erf.tmp files around but it should not be too difficult to remove them later on.

gaow commented 2 years ago

Sorry please allow me to clarify -- if someone rerun some command by accident, and the output files are complete, the built-in signature will ensure that the file is skipped. If the output files are corrupted they should be removed anyways. Is there a situation a corrupted file is still useful thus warrant renaming? I guess the answer is yes because corrupted result is better than no result -- is that your only reason for it?

BoPeng commented 2 years ago

is that your only reason for it?

Say I want to run the pipeline with a different parameter but then realize that it will take another 3 days and I cannot afford it, then the old output is already gone.

It is a rare scenario, but if it is rare, it might be ok to have some .df8dfa.tmp files around, maybe.

BoPeng commented 2 years ago

Note that .fadf8d.tmp is currently used when sos stops abnormally. For example

[20]
output: 'somefile'
sh:
    generate somefile
sh:
   this step goes wrong.

The old behavior was to remove somefile when the step fails, but somefile.fadf8d.tmp may help debug what goes wrong for its next step.

gaow commented 2 years ago

It is a rare scenario, but if it is rare, it might be ok to have some .df8dfa.tmp files around, maybe.

can we only do that when the md5 and file match (indicating no corruption) but scripts change? Or equally good is to remove the random temp after new results are generated. I want to say that i'm fine with occassional tmp files but definitely not cool if they are everywhere.

BoPeng commented 2 years ago

remove the random temp after new results are generated

I think this is a good idea. Basically removing somefile.dfd8fd.tmp after somefile is generated.

BoPeng commented 2 years ago

I gave up the idea since glob('filename.??????.tmp') is an expensive operation that is hard to tolerate with lots of output files.

I will merge the PR once the tests pass.

BoPeng commented 2 years ago

Clearing output before executing is more complicated than I expected. The reason is that we have

[a]
output: something
sos_run('b')

[b]
output: something

When this happens, we actually execute step b before a,. Removing something before executing a will remove the output generated by b.

I have reverted the clear-before-executing code to make the tests pass.

BoPeng commented 2 years ago

Leave the remaining issues to a separate ticket #1498.

BoPeng commented 2 years ago

I have released sos 0.23.0