Closed by gaow 2 years ago
Hold on, trying really hard to make CI work again now.
Tests are passing.
Why can't you do

```
[step]
output: whatever
sh:
    processing
sh: expand=True
    md5sum {_output} > {_output.with_suffix('.md5')}
```

The only case where you cannot do this is when `sh` is in `task`, I think.
Sure. It's just that we would have to do it for every workflow step, so I think it warrants a shortcut for this feature. For example, the `decompress` option of action `download` is such a shortcut.
But `bash` is an action, which in theory does not know `_output`. The only valid place is

```
[step]
output: whatever
```

and maybe something like

```
[step]
output: whatever, md5=True
```
Since we are essentially adding an option to `file_target.target_signature`, or calling something like `file_target.save_md5sum()`, it could be something like `-S md5` (an enhanced version of `-s` that saves md5 for all outputs, and potentially checks md5 for all inputs).
I see... Are you saying we can do

```
...
_output.save_md5sum()
```

at the end of a step? This is good enough, I think. We have something like `_input.zap()` that has a similar interface.
I tend to add `-S md5` to specify the "type" of signatures. It is somewhat ill-defined, since we cannot completely rely on md5 (we have other types of targets), but in the future we could have something like `-S redis://server` to use a dedicated signature service.
The advantage of this approach is that we can relatively easily do `sos run workflow -s build -S md5` to generate md5 after the workflow has been completed, and `sos run workflow -s assert -S md5` to check md5.
However, sos is already bloated with such features (`_input.zap()` is one of them), and I am hesitant about adding more features like this.
The patch is actually quite simple. When `-S md5` is specified, `file_target` will use the complete md5 signature for signature checking, and save a `.md5` file as an artifact. This applies to all `file_target` objects, and the usual `-s ignore` and `-s build` options still apply.
Thank you @BoPeng, I tested it. It works well!

```
gw@cordata:~/tmp/05-May-2022$ sos run test.sos -S md5
INFO: Running test:
INFO: test is completed.
INFO: test output: 2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is executed successfully with 1 completed step.
gw@cordata:~/tmp/05-May-2022$ sos run test.sos -S md5
INFO: Running test:
INFO: test (index=0) is ignored due to saved signature
INFO: test output: 2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is ignored with 1 ignored step.
gw@cordata:~/tmp/05-May-2022$ sos run test.sos
INFO: Running test:
INFO: test (index=0) is ignored due to saved signature
INFO: test output: 2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is ignored with 1 ignored step.
```

I like the behavior: when `-S md5` is dropped during a rerun, the partial signature is still recognized and execution is skipped.
> I like the behavior when -S md5 is dropped during a rerun the partial signature is still recognized and execution skipped.
This could be a bug if your signatures were built without `-S md5`. With `-S md5`, the signatures are built using the complete md5 instead of the partial md5 (which uses the first and last 500M of data), so validation without `-S md5` should fail. I could potentially use partial md5 for old signatures, but then I would need to calculate md5 twice.
This is a proof-of-concept implementation. We can fine-tune its behavior without future development in mind.
Hmm, I see. I was under the impression that when we put in `-S md5` we indeed calculate md5 twice, which is the safest and most reasonable behavior -- if you agree then we can finalize on that and close this issue without future development in mind? I tried to test it and thought that was the case, but now I realize my test above is incomplete!
Making `-S md5` an "add-on" feature that does not affect existing signatures is indeed the safest option. I suppose calculating md5 twice is OK, as long as we do not read the file twice.
You appeared to want to validate input files (and output files?) if the corresponding `.md5` exists (#1454); are you still interested?
@BoPeng your proposal in #1454 is hacky but efficient. I don't find it scary. Also, when `-S md5` is on there is no need to validate output files, because the absence of an md5sum file indicates a problem. I'm not sure about input files at the start of a pipeline, and I don't think we need to do anything about them. So with your proposal there plus this patch we should have two ways to solve #1454, and that's good enough for me, unless you have other proposals?
Currently the patch generates md5, but a missing `.md5` does not indicate a failed step (at least not intentionally), which is what #1454 wants. The complexity comes from having two sets of signatures:

- With `-s default -S md5`, a `.md5` file is generated for any `file_target`. This happens to all `file_target`s, including input files.
- With `-S md5`, we will generate md5, similar to `-s build -S md5`.

Now, #1454 talks about a case where SoS crashes with partial output. Normally sos would proceed (since there is no signature), but you would like to ...
> but missing .md5 does not indicate a failed step
I would argue that this is at least a very good proxy for a failed step, and enough to warrant concern when it happens?
> If there is no signature and existing md5

Good point. What is the behavior here? I suppose the partial md5 should be built as a result of a full md5 check, with a rerun otherwise?
If the output has no signature and no md5, remove it before starting.
Sorry, I thought that was already the behavior? For example, if a step appends to an output using `>>`, then it should start afresh anyway if it is not skipped. I think this is desired and not dangerous -- would you provide a use case where this is not a good idea?
> If the output has no signature and no md5. remove it before starting. Sorry i thought that is already the behavior?
You are right; your "sos crash" scenario confused me. I suppose when that happens, sos knows the file is corrupted and will remove it before starting, but users do not, so having `.md5` is an indication only to users?
Then let us do this (with `-S md5`).
> so having .md5 is an indication only to users?
Yes, although as you proposed, SoS can also leverage that information when available.
> I suppose when that happens, sos knows the file is corrupted and will remove it before starting, but users do not

Right, SoS would know this only because it does not have a signature under `~/.sos`.
> Then let us do this (with -S md5)
Perfect!
Two problems:

1. When implementing getting both partial and full md5 with a single read, I identified a bug in the previous partial MD5 calculation. I fixed the bug in this PR, but all old signatures will now be wrong (`sos run` will rerun, unless you use `-s build`).
2. I cannot validate existing md5 all the time, because that would cause md5 to be calculated every time files are accessed. I changed the behavior to only generate md5 if the md5 file does not exist. However, existing md5 files will also be used for validation (e.g. when bypassing a step).
> I fixed a bug in this PR but then all old signatures will be wrong
Oh, that's a bit unfortunate, but it's good you caught that! I cannot speak for others, but our consortium is starting to run my pipeline this month and I think people are just starting. If we can make a release some time soon I'll ask people to update. My own group is small / flexible -- I'll just ask them to run with `-s build`.
> However, existing md5 files will also be used for validating (e.g. when bypassing a step).
And when that fails, the output and existing md5 will be removed anyway prior to generating the new output and md5, right? Then I don't see a problem.
> Oh that's a bit unfortunate but it's good you caught that!
The bug caused the last 1M of data not to be counted towards the hash (so 15M was used instead of 16M). It was an arbitrarily defined partial MD5, so it was never a problem. I could match the new implementation with the old one, but it would be a bit weird that way.
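For illustration only (this is a hypothetical reconstruction, not the actual SoS code), a chunked hashing loop that stops on the first short read silently drops the final chunk, which is exactly the kind of off-by-one-chunk bug described:

```python
import hashlib

CHUNK = 4  # unrealistically small chunk size, to make the bug visible

def buggy_digest(data):
    """Stops as soon as a short chunk is seen, so the final partial
    chunk never reaches the hasher (last bytes silently dropped)."""
    h = hashlib.md5()
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        if len(piece) < CHUNK:
            break                 # bug: short final chunk is discarded
        h.update(piece)
    return h.hexdigest()

def fixed_digest(data):
    """Hashes every byte, including the short final chunk."""
    h = hashlib.md5()
    for i in range(0, len(data), CHUNK):
        h.update(data[i:i + CHUNK])
    return h.hexdigest()
```

The two functions agree only when the input length is an exact multiple of the chunk size, which is why such a bug can go unnoticed for a long time.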
> when that fails the output and existing md5 will be removed anyways priorto generating the new output and md5 right?
Currently, this is technically not a problem, but if sos crashes in between, we may have a corrupted or missing output file with an old md5 file, so it is better to clear `.md5` when we start a step.
> I can match the new implementation with the old one but it is a bit weird that way.
Again, I cannot speak for the others, but the new change is fine by me and I can inform the SoS/DSC users I know of. I'll leave that for you to decide...
> so it is better to clear .md5 when we start a step.
Great -- with that I don't see other potential problems.
> Great -- with that i dont see other potential problems.
Removing existing output files can be dangerous; can we change it to renaming to `FILENAME.random_hash`? If someone reruns some command by accident, at least the old output files are still there and can be restored. It can be a mess to have these `.fa89erf.tmp` files around, but it should not be too difficult to remove them later on.
Sorry, please allow me to clarify -- if someone reruns some command by accident, and the output files are complete, the built-in signature will ensure that the step is skipped. If the output files are corrupted, they should be removed anyway. Is there a situation in which a corrupted file is still useful and thus warrants renaming? I guess the answer is yes, because a corrupted result is better than no result -- is that your only reason for it?
> is that your only reason for it?
Say I want to run the pipeline with a different parameter, but then realize that it will take another 3 days and I cannot afford it; by then the old output is already gone. It is a rare scenario, but if it is rare, it might be OK to have some `.df8dfa.tmp` files around, maybe.
Note that `.fadf8d.tmp` is currently used when sos stops abnormally. For example, with

```
[20]
output: 'somefile'
sh:
    generate somefile
sh:
    this step goes wrong.
```

the old behavior was to remove `somefile` when the step fails, but `somefile.fadf8d.tmp` may help debug what went wrong in its next step.
> It is a rare scenario, but if it is rare, it might be ok to have some .df8dfa.tmp files around, maybe.
Can we only do that when the md5 and file match (indicating no corruption) but the scripts change? Equally good would be to remove the random temp file after new results are generated. I want to say that I'm fine with occasional tmp files, but it's definitely not cool if they are everywhere.
> remove the random temp after new results are generated
I think this is a good idea. Basically, remove `somefile.dfd8fd.tmp` after `somefile` is generated.
I gave up on the idea, since `glob('filename.??????.tmp')` is an expensive operation that is hard to tolerate with lots of output files.
I will merge the PR once the tests pass.
Clearing output before executing is more complicated than I expected. The reason is that we can have

```
[a]
output: something
sos_run('b')

[b]
output: something
```

When this happens, we actually execute step `b` before `a`. Removing `something` before executing `a` will remove the output generated by `b`.
I have reverted the clear-before-executing code to make the tests pass.
Leave the remaining issues to a separate ticket #1498.
I have released sos 0.23.0.
@BoPeng I'm developing pipelines for a consortium where we want to distribute the outputs from SoS runs. We believe it is best to also distribute the MD5SUM associated with the outputs. I wonder if we can make SoS do it. The proposed interface is on the action level and not as a parameter. For example:
That'd be very helpful for our project and probably for others too. What do you think?