vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License

How to validate output file integrity #1454

Closed gaow closed 2 years ago

gaow commented 2 years ago

When a step fails, SoS removes partially completed output to prevent corrupted output from being used. However, this cleanup does not happen when, e.g., SoS crashes or the job runs out of walltime, leaving corrupted outputs behind. Is there an easy way to verify the file signature of a given output to ensure that it was generated successfully?

BoPeng commented 2 years ago

When I come across a problem like this, I add a shell command to the step that removes the output directory, so the output files remain if signature checking passes and are removed otherwise.

If this looks scary, you can move the complete output to another location after the script completes and set that location as the true _output. This way, only complete output is ever generated there. This is essentially the "sandbox" method, but there is no systematic support for it (I think our template feature can be used here).
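As an illustration of the "sandbox" idea outside of SoS, here is a minimal Python sketch. It is not an SoS feature: `write_via_sandbox` and the `produce` callable are hypothetical names, and the sketch assumes the scratch directory and the final location are on the same filesystem so that the final rename is atomic.

```python
import os
import shutil
import tempfile
from pathlib import Path


def write_via_sandbox(final_path, produce):
    """Build the output in a scratch directory, then move it into place.

    `produce` is a hypothetical callable that writes the complete output
    to the path it receives. If it raises, `final_path` is never created.
    """
    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the scratch directory next to the destination so the rename
    # below stays on one filesystem.
    sandbox = Path(tempfile.mkdtemp(prefix=".sandbox-", dir=final_path.parent))
    try:
        tmp_path = sandbox / final_path.name
        produce(tmp_path)
        # Only a fully written file ever appears under the final name.
        os.replace(tmp_path, final_path)
    finally:
        # Remove the scratch directory whether or not produce() succeeded.
        shutil.rmtree(sandbox, ignore_errors=True)
    return final_path
```

If the process is killed outright, the `finally` block does not run and a leftover `.sandbox-*` directory may remain, but the final path still only ever holds a complete file.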

gaow commented 2 years ago

Thanks @BoPeng! Another idea would be an option to explicitly generate an md5sum for each output file once it is complete. This may be a desirable feature anyway: I have recently been sharing some analysis results with colleagues, and they asked me to include md5sums so they can check the integrity of the files I share.
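A minimal sketch of what this could look like when done by hand in Python, not an existing SoS option. The function names and the `.md5` sidecar naming are assumptions for illustration; the sidecar uses the same `<hash>  <name>` layout that `md5sum` prints.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_sidecar(path):
    """Write `<name>.md5` next to a completed output file (hypothetical helper)."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".md5")
    sidecar.write_text(f"{md5_of(path)}  {path.name}\n")
    return sidecar


def verify_md5_sidecar(path):
    """Return True only if the file exists and matches its recorded MD5."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".md5")
    if not (path.exists() and sidecar.exists()):
        return False
    parts = sidecar.read_text().split()
    return bool(parts) and parts[0] == md5_of(path)
```

Writing the sidecar only after the output is finished means a missing or mismatched `.md5` file marks the output as suspect, and the same file can be handed to colleagues along with the data.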

BoPeng commented 2 years ago

So this would be like a sub-feature of -s? Currently we use time + file size + hash as the file signature, and I believe we skip the hash in some cases for performance reasons. If we also generate md5 sums, we will be generating two sets of hashes.
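To make the time + file size + hash idea concrete, here is a hedged Python sketch. It is not SoS's actual signature code: `FileSignature`, `file_signature`, and `matches` are hypothetical names, and using MD5 as the hash is an assumption made for the example.

```python
import hashlib
import os
from typing import NamedTuple, Optional


class FileSignature(NamedTuple):
    mtime: float
    size: int
    md5: Optional[str]  # None when hashing was skipped for speed


def file_signature(path, with_hash=True):
    """Record modification time, size, and optionally a content hash."""
    stat = os.stat(path)
    digest = None
    if with_hash:
        md5 = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                md5.update(chunk)
        digest = md5.hexdigest()
    return FileSignature(stat.st_mtime, stat.st_size, digest)


def matches(path, saved):
    """Compare a file against a saved signature, cheapest checks first."""
    if not os.path.exists(path):
        return False
    stat = os.stat(path)
    if stat.st_size != saved.size:
        return False
    if saved.md5 is None:
        # No hash recorded: fall back to the modification time.
        return stat.st_mtime == saved.mtime
    # A recorded hash is treated as decisive in this sketch.
    return file_signature(path).md5 == saved.md5
```

In a scheme like this, a hash stored in the signature could in principle be reused as the shareable checksum, which would avoid computing two sets of hashes.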