Closed gaow closed 5 years ago
@gaow My understanding is that to determine whether an output file has changed, the output file has to be read. There doesn't seem to be any way to avoid reading the file. Is my understanding correct? If so, then the only way to improve the situation would be to speed up file reading/signature calculation.
If so, then the only way to improve the situation would be to speed up file reading/signature calculation.
Yes, even though we are already reading only part of the file to compute the signature. Another option is to use time stamps instead of signatures, but that would be a last resort. There are also things that can be done to improve the communication between processes, which may save some of the overhead. Need to profile and decide.
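For readers unfamiliar with the idea of computing a signature from part of a file, here is a minimal sketch. It is not DSC's or SoS's actual implementation: the function name `partial_signature`, the use of MD5, and the choice of hashing the file size plus the first and last chunk are all assumptions made for illustration.

```python
import hashlib
import os


def partial_signature(path, chunk_size=1 << 20):
    """Hypothetical helper: hash the file size plus its first and last chunk.

    Reading only the head and tail keeps the cost bounded for large
    files, at the price of missing edits confined to the middle.
    """
    size = os.path.getsize(path)
    digest = hashlib.md5()
    digest.update(str(size).encode())
    with open(path, "rb") as fh:
        digest.update(fh.read(chunk_size))
        if size > chunk_size:
            # Jump to the tail, without re-reading bytes already hashed.
            fh.seek(max(size - chunk_size, chunk_size))
            digest.update(fh.read(chunk_size))
    return digest.hexdigest()
```

Even with a bounded read like this, every file still has to be opened once per check, which is why time stamps are cheaper still.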
It turns out that computing the signatures takes only a fraction of the time; retrieving and caching them is what is time consuming. Caching is particularly slow, which relates to my earlier comments on the lack of concurrency when modifying the sqlite database (which we use to keep signatures). We will introduce a lighter way, based on time stamps, to skip the retrieve+cache step at the user's discretion, but only when rerunning. The main mechanism will remain signature-based, which is robust, and its cost is only incurred the first time a benchmark runs, where it is trivial compared to the actual computations.
Nice find!
@jean997 per our discussion I implemented a new option -s sloppy. It allows for a sloppy way to skip existing files based on timestamps only; this is what Make and Snakemake do, and it is a lot faster. To use this, you need to check out and install the current master of DSC and SoS. Basically, each time you rerun it you use -s sloppy instead of --touch.
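As an illustration of the Make-style check described above, here is a minimal sketch. It is not DSC's code: the function name `can_skip` and its arguments are hypothetical, and the only thing it compares is file modification times.

```python
import os


def can_skip(outputs, inputs):
    """Hypothetical check: True if every output exists and is newer
    than every input, judged by modification time alone."""
    if not outputs or not all(os.path.exists(p) for p in outputs):
        return False
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    newest_input = max(
        (os.path.getmtime(p) for p in inputs), default=float("-inf")
    )
    return oldest_output > newest_input
```

No file content is read at all, which is where the speed comes from. The trade-off is that a file whose content changed while its time stamp stayed older than its outputs would go unnoticed.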
@gaow Let's talk about this on Monday. I wonder if we can build on your "sloppy" solution to provide a more coherent approach that works well generally.
@pcarbo I am closing this ticket because I believe the issue has now been properly taken care of. In short, the hybrid signature we discussed in person was a good idea, so the default mode is now a hybrid, but a safer one than what we discussed (and thus comfortable to make the default). The sloppy mode is therefore more useful when scripts are changed but users want to avoid rerun. It is no longer very relevant to performance because the new default is a lot faster.
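The thread does not spell out how the hybrid signature works, so the following is only one plausible reading, not DSC's implementation: trust a cheap metadata comparison first and fall back to a content hash only when the metadata differs. The record layout and the names `make_record` and `is_unchanged` are assumptions for illustration.

```python
import hashlib
import os


def file_hash(path):
    """Full-content MD5, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def make_record(path):
    """Hypothetical record saved after a successful run."""
    st = os.stat(path)
    return {"mtime": st.st_mtime, "size": st.st_size, "hash": file_hash(path)}


def is_unchanged(path, record):
    """Fast path on (mtime, size); slow path on the content hash."""
    st = os.stat(path)
    if st.st_mtime == record["mtime"] and st.st_size == record["size"]:
        return True  # metadata matches: skip reading the file
    return file_hash(path) == record["hash"]
```

Under this reading, a rerun over mostly untouched files costs one `stat` call per file, and content is read only for files whose metadata changed.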
Great news!
The sloppy mode is therefore more useful when scripts are changed but users want to avoid rerun.
You mean they would use --sloppy --touch?
Well, this is a good question: the behavior of --sloppy + --touch ... let me think about it. Currently --touch will override --sloppy, but one can imagine some "sloppy-touch" ... in any case that's not a performance issue, and if we discuss it we can use another ticket.
Then what combination of options would you recommend I use to avoid re-running when scripts have changed?
Then what combination of options would you recommend I use to avoid re-running when scripts have changed?
-s sloppy to temporarily bypass it. --touch to make it permanent. But we'll need some sloppy touch to handle this in the most efficient way. Usually when scripts are changed we should rerun, though, unless there are constant changes to line breaks and white space. (Adding comment strings is not a problem.)
-s sloppy to temporarily bypass it.
But if the script has changed, then the time stamps for the module output files will be older than the scripts, so the modules will be re-run according to the help:
"sloppy": skips modules whose timestamp are newer than their upstream modules.
So "sloppy" on its own is not sufficient.
@pcarbo are you on the latest dsc? -s sloppy will not check script status. It will only check output and input file time stamps. This is what Snakemake and GNU Make do. I think I made that clear in the latest dsc interface.
Neither the help for the "default" option nor the help for the "sloppy" option mentions anything about whether or not changes to scripts are checked (I have the latest version installed from master):
"default": skips modules whose status have not been changed since previous execution. "sloppy": skips modules whose output timestamp are newer than their upstream modules output. "none": executes DSC from scratch.
Am I looking in the right place?
What I get is:
"sloppy": skips modules whose output timestamp are
newer than their upstream modules output... "sloppy" mode
performs faster check. It is also useful to avoild
rerun when module scripts are changed but the module
outputs are supposed to remain the same. "all" is
useful to recover meta-databases without running
anything. (default: default)
I re-installed again, and I see that text in dsc --help now.
This is a separate issue from #171. The problem is that while running each module it takes time to verify / rebuild signatures (with --touch), and this can be quite noticeable when there are thousands of jobs. Also, at the end of each group of modules all actual output files have to be cached, which alone takes additional time. Added up over a big benchmark, there can be hours of overhead. This is fine when a benchmark is first executed, because the actual computation will take much longer. But it is not good when maybe only one or two files in the benchmark have to be re-generated, yet in order to identify them we need to go over tens (or dozens?) of thousands of files.
@jean997 This is a ticket for your complaint about overhead, now that I have your benchmark example to understand the scale of the problem. Need to work on it. Sorry for the inconvenience!