stephenslab / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License

Slow signature checking / caching at the end of each module #189

Closed gaow closed 5 years ago

gaow commented 5 years ago

This is a separate issue from #171. The problem is that while each module runs, it takes time to verify or rebuild signatures (with --touch), and this can be quite noticeable when there are thousands of jobs. Also, at the end of each group of modules, all actual output files have to be cached, which alone takes additional time. Added up over a big benchmark, there can be hours of overhead. This is fine when a benchmark is first executed, because the actual computation will take much longer. But it is not good when only one or two files in the benchmark have to be regenerated, yet to identify them we need to go over tens of thousands of files.

@jean997 This is a ticket for your complaint about overhead, now that I have your benchmark example to understand the scale of the problem. I need to work on it. Sorry for the inconvenience!

pcarbo commented 5 years ago

@gaow My understanding is that to determine whether an output file has changed, the output file has to be read. There doesn't seem to be any way to avoid reading the file. Is my understanding correct? If so, then the only way to improve the situation would be to speed up file reading/signature calculation.

gaow commented 5 years ago

If so, then the only way to improve the situation would be to speed up file reading/signature calculation.

Yes, even though we already read only part of the file to compute the signature. Another option is to use time stamps instead of signatures, but that would be a last resort. There are also things that can be done to improve the communication between processes, which may save some of the overhead. Need to profile and decide.
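For readers following along, here is a minimal sketch of what a "partial read" signature could look like. It is an illustration only, not the code DSC or SoS uses: the function name `partial_signature`, the 64 KiB chunk size, and the choice of MD5 over the file size plus the first and last chunk are all assumptions.

```python
import hashlib
import os


def partial_signature(path, chunk_size=65536):
    """Hash the file size plus the first and last chunk_size bytes.

    Hypothetical helper: cheaper than hashing the whole file, but it
    cannot detect a change confined to the middle of a large file that
    leaves the size unchanged.
    """
    size = os.path.getsize(path)
    digest = hashlib.md5()
    digest.update(str(size).encode("utf-8"))
    with open(path, "rb") as handle:
        # Head of the file.
        digest.update(handle.read(chunk_size))
        if size > chunk_size:
            # Tail of the file, without re-reading bytes from the head.
            handle.seek(max(size - chunk_size, chunk_size))
            digest.update(handle.read(chunk_size))
    return digest.hexdigest()
```

The trade-off is the one discussed above: reading less makes the check faster, but the file still has to be opened and partly read, which a time-stamp check avoids entirely.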

gaow commented 5 years ago

It turns out that computing the signature takes only a fraction of the time; retrieving and caching the signatures is what is time consuming. Caching is particularly slow, which relates to my earlier comments on the lack of concurrency in modifying the SQLite database (which we use to keep signatures). We will introduce a lighter way, based on time stamps, to skip the retrieve+cache step at the user's discretion, but only at the rerun step. The main mechanism will remain signature based, which is robust, and its cost will only be incurred the first time it runs, where it is trivial compared to the actual computations.
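As a rough illustration of why the caching step can dominate, the sketch below writes a whole batch of signatures to SQLite in a single transaction rather than committing once per file. This is a generic mitigation, not a description of how DSC or SoS stores signatures; the table layout and the function names `cache_signatures` and `load_signatures` are hypothetical.

```python
import sqlite3


def cache_signatures(db_path, signatures):
    """Store a {path: signature} dict using one transaction for the batch."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits once on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS signatures "
                "(path TEXT PRIMARY KEY, signature TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO signatures VALUES (?, ?)",
                signatures.items(),
            )
    finally:
        conn.close()


def load_signatures(db_path):
    """Return all cached signatures as a {path: signature} dict."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT path, signature FROM signatures"))
    finally:
        conn.close()
```

Because SQLite allows only one writer at a time, many processes each committing a single row tend to queue up behind one another; collecting the rows and writing them together is one common way to reduce that contention.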

pcarbo commented 5 years ago

Nice find!

gaow commented 5 years ago

@jean997 Per our discussion I implemented a new option, -s sloppy. It provides a sloppy way to skip existing files based on timestamps only; this is what Make and Snakemake do, and it is a lot faster. To use it, you need to check out and install the current master of DSC and SoS. Basically, each time you rerun, you use -s sloppy instead of --touch.
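The timestamp rule that Make-style tools apply can be sketched as follows. This is only an illustration of the general idea, assuming a hypothetical helper `is_up_to_date`; it is not DSC's implementation of -s sloppy.

```python
import os


def is_up_to_date(outputs, inputs):
    """Return True if every output exists and none is older than any input."""
    if not outputs or not all(os.path.exists(p) for p in outputs):
        return False
    if not inputs:
        # Nothing upstream to compare against: existing outputs are kept.
        return True
    newest_input = max(os.path.getmtime(p) for p in inputs)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return oldest_output >= newest_input
```

A check like this only needs file metadata, so no file contents are read and no signature database is touched, which is where the speed-up comes from. The cost is that it cannot tell whether a file's contents actually changed.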

pcarbo commented 5 years ago

@gaow Let's talk about this on Monday. I wonder if we can build on your "sloppy" solution to provide a more coherent approach that works well generally.

gaow commented 5 years ago

@pcarbo I am closing this ticket because I believe this issue is now properly taken care of. In short, the hybrid signature we discussed in person was a good idea, so the default mode is now a hybrid, but a safer one than what we discussed (and thus comfortably made the default). The sloppy mode is therefore more useful when scripts are changed but users want to avoid a rerun. It is no longer very relevant to performance, because the new default is a lot faster.

pcarbo commented 5 years ago

Great news!

The sloppy mode is therefore more useful when scripts are changed but users want to avoid rerun.

You mean they would use --sloppy --touch?

gaow commented 5 years ago

Well, this is a good question: the behavior of --sloppy + --touch ... let me think about it. Currently --touch will override --sloppy, but one can imagine some "sloppy-touch" ... In any case that's not a performance issue, and if we discuss it we can use another ticket.

pcarbo commented 5 years ago

Then what combination of options would you recommend I use to avoid re-running when scripts have changed?

gaow commented 5 years ago

Then what combination of options would you recommend I use to avoid re-running when scripts have changed?

-s sloppy to temporarily bypass it. --touch to make it permanent. But we'll need some "sloppy touch" to handle this in the most efficient way. Usually, though, when scripts are changed we should rerun, unless the changes are constantly just to line breaks and white space. (Adding a comment string is not a problem.)
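To make the white-space point concrete, here is a hypothetical sketch of a script signature that ignores changes in line breaks and spacing. It is not how DSC computes script signatures; the function name is an assumption, and collapsing all white space would be unsafe for languages such as Python where indentation carries meaning.

```python
import hashlib


def whitespace_insensitive_signature(script_text):
    """Hash a script after collapsing every run of white space to one space."""
    # str.split() with no arguments splits on any white space, including
    # newlines, and drops empty strings.
    tokens = script_text.split()
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()
```

With a signature like this, re-indenting or re-wrapping a script would not trigger a rerun, while any change to a token would.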

pcarbo commented 5 years ago

-s sloppy to temporarily bypass it.

But if the script has changed, then the time stamps for the module output files will be older than the scripts, so the modules will be re-run according to the help:

"sloppy": skips modules whose timestamp are newer than their upstream modules.

So "sloppy" on its own is not sufficient.

gaow commented 5 years ago

@pcarbo Are you on the latest dsc? -s sloppy will not check script status. It will only check output and input file time stamps. This is what Snakemake and GNU Make do. I think I made that clear in the latest dsc interface.

pcarbo commented 5 years ago

The help for neither the "default" nor the "sloppy" option mentions anything about whether or not changes to scripts are checked (I have the latest version installed from master):

"default": skips modules whose status have not been changed since previous execution. "sloppy": skips modules whose output timestamp are newer than their upstream modules output. "none": executes DSC from scratch.

Am I looking in the right place?

gaow commented 5 years ago

What I get is:

  "sloppy": skips modules whose output timestamp are
                        newer than their upstream modules output... "sloppy" mode
                        performs faster check. It is also useful to avoild
                        rerun when module scripts are changed but the module
                        outputs are supposed to remain the same. "all" is
                        useful to recover meta-databases without running
                        anything. (default: default)

pcarbo commented 5 years ago

I re-installed again, and I see that text in dsc --help now.