stephens999 / dscr

Dynamic statistical comparisons in R

add reset scores, reset parser #43

Open stephens999 opened 9 years ago

stephens999 commented 9 years ago

Analogous to reset_dsc(method, scenario), we need a way to reset scores and parsers.
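A hypothetical interface, mirroring the existing reset_dsc call (nothing here is implemented yet; names are illustrative):

```r
# Hypothetical API sketch, analogous to reset_dsc(method, scenario):
reset_score("score_name")    # clear the cached results of one score
reset_parser("parser_name")  # clear the cached output of one parser
                             # (and any scores computed from it)
```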

ramanshah commented 9 years ago

We'll need to balance flexibility and usability/correctness, though: every step at which a software system caches state makes it harder to think about, harder to document, and more vulnerable to errors. Errors can crop up both on our side (bugs in the software) and on the user side (science mistakes in the dscr project based on forgetting when something was updated or cleared). My experience with users, both the data scientists and consultants in industry and the students here, is that they use organizational software as a replacement for, not a supplement to, scientific record-keeping. I worry about a situation where work becomes even less reproducible and more error-prone than it would be without the tools.

My experiences with maintaining caches have been almost universally negative. They have always involved months of problematic bugs that led to unhappy users and occasionally wrong results. Obviously some caching is needed (and is the point of dscr - to store results of expensive computation). But where it is not overwhelmingly needed, always recomputing seems best to me, even if it carries a noticeable penalty in computation.

My vote is to keep the pipeline as bone simple as humanly possible. Happy to discuss, though.

ramanshah commented 9 years ago

Here's another way to look at it: if we want to be able to reset the dsc at different points of a deep pipeline, dscr needs to represent a tree of dependencies for the workflow. If the user clears a node on that graph, dscr needs to mark all of that node's descendants as dirty as well and prepare to recompute them.
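A minimal sketch of that propagation, assuming a simple parent-to-children edge list (all names here are hypothetical, not dscr API):

```r
# Hypothetical dependency graph: each node maps to its direct children.
deps <- list(
  datamaker = "method",
  method    = "parser",
  parser    = "score",
  score     = character(0)
)

# Collect a node and all of its descendants: everything that must be
# marked dirty and recomputed when the node is cleared.
dirty_set <- function(node, deps) {
  c(node, unlist(lapply(deps[[node]], dirty_set, deps = deps)))
}

dirty_set("method", deps)
#> [1] "method" "parser" "score"
```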

This still doesn't protect the user from changing arbitrary code at an arbitrary step in the computation and forgetting to clear the corresponding parts of the dsc. The very best students/users will screw this up 10-20% of the time, when they're tired or under pressure; the weakest will screw this up basically 100% of the time. These kinds of errors can be extremely pernicious because of people's tendency to think of software as an infallible black box. They'll change some code without clearing the cache...and the results in a figure won't actually reflect what the newest version of the code was written to do...and this may only become clear with some exacting forensic work after the paper is published and a reader has a question. A truly safe dsc would pick up any change in the code and automatically mark all of the cached results that depend on that code as dirty. That's a very different (and much harder) project; it has been done in slightly different environments (e.g., http://pegasus.isi.edu/), but it requires years of effort by a large team of software people to get correct.

In the absence of such an effort, I hope that we can keep the hierarchy of cached objects as shallow and transparent as possible.

ramanshah commented 9 years ago

@mstephens I had an idea for how to offload most of the heavy work of making a "safe" dscr with a deep pipeline of parsers, scorers, etc.: steal it from git. One could restrict run_dsc to work only against a clean git commit. Figuring out which parts of the dependency graph are contaminated could then be done from the SHA-1 hashes of the objects in the git working tree.

The overall idea is to stop using manual reset verbs, which put the burden of tracking object freshness on the user, and instead automate the process, perhaps as part of the run_dsc call itself. This would provide a strong guarantee, after a run_dsc call, that all the R objects in the dsc filesystem reflect the most recent version of the code.
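A rough sketch of the gating step, assuming run_dsc shells out to git (the helper below is hypothetical):

```r
# Refuse to run against a dirty working tree, so that every cached
# result can be traced back to a specific commit.
assert_clean_tree <- function() {
  status <- system2("git", c("status", "--porcelain"), stdout = TRUE)
  if (length(status) > 0)
    stop("Working tree has uncommitted changes; commit them before run_dsc().")
  # Return the commit that the results about to be cached belong to.
  system2("git", c("rev-parse", "HEAD"), stdout = TRUE)
}
```

The contamination check would then compare the blob hashes recorded at the last run (e.g., from `git ls-tree -r HEAD`) against the current ones and mark everything downstream of a changed file as dirty.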

The biggest drawback is that this is paranoid: in the worst case, if the user changed one character in a comment line in a monolithic one-size-fits-all datamaker, this method would mark the whole project as dirty and want to recompute everything. Efficient dsc use would then require users to break down datamakers/methods/scenarios/etc into smaller pieces, ideally one function per file.

My own tastes skew toward rigorous, automated cache correctness even at the expense of paranoia, at least when building tools that less sophisticated users will be touching. But I acknowledge that this is a matter of taste and also that it would be an adjustment for students.

What do you think?

stephens999 commented 9 years ago

@road2stat made a related suggestion of using hashes (although I don't recall the details now), so I'm copying him in. Based on his suggestion I'd been thinking (rather vaguely) about a system that would save a hash of the datamaker function and the method function when saving results, rather than making use of git machinery. I'm not 100% clear on how your suggestion would work, but the overall idea is certainly something I'm interested in discussing further. Would your plan require committing all the created files to git too? Long term we probably need to do that anyway, but up to now we've avoided it.

One thing I'm not clear on: suppose in run_dsc.R someone changed the line addMethod("mymethod", fn=function1) to addMethod("mymethod", fn=function2) but didn't change function1 or function2 themselves. The git hashes of function1 and function2 wouldn't change, but the actual method would. For this reason it seemed it might be better to work with hashes of R objects rather than of files.

Definitely interested in discussing further.

stephens999 commented 9 years ago

This is the kind of thing I had in mind: http://cran.r-project.org/web/packages/digest/index.html
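For instance, hashing the function objects themselves, rather than the files they live in, would catch the addMethod swap above, because the hash is computed from whatever closure is actually registered:

```r
library(digest)

function1 <- function(x) mean(x)
function2 <- function(x) median(x)

# Hash the R object bound at registration time, not the source file.
digest(function1) == digest(function2)
#> FALSE: swapping fn = function1 for fn = function2 changes the stored
#> hash even though neither function's source file changed.
```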

ramanshah commented 9 years ago

That's a good point - proving that a combination of scenario/method/parsers is unchanged would require a deeper introspection of the code than I thought. However, dscr is probably going to need to understand a lot about the dependencies in the code to be able to execute safe/efficient parallel computations. For the parallel implementation, one could certainly re-run the whole workflow from datamaker to output parser for each desired combination, but that's a lot of duplicated work. To do anything more efficient will require some kind of dependency management and some way to block calculations that are contingent on unfinished phases of the workflow. @mengyin has been doing naive parallelization of her dsc (I believe this involves qsub-ing one scenario/method combination per job) and has experienced sporadic corruption issues; I think this is due to race conditions from a lack of dependency management.

stephens999 commented 9 years ago

OK, but I think this kind of dependency is reasonably easy to manage: i) run all scenarios; ii) when complete, run all methods; iii) when complete, run all parsers; iv) when complete, run all scores.

It isn't the most efficient approach, but it also won't be too inefficient in most cases, and it is simple. Within each of i)-iv) nothing depends on anything else, with the possible exception of parsers, where in principle parser1 could turn A into B and parser2 could turn B into C, etc. (behaviour I guess we could disallow if that was necessary).
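A sketch of that staging with hypothetical per-stage runners, where each stage is a barrier and can be parallelized internally (which would also avoid the race conditions mentioned above):

```r
library(parallel)

# Hypothetical staged runner: run_scenario, run_method, run_parser and
# run_score are placeholders. mclapply parallelizes within a stage,
# and each stage finishes before the next begins, so no stage reads
# unfinished upstream output.
run_dsc_staged <- function(dsc, cores = 4) {
  mclapply(dsc$scenarios, run_scenario, mc.cores = cores)   # i)   all scenarios
  mclapply(dsc$method_jobs, run_method, mc.cores = cores)   # ii)  all scenario/method pairs
  mclapply(dsc$parsers, run_parser, mc.cores = cores)       # iii) all parsers
  mclapply(dsc$scores, run_score, mc.cores = cores)         # iv)  all scores
}
```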