stephens999 / dscr

Dynamic statistical comparisons in R

Spirit of dscr #59

Closed: ramanshah closed this issue 9 years ago

ramanshah commented 9 years ago

Talking today with @xiangzhu and bringing his dsc up to date was illuminating. Like most others who use dscr in practice, he "cheats" dscr by having his method wrappers point to hand-prepared datasets already pre-processed to drop into the various methods he wants to test. His input objects are just fragments of filenames, and his wrapper functions simply assemble an input filename from the input fragment and the name of the method to be tested, then load the hand-prepared data residing in that file. This practice struck me as kind of smelly, and our discussion clarified my thoughts on why we claim that dscr is a tool for making research more reproducible. It boils down to users obeying and implementing a single kind of interface:

  1. Each scenario emits an actual dataset, not a pointer to a dataset, in a standardized format.
  2. Each method, via a "wrapper," ingests an actual dataset in a standardized format and prepares it for the idiosyncratic needs of a specific method.

To me, the spirit of dscr is that each method gets handed a bitwise-identical copy of actual data to process, and that a scientist new to a project can audit for preprocessing errors, configuration choices, etc., by tracing the execution path preceding the run_dsc verb. I've yet to see a student do a real scientific application with dscr that implements the above interface with rigor. Worse, the dark "off-the-books" portion of the benchmarking study is usually convoluted and almost never documented. My feeling is that mixing undocumented data preparation with dscr results in something even less reproducible than performing a benchmarking study by hand with moderately good lab notes describing the process.
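To make the contrast concrete, here's a rough sketch of the two styles. This is not the exact dscr API: `run_method`, `preprocess`, and `as_method_format` are hypothetical placeholders, and I'm assuming the usual datamaker convention of returning a list with `meta` and `input`.

```r
# "Cheating" wrapper: the input object is just a filename fragment, and the
# wrapper loads hand-prepared data that lives outside the dsc.
cheating_wrapper <- function(input, args) {
  dat <- readRDS(file.path("prepared",
                           paste0(input$stem, "_", args$method, ".rds")))
  run_method(dat)  # hypothetical method call
}

# Interface-respecting version: the scenario's datamaker emits the actual
# dataset, so all preprocessing sits on the execution path of run_dsc ...
make_scenario_data <- function(args) {
  raw   <- readRDS(args$raw_path)   # raw source is named inside the dsc itself
  input <- preprocess(raw)          # preprocessing is auditable here
  list(meta = list(truth = raw$truth), input = input)
}

# ... and the method wrapper only adapts that standardized input to the
# idiosyncratic format one method expects.
method_wrapper <- function(input, args) {
  run_method(as_method_format(input))
}
```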

Thoughts? Suggestions for an application with which we could convey the above, perhaps in a third vignette?

xiangzhu commented 9 years ago

@ramanshah @stephens999

Since the problem is well-defined and the datasets are relatively small, I can easily modify my dsc to satisfy the two rules above. I will send you the modified dsc sometime next week.

Would it be interesting to compare these two ways of implementing a dsc for specific research problems? The "cheating" way I am using now works. Perhaps I could create some examples to illustrate the potential harm of not obeying the two rules above? Just let me know.

ramanshah commented 9 years ago

I think just having a full dsc that carefully implements this interface for a realistic benchmarking study would be a major help. We'd be able to share your dsc as an exemplar of how to use dscr to organize one's code and data.

Thanks much for your work on this, Xiang! Sorry if I picked on you too much in the above - what you've done in your dsc so far is just the same as what the other dscr users have done.

stephens999 commented 9 years ago

To clarify, I don't see dscr as a tool for making research more reproducible. It is for making research more extensible. If it also helps with reproducibility, that is great, but that is not the main emphasis.

And I'm fine with methods taking a pointer to a dataset (filename) rather than the dataset itself, and envisaged that usage when it was designed.

But I agree with the concern that "the dark 'off-the-books' portion of the benchmarking study is usually convoluted and almost never documented."

Regarding "My feeling is that mixing undocumented data-preparation with dscr results in something that is even less reproducible than performing a benchmarking study by hand but with a moderate quality of lab notes describing the process." - maybe. But it's the documentation that is the key in both cases. So I'd agree that "dscr does not obviate the need to document your work".


ramanshah commented 9 years ago

OK, I see.

ramanshah commented 9 years ago

One thought I had over the weekend about priorities: if strict invariants about reproducible data integrity are not going to be enforced, then parallelization is not important for the viability of dscr. One can simply run and score the computationally costly methods by hand with traditional cluster-based job submissions, put in stubs for those methods, and then point one's dsc to the scores after the fact. dscr would still often help with organizing the presumably larger number of computationally trivial methods, as well as with sharing scores for collaborative benchmarking. Any kind of parallel implementation, including one that goes through BatchJobs, carries a big cost in complexity and portability, so it would be a big plus, in a way, if we could keep dscr serial.

We could in fact write a vignette presenting a tutorial on how to patch an external, manually executed set of results into a dsc.
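For instance, a stub wrapper might look roughly like this; the names and file layout are hypothetical, just to illustrate the idea:

```r
# Hypothetical stub for an expensive method that was run by hand on a cluster:
# the wrapper does no computation, it just loads the precomputed output so the
# rest of the dsc can score it like any other method's output.
expensive_method_stub <- function(input, args) {
  path <- file.path("external_results",
                    paste0(args$scenario, "_seed", args$seed, ".rds"))
  readRDS(path)  # assumed to be stored in the format a live wrapper would return
}
```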

While I've made a lot of progress cleaning up the internals of dscr to admit parallelization, that work is still a little ways off, and it contributes to maintainability even for a serial implementation. Unless you disagree with this course of action, I'll shelve the BatchJobs project for now and work on the rest of the punch list for readying the package. Making the lightweight "table of scores" ergonomically available on GitHub, and adding logic to lock parts of the dsc so that collaborators can extend it without needing your files, would rise to the top of my own priority list.