@ramanshah @stephens999
Since the problem is well-defined and the datasets are relatively small, I can easily modify my dsc to satisfy the two rules above. I will send you the modified dsc sometime next week.
Would it be interesting to compare these two ways of implementing a dsc for specific research problems? The "cheating" way I am using now works. Perhaps I could create some examples to illustrate the potential harm of not obeying the two rules above? Just let me know.
I think just having a full dsc that carefully implements this interface for a realistic benchmarking study would be a major help. We'd be able to share your dsc as an exemplar of how to use `dscr` to organize one's code and data.

Thanks much for your work on this, Xiang! Sorry if I picked on you too much in the above - what you've done in your dsc so far is just the same as what the other `dscr` users have done.
To clarify, I don't see `dscr` as a tool for making research more reproducible. It is for making it more extensible. If it also helps with reproducibility, that is great, but that is not the main emphasis.

And I'm fine with methods taking a pointer to a dataset (a filename) rather than the dataset itself; I envisaged that usage when it was designed.

But I agree with the observation that "the dark 'off-the-books' portion of the benchmarking study is usually convoluted and almost never documented."

Regarding "My feeling is that mixing undocumented data-preparation with dscr results in something that is even less reproducible than performing a benchmarking study by hand but with a moderate quality of lab notes describing the process" - maybe. But it is the documentation that is the key in both cases. So I'd agree that "dscr does not obviate the need to document your work".
OK, I see.
One thought I had over the weekend about priorities, regarding the above: if strict invariants about reproducible data integrity are not going to be enforced, then parallelization is not important for the viability of `dscr` - one can simply run and score the computationally costly methods manually with traditional cluster-based job submissions, put in stubs for those methods, and then point one's dsc to the scores after the fact. `dscr` would still often be a help for organizing the presumably larger number of computationally trivial methods, as well as for sharing scores for collaborative benchmarking. There is a big cost in complexity and portability for any kind of parallel implementation, including one that goes through `BatchJobs`, and it would be a big plus, in a way, if we could keep `dscr` serial.
We could in fact present, as a vignette, a tutorial on how to patch an external, manually executed set of results into a dsc; see the sketch below.
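For concreteness, a stub in that style might look roughly like the following (this assumes the usual `dscr` method-wrapper signature of `function(input, args)`; the file layout, the "bigmodel" name, and the `run_id` field are all invented for illustration):

```r
# Hypothetical stub for a method whose heavy computation was run separately
# on a cluster. Nothing is fitted here; we just load the externally
# produced result so the rest of the dsc can score it as usual.
bigmodel_stub <- function(input, args) {
  readRDS(file.path("external_results",
                    paste0("bigmodel_", input$run_id, ".rds")))
}
```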
While I've made a lot of progress cleaning up the internals of `dscr` to admit parallelization, it is still a little ways off, and the work contributes to the maintainability even of a serial implementation. Unless you disagree with this course of action, I'll shelve the `BatchJobs` project for now and work on the rest of the punch list for readying the package. Making the lightweight "table of scores" ergonomically available on GitHub, and adding logic to lock parts of the dsc so that collaborators can extend it without having your files, would rise to the top of my own priority list.
Talking today with @xiangzhu and bringing his dsc up to date was illuminating. Like most others who use `dscr` in practice, he "cheats" `dscr` by having his method wrappers point to hand-prepared datasets, already pre-processed to drop into the various methods he wants to test. His `input` objects are just fragments of filenames, and his `wrapper` functions simply assemble an input filename from the `input` fragment and the name of the method to be tested, then load the hand-prepared data residing in that file.
This practice struck me as kind of smelly, and our discussion clarified my thoughts on why we claim that `dscr` is a tool for making research more reproducible. It boils down to users obeying and implementing a single kind of interface.

To me it seems that the spirit of `dscr` is that each method gets handed a bitwise identical copy of the actual data to process, and that a scientist new to a project will be able to audit for preprocessing errors, configuration choices, etc., by tracing the execution path preceding the `run_dsc` verb.
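By contrast, the intended pattern looks something like the sketch below (the datamaker/method/score signatures follow my reading of the `dscr` examples, but treat them as assumptions; the toy problem is made up):

```r
# All data preparation happens in the datamaker, on the books, so a
# newcomer can audit it by reading this function before calling run_dsc().
datamaker <- function(seed, args) {
  set.seed(seed)
  x <- rnorm(args$n, mean = args$true_mean)
  list(meta  = list(true_mean = args$true_mean),
       input = list(x = x))  # the actual data, not a filename
}

# Every method receives the same input object emitted by the datamaker.
method_mean <- function(input, args) {
  list(est = mean(input$x))
}

# Scores are computed from the stored meta and the method's output.
score_sqerr <- function(data, output) {
  list(sqerr = (output$est - data$meta$true_mean)^2)
}
```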
To date, I've yet to see a student do a real scientific application with `dscr` that actually implements the above interface with rigor. And worse, the dark "off-the-books" portion of the benchmarking study is usually convoluted and almost never documented. My feeling is that mixing undocumented data preparation with `dscr` results in something that is even less reproducible than performing a benchmarking study by hand but with a moderate quality of lab notes describing the process.

Thoughts? Suggestions for an application with which we could convey the above in a third vignette?