richfitz / wood

How much of the world is woody?
http://richfitz.github.io/wood

Citations to software? #11

Closed cboettig closed 10 years ago

cboettig commented 10 years ago

Hi @richfitz @mwpennell and co,

Scott mentioned this project to me -- looks incredibly awesome; hats off to you for such an impressive illustration of reproducible research. The travis integration in particular is an elegant proof-of-principle!

Given your great care in almost all other aspects including citation of data, I'm a bit surprised to see you're rather more reserved in providing any citations to any of your software dependencies (with version numbers, etc, including software you yourselves have written and published on), unless I have overlooked it somewhere?

mwpennell commented 10 years ago

Good point. We should address this. Do you have any suggestions as to how best to do this (i.e. where/how should we cite these?)

cboettig commented 10 years ago

I think it's a bit of an open question; your collective ideas are no doubt better than mine.

From a reproducibility standpoint, I think you should consider including some documentation of all that sessionInfo() returns as part of the README at least. My expectation is that one day the travis check will start to fail due to some downstream dependency changing (which is one reason why I'm fascinated by the travis setup you have to begin with).

From a credit standpoint this perhaps gets more subjective, but you might at least cite @richfitz MEE diversitree paper ;-) R will of course generate bibtex material for all the R packages you depend upon, including any pubs listed in the CITATION file. I say it's an open question, because obviously your complete dependency tree can get rather large rather quickly, and from a reproducibility/provenance standpoint, everyone can see that diversitree depends on ape, for instance. But from a credit / encouraging sharing of other research outputs standpoint, it seems fair to give Emmanuel one more citation to push him to the 2K mark; given that you're already careful to cite Dryad data alongside those publications.
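As a rough sketch of the bibtex point (not something taken from the wood repo): `citation()` and `toBibtex()` from the utils package produce entries for any installed package, including publications listed in its CITATION file. The package list and the `bib_file` name below are illustrative assumptions; "stats" is only there so the sketch runs on a bare R installation.

```r
# Packages to cite. "ape" and "diversitree" are the ones discussed in this
# thread; "stats" is a stand-in that is always installed with R.
pkgs <- c("stats", "ape", "diversitree")

# Keep only the packages that are actually installed on this machine.
available <- pkgs[vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Convert each package's citation (including any CITATION file entries)
# to BibTeX, with a blank line between packages.
bib <- unlist(lapply(available, function(p) {
  c(as.character(toBibtex(citation(p))), "")
}))

# Hypothetical output location; a real project would choose its own path.
bib_file <- file.path(tempdir(), "software.bib")
writeLines(bib, bib_file)
```

Entries generated this way usually need a citation key added by hand before they can be used in a manuscript.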

Personally I find the latter a somewhat unsatisfactory hack, because no one is going to be thorough about citing software dependencies, so the numbers will always be underestimates, and it's a rather crude way to go about it. But until we have a better alternative, I suppose the software dependencies are at least as worthy of a citation as any other citation.

sckott commented 10 years ago

@richfitz @mwpennell Great work y'all! Hope you don't mind I mentioned to @cboettig :)

On the package versioning topic: Attempting to have reproducible papers seems like a perfect use case for a node style package management system (this is just future-think, but anyway...), wherein all versions of R packages (all the way down the dep tree) used in the final analysis are stored locally in the repo, then the user clones the repo and runs make to reproduce the paper, ensuring that it works. Otherwise, a year from now, changes to some packages' APIs may break this flow.

richfitz commented 10 years ago

The citation thing is something we did discuss (#6, though offline mostly from the look of it).

The issue that we have is that while we use a bunch of packages they are all used really peripherally. Diversitree is probably the package we use the most, and I'm quite happy forgoing a citation here (or similarly if someone else used it in a similar way). We actually looked at getting rid of the dependency (#5) because it's a pain to install on OSX at the moment.

But that's all for the actual paper. I think the point of integrating the software used more explicitly into the analysis is great. Here (for my records at least) is sessionInfo() on my local machine:

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] diversitree_0.9-7 Rcpp_0.10.6.7     subplex_1.1-3     ape_3.0-11       
[5] deSolve_1.10-6    knitr_1.5        

loaded via a namespace (and not attached):
 [1] boot_1.3-9      codetools_0.2-8 digest_0.6.4    evaluate_0.5.1 
 [5] formatR_0.10    grid_3.0.2      lattice_0.20-23 nlme_3.1-111   
 [9] stringr_0.6.2   tools_3.0.2    

I'll add this output at the end of the analysis (#13) as a record.
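A minimal sketch of how that record might be written out at the end of an analysis script, assuming a plain text file is an acceptable format. The `record_file` name is hypothetical and the actual change in #13 may look different.

```r
# Capture the printed sessionInfo() output as a character vector,
# one element per printed line.
info <- capture.output(print(sessionInfo()))

# Hypothetical location for the record; adjust to the project layout.
record_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(info, record_file)

# The attached (non-base) packages and their versions are also available
# programmatically, which helps when comparing two machines.
si <- sessionInfo()
attached_versions <- vapply(si$otherPkgs, function(d) d$Version, character(1))
```

`attached_versions` is empty when no add-on packages are attached, so the output depends on what the analysis has loaded by that point.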

richfitz commented 10 years ago

The next issue is downstream changes breaking the package. I think that this is a hard problem to get right. We could use packrat to manage the dependencies, but I'm not generally a big fan of this approach - it seems sketchy to me.

This project is small enough that the number of things we depend on isn't that bad and hopefully won't change that much. But if we used packrat or something similar you wouldn't know which packages work with the current version. In theory if we're only using API things and everyone is good at updating their packages carefully there would be no problem.

So perhaps a solution is for the "final" version to document all the dependencies with packrat and then set up a build matrix on Travis that builds against the current versions of everything and the archived versions.

richfitz commented 10 years ago

Which brings me to the final issue: This stuff is hard. We all glibly talk about the need for reproducibility on twitter, etc, but actually trying to get something that runs on more than one computer in a vaguely repeatable way is really tricky. We're a bunch of nerds that get excited by this stuff, but the average field biologist just wants to get back into the field. Sorting this out properly cost us a couple of days for sure, and we can't as a field require that everyone does it. So the tools have to get way easier.

mwpennell commented 10 years ago

Amen. If I learned anything in this process, it is that the tools to do this right and easily just aren't available (and certainly not widely accessible). We are just making this stuff up as we go along.

sckott commented 10 years ago

Agreed @richfitz - The tools have to get wayyyyyyy easier. One challenge is figuring out the tools to make this all happen, and another is bringing that in an easy to use interface to where the average biologist works, e.g., in R.

This will be an interesting test case to see how long the automated builds on Travis still work as package versions change on CRAN. Since you don't have that many dependencies, perhaps it should work for a long while.

cboettig commented 10 years ago

@richfitz Thanks for the replies, excellent points all around.

Yeah, I'm pretty impressed how light your dependency footprint really is, particularly after excluding those used only for the reproducibility framework (RCurl, knitr). Nonetheless I would find it hard to argue that a citation to diversitree, and possibly to methods it is providing through deSolve and/or ape would be inappropriate. Whether or not the citation to the diversitree paper and software matter to you personally, the community as a whole benefits from properly documenting and acknowledging citations of this material, as well as examples of that practice. Citations to software are also what most of us rely on for version information, even when that information can be found elsewhere.

Of course I couldn't agree more that this style of reproducibility is way too hard to reach widespread adoption, even though it has gotten easier as @sckott says. (I've tried myself, though before travis, by just providing my code as an R package in which the paper is a vignette. Though this works in principle, dependency decay meant that the NESCent informatics team couldn't manage to install and run the code when they tried to do so only a year later!) But I think examples like yours serve as experiments that help us figure out where the pain points are and what can be made easier. I agree that there's huge scope for tool improvement to make this more routine.

I was actually meaning to ask you about just how much overhead and pain it was to you and your collaborators to go the extra miles on reproducibility here. (While not a perfect analogy, it puts some expectation on how far a reviewer or reader would have to go to reproduce the results as well.) Then I realized it was surprisingly easy for me to get a sense of this from the repo history. I was impressed that the project hadn't just been put on a repo when nearly finished, and it was easy to see where the travis integration step started, etc.

This raises an issue related to, but separate from, the reproducibility question: research provenance, in which this paper also excels, though you seem to emphasize the reproducibility more. To some readers the ability to understand how your approach evolved or who did what may matter more than the ability to regenerate the figures from a single command.

I'd still be curious to hear what parts you find to be the most difficult in doing this, why you've organized the repo the way you have (why not an R-package format, a la Gentleman and Temple Lang's 'research compendium' -- wouldn't that make installation and travis deployment a bit easier?) and what you learned from it.

richfitz commented 10 years ago

In general @cboettig, I agree with the idea of citing software wherever possible. But we only use ape and diversitree (and their respective dependencies) to draw a single tree figure -- one that a reviewer suggested removing because it's not really central to the paper.

However, the dependencies are now more explicitly recorded, both using sessionInfo at the bottom of the analysis and also using packrat. See the packrat.lock file, which is parseable with base::read.dcf. I also have added a make deps target that parses this file to install required dependencies.
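For anyone curious what parsing a DCF lock file and installing what is missing could look like, here is a rough sketch. It is not the actual make deps code: the file contents below are made up for illustration (using two versions from the sessionInfo() output above), and the real packrat.lock has more fields than the `Package` and `Version` assumed here.

```r
# Illustrative stand-in for a DCF-format lock file: one record per package,
# records separated by blank lines. The real packrat.lock differs in detail.
lock_text <- c(
  "Package: ape",
  "Version: 3.0-11",
  "",
  "Package: deSolve",
  "Version: 1.10-6"
)
lock_file <- tempfile(fileext = ".lock")
writeLines(lock_text, lock_file)

# read.dcf() returns a character matrix with one row per record.
deps <- as.data.frame(
  read.dcf(lock_file, fields = c("Package", "Version")),
  stringsAsFactors = FALSE
)

# Work out which of the recorded packages are not installed locally.
to_install <- setdiff(deps$Package, rownames(installed.packages()))

# Installing is left commented out so the sketch has no side effects.
# Note that install.packages() fetches the current CRAN version, not the
# version recorded in the lock file.
# install.packages(to_install)
```

Pinning the exact recorded versions would need something beyond plain `install.packages()`, which is part of what packrat handles.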

We would have started the travis integration earlier, but it requires open repos. I'm a fan though, and I think that continuous integration from the beginning of a research project could potentially save a huge amount of hassle. However, the long "build times" that a nontrivial analysis would take would probably require a self-hosted or paid setup.

Things that were hard: mostly not knowing where we were going. We directly copied ideas from this for Matt's model adequacy work (paper and package). The biggest issue I'd like to see solved is a way of depending transparently on data on Dryad without having to redownload every time. This is being worked on by various people at the moment. Packrat for the package dependencies was also a pain, because the design doesn't quite map onto what I wanted. I may be a control freak though.

Finally: Why not a package? I don't think that packages make a great template for the analysis, though they do for library code. In my experience, people get confused by the directory structure. Also, most of the code in the analysis is straight up instructions (do this, do that, then do this) that I find easiest to write depending on a single global set of data. I'd have gone for a package if the analysis was really nontrivial, depended on compiled code or if I didn't want the guts of the analysis explicitly in the script. Getting travis integration for non-package code was actually really easy, using this recipe.

cboettig commented 10 years ago

@richfitz thanks! Yes, you make a completely convincing case about the citations, and had I noticed issue #5 earlier I wouldn't have mentioned it. From a glance over the code it isn't always easy to tell how important the dependencies are or are not to the analysis.

Yeah, I've not found packrat to be exactly what I would do either, but maybe because I haven't spent enough time with it.

I'm not sure that you lost much by not having travis at the beginning, given that one is effectively doing all those runs locally at the time, though it's an intriguing idea. I'm particularly interested in travis for catching dependency problems, which have been a bit of a bugaboo with some old code I'd attached with earlier papers (e.g. https://github.com/cboettig/prosecutors-fallacy/pull/1), though perhaps something you'll avoid more easily thanks to the light dependency footprint.

Thanks for the explanation about the package vs non-package format as well. I agree that it can easily be overkill for certain situations, though at first glance the existence of an R directory, data directory, docs directory (~vignette), a Makefile, and so forth seems no less complex and a bit less intuitive to a user already familiar with the R package mechanism, and the package format opens the door to reusing certain bits and pieces like automatically handling dependencies (e.g. the points made for this use case by http://doi.org/10.1198/106186007X178663 ).

Anyway, thanks again for the replies and hope I haven't been a nuisance

On Sun, Apr 13, 2014 at 7:25 PM, Rich FitzJohn notifications@github.com wrote:

Closed #11 https://github.com/richfitz/wood/issues/11.

Carl Boettiger UC Santa Cruz http://carlboettiger.info/