ropensci / unconf14

Repo to brainstorm ideas (unconference style) for the rOpenSci hackathon.

Some harder problems to tackle at the hackathon #18

Closed karthik closed 10 years ago

karthik commented 10 years ago

Based on email discussions with @cboettig and @sckott, we've come up with a few harder problems that might be worth tackling with such a fantastic group assembled in one place. If these are worth pursuing, let's break them up into smaller issues.

Sustainable software and how to provide a stable ecosystem of R packages Problem: dependencies. CRAN only provides the latest version of each package; older archives are unreliable and not guaranteed to be installable, and there is no guarantee that an upgrade won't break existing code. Thoughts or ideas on how to deal with this would be great, especially from folks like @hadley / @jjallaire.
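One low-tech starting point for the discussion: pin exact dependency versions against the CRAN archive. A minimal sketch, assuming the long-standing `src/contrib/Archive/<pkg>/<pkg>_<ver>.tar.gz` layout (the package/version in the example is purely illustrative):

```shell
# Construct the CRAN archive URL for a pinned package version.
# Note: only superseded versions live under Archive/; the current
# release sits directly in src/contrib/.
cran_archive_url() {
  pkg="$1"; ver="$2"
  printf 'https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz\n' \
    "$pkg" "$pkg" "$ver"
}

# Example pin (hypothetical):
#   curl -sO "$(cran_archive_url dplyr 0.1.3)" && R CMD INSTALL dplyr_0.1.3.tar.gz
```

This does nothing about transitive dependencies, which is exactly the hard part worth hacking on.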

Ensuring Interoperability Interoperability between rOpenSci data sources is a theme we come back to frequently but have yet to work out. Our efforts match the data provider's model wherever possible, but it would be great to let researchers work seamlessly with data from anywhere (e.g. climate data, time series data, species occurrence data). Ideas along the lines of dplyr's philosophy would be most welcome.

Caching data While retrieving data from APIs is fantastic (and the core functionality behind most rOpenSci packages), APIs can disappear and data can change, both of which affect reproducibility. Similar to RStudio's packrat, it would be great to consider ideas for caching/snapshotting timestamped API calls along with code and narrative.
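As a strawman for the snapshot idea: stamp every API response with the retrieval time and a hash of the URL, so a paper can ship the exact payloads it was built from. A rough sketch in shell (the function name, file layout, and usage are invented for illustration):

```shell
# Save an API response (read from stdin) as a timestamped snapshot file,
# keyed by a hash of the URL so distinct endpoints never collide on disk.
# Usage: curl -s "$URL" | snapshot "$URL" cache/
snapshot() {
  url="$1"; dir="${2:-cache}"
  mkdir -p "$dir"
  stamp=$(date -u +%Y%m%dT%H%M%SZ)                    # when it was fetched
  key=$(printf '%s' "$url" | sha256sum | cut -c1-12)  # which endpoint
  out="$dir/${key}_${stamp}.json"
  cat > "$out"
  printf '%s\n' "$out"                                # report where it went
}
```

Re-running an analysis could then prefer the newest snapshot on disk over a live call, which is roughly the packrat idea applied to data instead of packages.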

Reproducibility I think one thing we should really tackle, if possible, is the issue of reproducibility. Outside of our expert/super-user bubble, regular scientists rarely use the suite of tools that we rely on every day. Hardly any papers are published as .Rmd files that reproduce the entire paper (sans the journal's own styling).
What are those roadblocks, and which parts of that pipeline can we streamline with the higher-level tools Hadley is known for?

dtrapezoid commented 10 years ago

I think this is a great summary of issues for the hackathon.

An organizational question: are we planning on having teams of participants address each of these issues based on knowledge base/preference, or shall we take more of a free-for-all approach?

jhollist commented 10 years ago

The Reproducibility problem strikes home for me.

I am a recent convert to the R Markdown/knitr/pandoc/makefile tool set and am quite enamored of it; however, many of my colleagues often point out that I am the unusual one and in spite of my proselytizing they are very unlikely to switch away from using Word. We could certainly make some progress by continuing to encourage others to try R Markdown/knitr/... and incorporating the same into undergraduate and graduate education, but that means we are at least a generation away from seeing significant changes.

I wonder if this group could make some progress towards making the existing tool set used by most scientists more reproducible. It seems that tackling reproducibility from the MS Word side could have the greatest impact. This would be similar to the way the DataUp project approached the problem of trying to get more scientists to better manage data and submit to DataONE. They worked with Microsoft Research to develop DataUp to work directly with Excel, and have since moved most of that directly to the cloud. I am not sure I am suggesting that route, just using it as a somewhat relevant example.

Given that I am most certainly on the extreme novice side of development (even more so in this group!), I have no idea how we might develop something for Word that could make it part of a reproducible workflow, or even whether that is possible. But seeing that Word and Office in general are moving to the cloud, it seems that incorporating reproducible analysis via something like OpenCPU is more feasible than it has ever been (forgive me if I am talking nonsense here).

In any event, having been reminded multiple times by several of my co-workers that they just aren't going to spend the time hacking that I do, it seems that to really increase the reproducibility of science we need to address the problem where much of that science is actually happening. And unfortunately, that isn't exclusively in the tools we think of as reproducible (e.g. R, Python).

Cheers, Jeff

eduardszoecs commented 10 years ago

> I think one thing we should really tackle, if possible, is the issue of reproducibility. Outside of our expert/super user bubble, regular scientists rarely use the suite of tools that we rely on every day. There are hardly any papers that are simply .Rmds to reproduce entire papers (sans journal's own style).

I personally don't like knitr or Sweave for writing research papers. However, I can still maintain reproducible research.

Some thoughts:

My workflow/structure/setup is something like this.

Folder structure: for R projects I follow Rob Hyndman:

All the paths are set up in load.R (and used via file.path()).

R code produces figures, which are stored in /report/fig and included in the LaTeX file. If I change anything in my code, the figures in the LaTeX document are updated as well.

Generally I develop the code first and then write the paper. With R and LaTeX separated, I can finish the code before starting to write, or change some code, recompile the LaTeX document, and get an updated version with the new results.

If I publish, I just put my folder in the supplement, and reproducers only need to change one path in load.R (this is also explained in a README file).
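For what it's worth, the one-path convention described above can be scripted so every new project starts out that way. A sketch under the stated assumptions (folder names are illustrative, not Rob Hyndman's exact layout):

```shell
# Create a minimal project skeleton where load.R is the only file a
# reproducer ever has to edit.
init_project() {
  root="$1"
  mkdir -p "$root/R" "$root/data" "$root/report/fig"
  cat > "$root/R/load.R" <<'EOF'
# Reproducers: change `root` to wherever the supplement was unpacked.
root <- "."
data_dir <- file.path(root, "data")
fig_dir  <- file.path(root, "report", "fig")
EOF
}
```

Because every other script builds paths with file.path() from load.R, moving the folder anywhere (including into a supplement) changes exactly one line.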

So this is my workflow explained in a few lines; I hope it is understandable. For research papers I would not want to miss the functionality of LaTeX. Markdown is easy, but in my opinion not flexible. I think this workflow also ensures reproducible research (in the sense of scientific publications) without using knitr/Sweave.

I am interested in your thoughts on my workflow and your experience writing scientific papers with Markdown.

mfenner commented 10 years ago

Ensuring Interoperability As the developer behind one of the APIs that rOpenSci is using (alm), I would be interested in an interoperability discussion. There is the client-tools side, i.e. how you write the R package, but there is also the API side, i.e. what best practices we can recommend (and that I would be happy to incorporate). Even a very short list of the latter would be a good start, and it could be based on some of the major pain points you hit writing R libraries that talk to those APIs.

Reproducibility I'm not very interested in extending Microsoft Word or Excel, only to the extent that I can import/export from the tools I use. For me reproducibility is very much linked to automation, and I just don't see how that can easily be done in those applications. Markdown, GitHub, Pandoc, Travis, etc. might look geeky now, but I'm happy to go in that direction.

karthik commented 10 years ago

OK, it looks like it might be better to split these up. I'll do that now and reference your comments so far.