ropensci / rrrpkg

Use of an R package to facilitate reproducible research

Multi-experiment studies #5

Open tmalsburg opened 9 years ago

tmalsburg commented 9 years ago

If a study consists of multiple experiments, how should the data and materials be structured? The most natural way would be to have a directory for each experiment, but that runs counter to the approach proposed here. If, on the other hand, the files for each experiment are scattered across the various directories (data, R, analysis, ...), it might make sense to have some sort of naming convention, e.g.:

benmarwick commented 9 years ago

Yes, good question, thanks for your comments. As I noted in #4, it would be great to have an actual example of someone making a good attempt at this. Maybe it could be your next publication? :) Or one you've already got out?

There are no hard-and-fast rules (yet). What we're doing here is mostly looking at how people are already solving these problems in their own research, in ways that can be generalised to be useful and practical for many other researchers (rather than trying to be too prescriptive and disconnected from the norms of practice).

tmalsburg commented 9 years ago

I agree that this proposal shouldn't be too prescriptive. However, if I understand correctly, one goal is to be compatible with R's package structure (so that compendia can be installed like ordinary R packages). Doesn't that mean that we inherit many of the conventions described in (the notorious) "Writing R Extensions"?

Having one directory for each experiment makes sense in the project that I'm currently working on, but it would not conform with R's package structure which requires that data is stored in the top-level data directory.

benmarwick commented 9 years ago

That's a good question. One relevant factor is whether your goal is to have the package on CRAN or not. For me, I don't think I'll ever submit any of my research compendia packages to CRAN, so I don't feel too bound to the manual. I'm happy with the minimum needed to allow the package to build and don't mind some warnings and notes. Others may have different standards and goals for their compendia, and I'm keen to see what standards emerge in others' work.

Regarding multi-experiments, I might organise my compendium something like this:

project
|- DESCRIPTION
|- README.md
|- NAMESPACE
|- LICENSE
|
|- data/
|  +- exp_1/
|  |  +- my_exp1_data.csv
|  |  +- README.md
|  +- exp_2/
|  |  +- my_exp2_data.csv
|  |  +- README.md
|  +- exp_3/
|     +- my_exp3_data.csv
|     +- README.md
|
|- analysis/
|  +- my_report.Rmd
|  +- exp_1_analysis.R
|  +- exp_2_analysis.R
|  +- exp_3_analysis.R
|
|- R/
|  +- my_functions.R
|
|- man/
|  +- my_functions.Rd

But that might not make sense for your project, I don't know. I'd be curious to know what structure you use for your multi-experiment project. Would you mind posting it here when you've done it?

To include things like experimental materials (cf. #4), you could either put them in a directory in inst/ or just in a top-level directory like experimental_materials, similar to how some of us have a manuscript directory. That second option is a non-standard extension to the classic R package structure (and probably wouldn't be allowed on CRAN), but it seems to make sense, and packages with these extra directories still work as installable objects.

tmalsburg commented 9 years ago

My study is in progress and unfortunately I can't share the package at this stage. The structure is the following:

├─ README.org
├─ DESCRIPTION
├─ R
│  ├─ geometric.functions.R
│  ├─ ordered_plots.functions.R
│  └─ waic.functions.R
├─ Experiment1
│  ├─ stimuli.txt
│  ├─ presentation.py
│  ├─ results.csv
│  ├─ read_data.functions.R
│  ├─ inspect_raw_data.script.R
│  ├─ participants.csv.gpg
│  ├─ descriptive_stats.org
│  └─ analysis.script.R
├─ Experiment2
│  └─ …
├─ Experiment3
│  └─ …
└─ Manuscript
   └─ manuscript.org

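At the top level we have:

  • README.org: a literate org file (similar to R-markdown). Github understands org, so this file is nicely rendered.
  • DESCRIPTION: as defined in Writing R Extensions.
  • R: contains general-purpose R functions as in the current proposal.

Within Experiment1:

  • stimuli.txt
  • presentation.py: the script used for presenting the stimuli during the experiment.
  • results.csv: the data.
  • read_data.functions.R: experiment-specific functions for reading raw and cleaned-up data. Used in inspect_raw_data.script.R, descriptive_stats.org, and analysis.script.R.
  • inspect_raw_data.script.R: generates a series of plots for screening the raw data.
  • participants.csv.gpg: contains participant information and for each participant a flag indicating whether or not their data should be included in the analysis. This file is encrypted because at the current stage it contains sensitive information. This will change once we publish the repository.
  • descriptive_stats.org: a literate org file. Would probably go to vignettes if I’d follow the R-package style (but there is no vignette engine for literate org, hm …).
  • analysis.script.R: inferential stats. Will be converted to a literate org file at a later stage.

The other experiment directories are similar.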

That’s what I have so far. Work in progress. What I like very much about this approach is that the directory structure reflects the structure of the study. That would not be the case if I adopted the R-package approach.

gmbecker commented 9 years ago

Titus,

My major objection to this approach, which for me is a deal breaker, is that you lose a ton of the benefits of a unified structure. Your directory structure makes perfect sense to a human looking at it, but it is difficult, if not impossible, to compute on, and even where it is not impossible I would argue it is pretty far from optimal.

What should data(stimuli) do if your analysis package is loaded? Where should R look for the data? How can R even know what data is available? That is the reason that data lives in one of a few different places/forms: it guarantees that R can find it and give it to the user for any package loaded in the session. That doesn't seem to be the case here without a lot of pretty ugly hacks ("just grep for directory names that start with 'Experiment' and look at all the files in there every time...").

A naming scheme within the data/ directory is much more reasonable from a tooling perspective, or at the very least, an extension thereof with directories under data/ (I'd have to look at how that might work, though).
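To make the tooling argument concrete, here is a minimal sketch of how a flat data/ directory with a naming scheme could be computed on. The convention `exp<N>_<dataset>.<ext>` and the function name `group_by_experiment` are hypothetical and were not proposed anywhere in this thread. Python is used only for illustration; the same few lines could be written in R.

```python
import re
from collections import defaultdict

# Hypothetical convention (an assumption, not from the thread):
#   exp<N>_<dataset>.<ext>, e.g. exp1_results.csv
PATTERN = re.compile(r"^exp(\d+)_(.+?)\.[^.]+$")


def group_by_experiment(filenames):
    """Map experiment number -> sorted list of dataset names."""
    groups = defaultdict(list)
    for name in filenames:
        match = PATTERN.match(name)
        if match:  # files that do not follow the convention are ignored
            groups[int(match.group(1))].append(match.group(2))
    return {exp: sorted(names) for exp, names in sorted(groups.items())}


# Example listing of a flat data/ directory.
files = ["exp1_results.csv", "exp1_stimuli.txt", "exp2_results.csv", "README.md"]
print(group_by_experiment(files))
```

The point of the sketch is that a tool needs only one fixed location and one pattern to enumerate what is available per experiment, with no guessing about directory names.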

Speaking of such extensions though, I do think it might be reasonable to write the "spec" in such a way that the analysis package can contain either individual .R/.Rmd/.org/etc files OR subdirectories of such files grouped together. Since we are defining what an analysis package does we can say that, for example, if no top-level files are present, the directories define "subanalyses" (a formal notion we would come up with).
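As a rough sketch of what that rule could look like, assuming "no top-level files" means "no top-level analysis files" and taking only the three extensions named above (the "etc" is left open): the names `resolve_analyses`, `scan`, and `ANALYSIS_SUFFIXES` are made up for illustration and are not part of any existing spec or package. Python is used only to show the logic.

```python
from pathlib import Path

# Extensions named in the comment above; anything else is ignored here.
ANALYSIS_SUFFIXES = {".R", ".Rmd", ".org"}


def resolve_analyses(top_level_files, subdirs):
    """Top-level analysis files win; otherwise subdirectories are subanalyses.

    top_level_files: file names directly inside the analysis directory
    subdirs: mapping of subdirectory name -> file names inside it
    Returns a mapping of subanalysis name -> sorted analysis files.
    The key "" stands for a single, undivided analysis.
    """
    def is_analysis(name):
        return Path(name).suffix in ANALYSIS_SUFFIXES

    top = sorted(f for f in top_level_files if is_analysis(f))
    if top:
        return {"": top}
    found = {
        name: sorted(f for f in files if is_analysis(f))
        for name, files in sorted(subdirs.items())
    }
    # Subdirectories without any analysis file are not subanalyses.
    return {name: files for name, files in found.items() if files}


def scan(analysis_dir):
    """Read the two inputs of resolve_analyses from a directory on disk."""
    root = Path(analysis_dir)
    top = [p.name for p in root.iterdir() if p.is_file()]
    subdirs = {
        p.name: [q.name for q in p.iterdir() if q.is_file()]
        for p in root.iterdir() if p.is_dir()
    }
    return resolve_analyses(top, subdirs)


print(resolve_analyses([], {"exp_1": ["analysis.R"], "exp_2": ["analysis.Rmd", "notes.txt"]}))
```

Keeping the rule in a pure function separate from the directory scan makes the "formal notion" easy to state and to test without touching the file system.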

Best, ~G

On Wed, Jun 3, 2015 at 8:49 PM, Titus von der Malsburg <notifications@github.com> wrote:

My study is in progress and unfortunately I can't share the package at this stage. The structure is the following:

├─ README.org
├─ DESCRIPTION
├─ R
│  ├─ geometric.function.R
│  ├─ ordered_plots.function.R
│  └─ waic.function.R
├─ Experiment 1
│  ├─ stimuli.txt
│  ├─ presentation.py
│  ├─ results.csv
│  ├─ read_data.functions.R
│  ├─ inspect_raw_data.script.R
│  ├─ participants.csv.gpg
│  ├─ descriptive_stats.org
│  └─ analysis.script.R
├─ Experiment 2
│  └─ …
├─ Experiment 3
│  └─ …
└─ Manuscript
   └─ manuscript.org

At the top level we have:

  • README.org: a literate org file (similar to R-markdown). Github understands org, so this file is nicely rendered.
  • DESCRIPTION: as defined in Writing R Extensions.
  • R: contains general-purpose R functions as in the current proposal.

Within Experiment1:

  • stimuli.txt
  • presentation.py: the script used for presenting the stimuli during the experiment.
  • results.csv: the data.
  • read_data.functions.R: experiment-specific functions for reading raw and cleaned-up data. Used in inspect_raw_data.script.R, descriptive_stats.org, and analysis.script.R.
  • inspect_raw_data.script.R: generates a series of plots for screening the raw data.
  • participants.csv.gpg: contains participant information and for each participant a flag indicating whether or not their data should be included in the analysis. This file is encrypted because at the current stage it contains sensitive information. This will change once we publish the repository.
  • descriptive_stats.org: a literate org file. Would probably go to vignettes if I’d follow the R-package style (but there is no vignette engine for literate org, hm …).
  • analysis.script.R: inferential stats. Will be converted to a literate org file at a later stage.

The other experiment directories are similar.

That’s what I have so far. Work in progress. What I like very much about this approach is that the directory structure reflects the structure of the study. That would not be the case if I would adopt the R-package approach.


Gabriel Becker, PhD
Computational Biologist
Bioinformatics and Computational Biology
Genentech, Inc.

tmalsburg commented 9 years ago

@gmbecker, I'm fully aware that data(...) doesn't work in my approach. That's why I contrasted it with the "R-package approach" and that's why I have the file read_data.functions.R, which contains functions that fill in for data(...). One basic fact that I have to acknowledge is that many of my colleagues are not R-hackers. They are Python-, Julia-, or Matlab-hackers, or, more likely, no hackers at all. Given that this is my audience, human readability is something that I would not like to give up easily. Having said that, I do see the appeal of being able to install a compendium and use R's package infrastructure, but to me this seems like a nice-to-have convenience, not an essential requirement. Please note that I do not propose adoption of my approach in the context of the current effort. I just responded to Ben's request above.