Maksimovic: Data availability

grimbough commented 8 years ago

Having read the manuscript, here's my understanding of the data you're working with. Please let me know if I've got this wrong, as the rest of my comments might change.

You're working from IDAT files
There are 11 samples (22 IDAT files), which should be ~100MB of space.
The IDAT files aren't available from GEO; only the extracted intensities are.

Assuming that's correct, I would propose that we stick to the simplest option, and just include the IDAT files and the sample sheet in the workflow package. Since it's a package, R will always know where it's installed, and it's easy to reference the data using system.file().
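To illustrate the system.file() approach, here is a minimal sketch. The package name `myWorkflowPackage` and the `extdata` sub-directory are placeholders I've made up for the example, not the workflow's actual layout, and `find_idats()` is a hypothetical helper.

```r
# Locate IDAT files shipped inside an installed package.
# ASSUMPTION: the package name and the "extdata" sub-directory are
# hypothetical placeholders; substitute the real ones.
find_idats <- function(pkg, subdir = "extdata") {
  data_dir <- system.file(subdir, package = pkg)
  # system.file() returns "" when the package or directory is not found
  if (!nzchar(data_dir)) {
    return(character(0))
  }
  list.files(data_dir, pattern = "\\.idat$", recursive = TRUE,
             full.names = TRUE, ignore.case = TRUE)
}

idats <- find_idats("myWorkflowPackage")  # hypothetical package name
length(idats)  # number of IDAT files found (0 if the package is absent)
```

The directory found this way could then be handed to whatever function reads the sample sheet and IDATs, so the workflow text never needs a hard-coded path.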

100MB isn't a large download, so I think it's nicer to work from the initial data, rather than save the RGChannelSet as an rda and load() it. I think this is also preferable to a separate data package since the samples come from two different experiments and publications.

If the IDAT data were on GEO, then I would suggest at least including an example of how to obtain them from there, but if that isn't the case, then it's not needed.

JovMaksimovic commented 8 years ago

We actually use 2 separate datasets in the workflow. The first is the one you mentioned (GSE49667), which is 11 samples and ~150MB. The second dataset (GSE30870) is used later in the workflow and comprises 39 samples, or ~600MB. As you noted, the IDAT files are not available from the GEO record. However, the raw fluorescence intensities and detection p-values are available as supplementary files, in the form of gzipped text files (about 25 and 95MB, respectively). I'm thinking of perhaps using download.file to download these from the ftp URL in R (with a check to make sure the file hasn't already been downloaded). I can then read the fluorescence data into R from the downloaded txt.gz file and create a MethylSet from the data table. The rest of the workflow can then proceed as before. Thoughts?
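A minimal sketch of the download-once check described above, using base R only. `fetch_once()` is a hypothetical helper, and the URL and file name in the usage comment are placeholders rather than the real GEO supplementary file paths.

```r
# Download a file only if it is not already present locally.
# ASSUMPTION: fetch_once() is an illustrative helper, not part of any package.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    # mode = "wb" keeps the gzipped file intact on Windows
    download.file(url, destfile, mode = "wb")
  }
  destfile
}

# Usage (placeholders, not the actual GEO paths):
# path <- fetch_once("<ftp URL of the supplementary file>",
#                    "data/signal_intensities.txt.gz")
# intensities <- read.delim(gzfile(path), check.names = FALSE)
```

Returning the local path either way means the code that follows doesn't need to care whether the download actually happened on this run.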

grimbough commented 8 years ago

In the world of Netflix and Spotify, I personally don't think a download of 750MB is that much, especially if I'm downloading it only once and it's a resource I'm hoping to use a few times, which a tutorial may well be. Several of the existing annotation & experimental data packages are larger than this, although not many.

Maybe including the IDAT files for the smaller first dataset with the workflow allows you to demonstrate (and keep testing) code to read the raw data. You could then distribute and load the object that results from read.450k.exp() for the second dataset, and just make it clear in the text that you're doing this to save space for the purpose of the example. This of course assumes the R object is smaller than the raw IDATs.
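Whether the saved object really is smaller than the raw files is easy to check before committing to this. A rough sketch, using a plain numeric matrix as a stand-in for the object returned by the reading function (the real object and its size will of course differ):

```r
# Stand-in for a large object; in the workflow this would be the result
# of reading the raw data. ASSUMPTION: a numeric matrix is used purely
# for illustration.
obj <- matrix(runif(1e5), ncol = 10)

# Save with xz compression and report the size on disk in bytes
rda <- tempfile(fileext = ".rda")
save(obj, file = rda, compress = "xz")
file.size(rda)

# Later, restore the object under its original name
rm(obj)
load(rda)
dim(obj)
```

Comparing file.size() of the .rda against the summed file.size() of the IDAT files would show whether distributing the object actually saves space.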

Alternatively, making the second dataset an example of downloading the intensities from GEO and creating a MethylSet from there sounds like it might be a very useful thing to include. For me covering a variety of ways people may encounter data is always good, and having evaluated code chunks so you know if something you rely on alters its behaviour is crucial.
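As a sketch of the "intensities to MethylSet" step, the code below splits a wide intensity table into methylated and unmethylated matrices. The column layout is an assumption on my part: an `ID_REF` probe column plus `<sample>.Methylated` and `<sample>.Unmethylated` columns. The real supplementary file may be laid out differently, and `split_intensities()` is a hypothetical helper.

```r
# Split a wide intensity table into Meth / Unmeth matrices.
# ASSUMPTION: the probe ID column name and the two column suffixes are
# hypothetical; adjust them to match the actual file.
split_intensities <- function(tab, id_col = "ID_REF",
                              meth_suffix = ".Methylated",
                              unmeth_suffix = ".Unmethylated") {
  pick <- function(suffix) {
    cols <- names(tab)[endsWith(names(tab), suffix)]
    m <- as.matrix(tab[, cols, drop = FALSE])
    rownames(m) <- as.character(tab[[id_col]])
    # Strip the suffix so both matrices share the same sample names
    colnames(m) <- substr(cols, 1, nchar(cols) - nchar(suffix))
    m
  }
  meth <- pick(meth_suffix)
  unmeth <- pick(unmeth_suffix)
  stopifnot(identical(colnames(meth), colnames(unmeth)))
  list(Meth = meth, Unmeth = unmeth)
}

# Toy table standing in for the downloaded data
toy <- data.frame(ID_REF = c("cg01", "cg02"),
                  S1.Methylated = c(100, 200), S1.Unmethylated = c(10, 20),
                  S2.Methylated = c(300, 400), S2.Unmethylated = c(30, 40))
mats <- split_intensities(toy)
mats$Meth
```

Two matrices of this shape are the kind of input minfi's MethylSet() constructor takes through its Meth and Unmeth arguments, together with the array annotation.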

seandavi / F1000R_BiocWorkflows
