Closed by jennybc 8 years ago
Hi @jennybc,
Thanks for checking in. The project is still very active. I just wrestled for a whole day to figure out how to fix the "pandoc stack space overflow" error in order to get an annotated heatmap published on RPubs. It's huge, so it will take some time to load …
http://rpubs.com/myuen/weevilInd_heatmap_annotated http://rpubs.com/myuen/constDiff_heatmap_annotated
There was a change in @whiteh5's priorities. He had to shift gears to work on 2 other manuscripts, one of which was recently accepted! The focus is now back on this one. I will let @whiteh5 chime in regarding his timeline for writing this one up.
I'll see you Wed at the DiagrammeR workshop!
Hi @jennybc,
As @myuen said, my priorities shifted a bit to finishing 2 other manuscripts before we tackled the white pine weevil project. The first 2 papers needed to be published before the white pine weevil project because they serve as the basis for this investigation. @myuen has done some really exciting analyses and helped make this massive dataset very accessible. We have found some really interesting patterns and are finishing up some targeted bench work looking into some of the genes of interest we identified in this study. I am now aiming for sometime in early 2016 to have a draft of this paper ready. It is a really large study, and the RNA-seq analyses are the centrepiece. We also have biological, chemical, proteomic, qRT-PCR, and microscopy data on both the plant and the insect. We are shooting for a high-impact journal, hence the time it's taking to put it together.
We will probably need some assistance when it comes to writing out some of the methods and maybe the results although I know Mack has a good handle on everything. Joerg likes to have fairly mature manuscripts ready before he sends them off to co-authors, so hopefully you will have something to see in the new year!
Good to hear from you and I look forward to your feedback. I will be in contact with you if there are any issues that arise in the writing of the MS.
Best, Justin
Great to hear!
I am asking for two reasons:
This would help you guys demonstrate complete openness and reproducibility re: data and computation, at least for this differential expression analysis, so it can only be positive.
That sounds cool! Let me know if I can help with the R package. Sounds like it's a wrapper?
This (public) thread is where a bunch of us started thinking out loud about this (related to a workshop we were all at):
https://github.com/ropensci/unconf/issues/31
And this document represents where we sort of ended up:
https://github.com/ropensci/rrrpkg
I'd be interested in seeing how well (or poorly) the set of analyses that made it into this repo can be packaged this way. Hence my desire to revisit the analyses and clean 'em up. To be clear, I have NO desire to disrupt things by producing different results; I just want to get a decent head start on structuring all of this so that it's ready at the same time as the manuscript. At some appropriate time/place, feel free to let Joerg know about this. I assume he'd be supportive of having such a companion repository for data and code (for a portion of what's in the paper). We can keep it private until publication and then flip a switch to public. Not sure what you mean by "high impact journal" but even Cell/Science/Nature seem to be gaining an appreciation of this.
I had planned to open the GitHub project and release the code for the sake of reproducibility anyway. Now it's just a matter of packaging it up.
I will be submitting the raw data to a public repository soon and have it protected until the manuscript is accepted. I will have to figure out if the processed and cleaned data are too large to be uploaded to GitHub.
What are the individual file sizes and the total?
The largest file is 70 MB (18 MB gzipped). The limma results file is 67 MB. I need to do some cleanup before I can work out the total size across all folders.
Up to 100 MB per file is OK on GitHub, I think, as long as they are very static files. I'm not sure at what point the total size starts to become a problem. Plus we could park large files elsewhere and make sure the pipeline document here grabs them and, once the intermediate and final product sizes start to come down, start including those pieces in the repo. Pretty close to what we've got now in terms of philosophy.
1GB total seems to be a pretty hard limit.
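A quick way to spot files that are too close to those limits before committing; this is a sketch, and the 95 MB threshold, paths, and function name are illustrative, not part of the project:

```shell
# Sketch: flag files above a size threshold (in MB) before committing,
# to stay under GitHub's ~100 MB per-file limit. Threshold is illustrative.
list_oversized() {
  dir="$1"
  limit_mb="$2"
  find "$dir" -type f -not -path '*/.git/*' -size +"${limit_mb}"M
}

# Anything over 95 MB is too close to the 100 MB per-file cap
list_oversized . 95

# Total working-tree size, to keep an eye on the ~1 GB soft limit
du -sh .
```

Running this from the repo root before each push would catch a 70 MB results file creeping past the limit after a re-run.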
Well, it depends on how much of the whole analysis pipeline we are trying to capture here. And there are a few external dependencies as well.
I think it would be valuable to see how far we could push ourselves to do a really kick-ass job of capturing as much as possible. As I said, I've sort of committed to trying to do this with one project, and I picked this one. A beautifully packaged set of analyses like this has value as a showcase for best practices, beyond the novel biology. But we can also be realistic when deciding which bits are worth doing this for and which are not.
Happy New Year to you all! Hope you had a good holiday.
I think it's good timing to bring this up again for discussion. I attended the R study group Docker tutorial last year and thought it might be a good platform for packaging up our project to achieve our goal of reproducibility. This will be my priority in the coming weeks, in addition to reviewing and finalizing my code and comments. Hope nothing major broke with the release of ggplot2 2.0 over the holidays.
Looking forward to this!
What aspects of it do you think require Docker?
BTW here's a good blog post on related matters:
Primarily, the ease of deployment, with all software (e.g. RStudio & Blast2GO) and code included in one package across multiple platforms. The other reason is more for archival purposes: it helps preserve the versions of all packages and dependencies that work with the released code.
Got it.
I think it's important to meet people on various levels. Specifically, don't make Docker the only route to reproducing our stuff -- still include, for example, all the data and R scripts in very transparent forms, for people who just want to browse or pick and choose the bits of interest and already have the setup needed for their goals.
I don't think you mean this either -- but want to make sure we don't end up with something like "Step 1: Install Docker. Step 2: Get to play with our data or workflow."
In fact, that's what I was thinking of doing. Is it advisable not to go that route?
I'd advise against making Docker a prerequisite. For me this feels like an instance of a common problem: say, making stuff available only in some proprietary form, like data in an Excel spreadsheet when it could be given as a plain text file (or as both). I'm all for the Docker stuff you propose, but in addition to sharing scripts, data, etc. in the "lowest common tech denominator" way too.
It should barely be any extra work or none at all.
Do you see what I mean?
Totally. The Docker package will only be an alternative outlet to the GitHub repo, for those who do not already have a machine set up and ready to go.
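For the archival side, something along the lines of this minimal Dockerfile sketch could pin the R environment. The base image tag and package list here are illustrative assumptions (rocker publishes versioned R + RStudio images), not the project's actual dependencies:

```dockerfile
# Illustrative base image: rocker ships version-pinned R + RStudio images
FROM rocker/rstudio:3.2.3

# Placeholder package install; the real dependency list lives in the repo
RUN R -e "install.packages(c('ggplot2', 'plyr'), repos = 'https://cran.r-project.org')"

# Copy the analysis code and data into the image
COPY . /home/rstudio/project
```

Anyone with Docker could then build and run this to get the exact environment in a browser-based RStudio, while everyone else still has the plain scripts and data in the repo.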
Hi @myuen!
Just saw the first commit in a while go by! How are you? How is the white pine weevil project going? Is it heading towards publication, and is there anything I can help with to that end?