Closed by jennybc 8 years ago
Hi @jennybc,
Thanks for checking in. The project is still very active. I just wrestled for a whole day to figure out how to fix the "pandoc stack space overflow" error in order to get an annotated heatmap published on RPubs. It's huge, so it will take some time to load …
http://rpubs.com/myuen/weevilInd_heatmap_annotated http://rpubs.com/myuen/constDiff_heatmap_annotated
There was a change in @whiteh5's priorities. He had to shift gears to work on 2 other manuscripts, one of which was recently accepted! The focus is now back on this one. I will let @whiteh5 chime in regarding his timeline for writing this one up.
I'll see you Wed at the DiagrammeR workshop!
Hi @jennybc,
As @myuen said, my priorities shifted a bit to finishing 2 other manuscripts before we tackled the white pine weevil project. The first 2 papers needed to be published before the white pine weevil project because they serve as the basis for this investigation. @myuen has done some really exciting analyses and helped make this massive dataset very accessible. We have found some really interesting patterns and are finishing up some targeted bench work looking into some of the genes of interest we identified in this study. I am now aiming for sometime in early 2016 to have a draft of this paper ready. It is a really large study, and the RNA-seq analyses are the centrepiece. We also have biological, chemical, proteomic, qRT-PCR, and microscopy data on both the plant and the insect. We are shooting for a high-impact journal, hence the time it's taking to put it together.
We will probably need some assistance when it comes to writing out some of the methods and maybe the results although I know Mack has a good handle on everything. Joerg likes to have fairly mature manuscripts ready before he sends them off to co-authors, so hopefully you will have something to see in the new year!
Good to hear from you and I look forward to your feedback. I will be in contact with you if there are any issues that arise in the writing of the MS.
Best, Justin
Great to hear!
I am asking for two reasons:
This would help you guys demonstrate complete openness and reproducibility re: data and computation, at least for this differential expression analysis, so it can only be positive.
That sounds cool! Let me know if I can help with the R package. Sounds like it's a wrapper?
This (public) thread is where a bunch of us started thinking out loud about this (related to a workshop we were all at):
https://github.com/ropensci/unconf/issues/31
And this document represents where we sort of ended up:
https://github.com/ropensci/rrrpkg
I'd be interested in seeing how well (or poorly) the set of analyses that made it into this repo can be packaged this way. Hence my desire to revisit the analyses and clean 'em up. To be clear, I have NO desire to disrupt things by producing different results; I just want to get a decent head start on structuring all of this so that it's ready at the same time as the manuscript. At some appropriate time/place, feel free to let Joerg know about this. I assume he'd be supportive of having such a companion repository for data and code (for a portion of what's in the paper). We can keep it private until publication and then flip a switch to public. Not sure what you mean by "high impact journal" but even Cell/Science/Nature seem to be gaining an appreciation of this.
I had planned to open the GitHub project and release the code for the sake of reproducibility anyway. Now it's just a matter of packaging it up.
I will be submitting the raw data to a public repository soon and have it protected until the manuscript is accepted. I will have to figure out if the processed and cleaned data are too large to be uploaded to GitHub.
What are the individual file sizes and the total?
The largest file is 70 MB (18 MB gzipped). The limma results file is 67 MB. I need to do some cleanup before I can work out the total size across all folders.
Up to 100 MB per file is OK on GitHub, I think, as long as they are very static files. I'm not sure at what point the total size starts to become a problem. Plus we could park large files elsewhere and make sure the pipeline document here grabs them and, once the intermediate and final product sizes start to come down, start including those pieces in the repo. Pretty close to what we've got now in terms of philosophy.
1GB total seems to be a pretty hard limit.
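A quick way to spot files that are too close to those limits before committing; this is a sketch, and the 95 MB threshold, paths, and function name are illustrative, not part of the project:

```shell
# Sketch: flag files above a size threshold (in MB) before committing,
# to stay under GitHub's ~100 MB per-file limit. Threshold is illustrative.
list_oversized() {
  dir="$1"
  limit_mb="$2"
  find "$dir" -type f -not -path '*/.git/*' -size +"${limit_mb}"M
}

# Anything over 95 MB is too close to the 100 MB per-file cap
list_oversized . 95

# Total working-tree size, to keep an eye on the ~1 GB soft limit
du -sh .
```

Running this from the repo root before each push would catch a 70 MB results file creeping past the limit after a re-run.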
Well, it depends on how much of the whole analysis pipeline we are trying to capture here. And there are a few external dependencies as well.
I think it would be valuable to see how far we could push ourselves to do a really kick-ass job of capturing as much as possible. As I said, I've sort of committed to trying to do this with one project, and I picked this one. A beautifully packaged set of analyses like this has value as a showcase for best practices, beyond the novel biology. But we can also be realistic when deciding which bits are worth doing this for and which are not.
Happy New Year to you all! Hope you had a good holiday.
I think it's good timing to bring this up again for discussion. I attended the R study group Docker tutorial last year and thought it might be a good platform for packaging up our project to achieve our goal of reproducibility. This will be my priority in the coming weeks, in addition to reviewing and finalizing my code and comments. Hope nothing major broke with the release of ggplot2 2.0 over the holidays.
Looking forward to this!
What aspects of it do you think require Docker?
BTW here's a good blog post on related matters:
Primarily, the ease of deployment, with all software (e.g. RStudio & Blast2GO) and code included in one package across multiple platforms. The other reason is more for archival purposes: it helps preserve the versions of all packages and dependencies that work with the released code.
Got it.
I think it's important to meet people on various levels. Specifically, don't make Docker the only route to reproducing our stuff -- still include, for example, all the data and R scripts in very transparent forms, for people who just want to browse or pick and choose the bits of interest and already have the setup needed for their goals.
I don't think you mean this either -- but want to make sure we don't end up with something like "Step 1: Install Docker. Step 2: Get to play with our data or workflow."
In fact, that's what I was thinking of doing. Is it advisable not to go that route?
I'd advise against making Docker a prerequisite. For me this feels like an instance of a common problem: say, making stuff available only in some proprietary form, like data in an Excel spreadsheet when it could be given as a plain text file (or as both). I'm all for the Docker stuff you propose, but in addition to sharing scripts, data, etc. in the "lowest common tech denominator" way too.
It should barely be any extra work or none at all.
Do you see what I mean?
Totally. The Docker package will only be an alternative outlet to the GitHub repo, for those who do not already have a machine set up and ready to go.
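For the archival side, something along the lines of this minimal Dockerfile sketch could pin the R environment. The base image tag and package list here are illustrative assumptions (rocker publishes versioned R + RStudio images), not the project's actual dependencies:

```dockerfile
# Illustrative base image: rocker ships version-pinned R + RStudio images
FROM rocker/rstudio:3.2.3

# Placeholder package install; the real dependency list lives in the repo
RUN R -e "install.packages(c('ggplot2', 'plyr'), repos = 'https://cran.r-project.org')"

# Copy the analysis code and data into the image
COPY . /home/rstudio/project
```

Anyone with Docker could then build and run this to get the exact environment in a browser-based RStudio, while everyone else still has the plain scripts and data in the repo.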
Hi @myuen!
Just saw the first commit in a while go by! How are you? How is the white pine weevil project going? Is it heading towards publication, and is there anything I can help with to that end?