hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0
20 stars 4 forks source link

Google Storage file access add-on for cohorts (a.k.a. minibucket) #219

Closed armish closed 7 years ago

armish commented 7 years ago

@jburos: This adds a really light-weight but functional module to the cohorts so that instead of dealing with all the NFS setups and remote machine admin work, we can directly make use of files on the bucket for relatively small tasks. I don't think this solution will scale up to BAMs, but VCFs should be fine to handle this way.

Here is a notebook that goes over the basic functionality really quick: https://gist.github.com/armish/b4994d51775a390fb1057d80501683d3

@timodonnell @ryan-williams @iskandr: I know TensorFlow has its own shady integration that is pretty smooth on the App/GPU Engines but doesn't work on our locals and I think Scala/Spark world is enjoying the DataFlow (Apache Beam) utilities for such integration, but let me know if there is a better way of doing this.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-2.5%) to 54.427% when pulling 64a26a2003f152600aadf689bd736d41a923e506 on gcloud into 31545c4b03d9a1edd9cc543ac1f863a140e88cd8 on master.

jburos commented 7 years ago

Looks great @armish, thanks so much. This will be super helpful.

Only thing I noticed is that this will not work for python <= 3.6 -- the AbstractContextManager class isn't available for earlier versions. It's a bit of a shame since we're using 3.5 for reticulate compatibility.

At any rate, should we make this requirement explicit for Cohorts to avoid errors with earlier versions of python? Or, is there a way to introduce this functionality optionally? Open to suggestions.

armish commented 7 years ago

@jburos: oh, and I forgot to mention: cohorts still doesn't know about this new addition. This is on purpose since we should better do the integration slowly and carefully and make sure the additional support doesn't bite us for the upcoming wrap-ups, especially during these days/weeks.

So, consider this as being 1/N of minor refactorings coming in and definitely a non-urgent PR to CR. Completely fine with me to keep stacking them up until Tavi comes back and the projects start running a little bit slower.

jburos commented 7 years ago

@armish nvm see that you fixed the abstractManagerClass issue. What do you think - should we start working in a develop branch for the accumulation of these minor PRs?

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-2.5%) to 54.427% when pulling bf53812d0a5c4c2c4d7d547190804f7a439b936d on gcloud into 31545c4b03d9a1edd9cc543ac1f863a140e88cd8 on master.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-2.5%) to 54.427% when pulling aa0fd4e786ada1f9820cc4bad1e2d3efd296c02e on gcloud into 31545c4b03d9a1edd9cc543ac1f863a140e88cd8 on master.

tavinathanson commented 7 years ago

Great stuff, @armish!