Closed armish closed 7 years ago
Looks great @armish, thanks so much. This will be super helpful.
Only thing I noticed is that this will not work for python <= 3.6 -- the AbstractContextManager class isn't available for earlier versions. It's a bit of a shame since we're using 3.5 for reticulate compatibility.
At any rate, should we make this requirement explicit for Cohorts to avoid errors with earlier versions of python? Or, is there a way to introduce this functionality optionally? Open to suggestions.
@jburos: oh, and I forgot to mention: cohorts
still doesn't know about this new addition. This is on purpose since we should better do the integration slowly and carefully and make sure the additional support doesn't bite us for the upcoming wrap-ups, especially during these days/weeks.
So, consider this as being 1/N of minor refactorings coming in and definitely a non-urgent PR to CR. Completely fine with me to keep stacking them up until Tavi comes back and the projects start running a little bit slower.
@armish nvm see that you fixed the abstractManagerClass issue. What do you think - should we start working in a develop
branch for the accumulation of these minor PRs?
Great stuff, @armish!
@jburos: This adds a really light-weight but functional module to the cohorts so that instead of dealing with all the NFS setups and remote machine admin work, we can directly make use of files on the bucket for relatively small tasks. I don't think this solution will scale up to BAMs, but VCFs should be fine to handle this way.
Here is a notebook that goes over the basic functionality really quick: https://gist.github.com/armish/b4994d51775a390fb1057d80501683d3
@timodonnell @ryan-williams @iskandr: I know TensorFlow has its own shady integration that is pretty smooth on the App/GPU Engines but doesn't work on our locals and I think Scala/Spark world is enjoying the DataFlow (Apache Beam) utilities for such integration, but let me know if there is a better way of doing this.