googlegenomics / dataflow-java

Google Cloud Dataflow pipelines such as Identity-By-State as well as useful utility classes.
Apache License 2.0
36 stars 31 forks source link

Update to handle API reading and optimize sharded writing and indexing #165

Closed iliat closed 8 years ago

iliat commented 8 years ago

It implements sharded index writing and removes GBK stage in writing, relying on the read side sharding.

It exports a full genome BAM file of ~60GB in ~25min.

pgrosu commented 8 years ago

Hi Ilia,

Looks really nice, but I don't have time to test it now - maybe a little later. Just a few things in the meantime, in order to get the dependencies to the latest versions:

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L72 -> s/1.1.0/1.4.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L117 -> s/v1beta2-rev25-1.19.1/v1-rev56-1.21.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L130 -> s/v1beta2-0.36/v1beta2-0.39

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L194 -> s/1.128/2.1.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L210 -> s/3.0.0-beta-1/3.0.0-beta-2

A re-test after these changes might not be a bad thing to perform, just to be sure everything passes.

Let me know what you think.

Thanks, ~p

deflaux commented 8 years ago

@iliat this looks really awesome. LGTM

Great speed up! I assume mvn verify passed all integration tests.

Nice cleanup on the file names too. Do you think some of this code is useful outside of the context of dataflow? If so, some other time it would be nice to move it elsewhere.

jakeakopp commented 8 years ago

@iliat Everything I can understand looks good ;-)

iliat commented 8 years ago

@deflaux @pgrosu @jakeakopp Thanks for the review, I addressed most of the comments and will upload the version with these changes once I do a bit more validation beyond basic checks.

iliat commented 8 years ago

@deflaux - I can finally update the PR with a version that: 1) Addresses comment above 2) Fixes index generation bug and fixes a long lurking bug GCSSeekableStream 3) Adds a BAMDiff tool I used to compare BAM files exported with API and this method to ensure no reads are missed (it ignores unmapped ones for now).

pgrosu commented 8 years ago

Dude, there's some minor error in your JavaDoc - below is the link to the log for JDK8:

https://s3.amazonaws.com/archive.travis-ci.org/jobs/110838148/log.txt

Here's the start of the errors:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:javadoc (default-cli) on project google-genomics-dataflow: An error has occurred in JavaDocs report generation:
[ERROR] Exit code: 1 - /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:180: warning: no @param for log
[ERROR] public static void createIndex(SamReader reader, File output, Log log) {
[ERROR] ^
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:10: error: unexpected text
[ERROR] * @see https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] ^
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "58" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "47" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "47" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools
iliat commented 8 years ago

@pgrosu Yeah, on it :)

pgrosu commented 8 years ago

Ah, cool man :)

iliat commented 8 years ago

PTAL @deflaux

deflaux commented 8 years ago

LGTM @iliat Again, really nice work here.

Please file issues for stuff to be done in future PRs. Thanks!!!