Closed iliat closed 8 years ago
Hi Ilia,
Looks really nice, but I don't have time to test it now - maybe a little later. Just a few things in the meantime, in order to get the dependencies to the latest versions:
https://github.com/iliat/dataflow-java/blob/master/pom.xml#L72 -> s/1.1.0/1.4.0
https://github.com/iliat/dataflow-java/blob/master/pom.xml#L117 -> s/v1beta2-rev25-1.19.1/v1-rev56-1.21.0
https://github.com/iliat/dataflow-java/blob/master/pom.xml#L130 -> s/v1beta2-0.36/v1beta2-0.39
https://github.com/iliat/dataflow-java/blob/master/pom.xml#L194 -> s/1.128/2.1.0
https://github.com/iliat/dataflow-java/blob/master/pom.xml#L210 -> s/3.0.0-beta-1/3.0.0-beta-2
A re-test after these changes might not be a bad thing to perform, just to be sure everything passes.
Let me know what you think.
Thanks, ~p
@iliat this looks really awesome. LGTM
Great speed up! I assume mvn verify passed all integration tests.
Nice cleanup on the file names too. Do you think some of this code is useful outside of the context of dataflow? If so, some other time it would be nice to move it elsewhere.
@iliat Everything I can understand looks good ;-)
@deflaux @pgrosu @jakeakopp Thanks for the review, I addressed most of the comments and will upload the version with these changes once I do a bit more validation beyond basic checks.
@deflaux - I can finally update the PR with a version that: 1) Addresses comment above 2) Fixes index generation bug and fixes a long lurking bug GCSSeekableStream 3) Adds a BAMDiff tool I used to compare BAM files exported with API and this method to ensure no reads are missed (it ignores unmapped ones for now).
Dude, there's some minor error in your JavaDoc - below is the link to the log for JDK8:
https://s3.amazonaws.com/archive.travis-ci.org/jobs/110838148/log.txt
Here's the start of the errors:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:javadoc (default-cli) on project google-genomics-dataflow: An error has occurred in JavaDocs report generation:
[ERROR] Exit code: 1 - /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:180: warning: no @param for log
[ERROR] public static void createIndex(SamReader reader, File output, Log log) {
[ERROR] ^
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:10: error: unexpected text
[ERROR] * @see https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] ^
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "58" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "47" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "47" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools
@pgrosu Yeah, on it :)
Ah, cool man :)
PTAL @deflaux
LGTM @iliat Again, really nice work here.
Please file issues for stuff to be done in future PRs. Thanks!!!
It implements sharded index writing and removes GBK stage in writing, relying on the read side sharding.
It exports a full genome BAM file of ~60GB in ~25min.