samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
285 stars 243 forks source link

Output filenames containing the string ".bcf" result in BCF output instead of VCF #761

Open jrandall opened 7 years ago

jrandall commented 7 years ago

Subject of the issue

Output filenames containing the string ".bcf" anywhere in the filename result in BCF output when VCF output is desired. For example, if one has filenames containing an md5 offset by dots, and an md5 happens to start with "bcf" then you can wind up with unexpected output format.

Expected behaviour

An output filename such as library.bcfce04b6b0fba5bd286ed9a1c9ebfe1.g.vcf.gz should produce VCF output.

Actual behaviour

An output filename such as library.bcfce04b6b0fba5bd286ed9a1c9ebfe1.g.vcf.gz produces BCF output.

jrandall commented 7 years ago

Relevant code is here: https://github.com/samtools/htsjdk/blob/fbba5364e1809de071bc479f30e4e2c8b17f5bbe/src/main/java/htsjdk/variant/variantcontext/writer/VariantContextWriterFactory.java#L252

Possibly the solution could be changing .contains() to .endsWith() but I'm not sure why that wasn't done in the first place. Are there some BCF files that are expected to have a suffix beyond .bcf ?

nh13 commented 7 years ago

Likely it was there to support .bcf.gz, but the variant package is a bit mysterious to me.

magicDGS commented 7 years ago

That class is deprecated, and actually I did a PR to remove it (https://github.com/samtools/htsjdk/pull/756). Does this happen when using VariantContextWriterBuilder, @jrandall?

magicDGS commented 7 years ago

Checking the no-deprecated class, this should not happen (see this: https://github.com/samtools/htsjdk/blob/fbba5364e1809de071bc479f30e4e2c8b17f5bbe/src/main/java/htsjdk/variant/variantcontext/writer/VariantContextWriterBuilder.java#L498).

jrandall commented 7 years ago

@magicDGS I don't use htsjdk as a developer - I ran into this problem using the GATK and tracked it down to this class. It looks like the GATK is still using the deprecated class: https://github.com/broadgsa/gatk-protected/blob/f185a75e1c49fb4c039511e61254da0509833ee9/public/gatk-engine/src/main/java/org/broadinstitute/gatk/engine/io/storage/VariantContextWriterStorage.java#L139

nh13 commented 7 years ago

@vdauwera given this ultimately shows up in GATK, you may want to take a look.

vdauwera commented 7 years ago

@ronlevine Could you please check if we're still using this in the dev snapshot? If it's an easy change I'd love to get this fixed for the 3.7 release, even if it means delaying a day or two.

ronlevine commented 7 years ago

It's in the dev snapshot. So, the logic should be changed from: location.getName().endsWith(".bcf") to endsWith(".bcf.gz") || endsWith(".bcf") to make it more flexible.

vdauwera commented 7 years ago

@ronlevine To be clear I think we would want to update to using the newer class mentioned above -- especially given @magicDGS 's PR to remove the old class entirely -- though perhaps the new class needs to be updated to allow bcf.gz.

ronlevine commented 7 years ago

@vdauwera The logic snippet pertains to the non-deprecated class https://github.com/samtools/htsjdk/blob/fbba5364e1809de071bc479f30e4e2c8b17f5bbe/src/main/java/htsjdk/variant/variantcontext/writer/VariantContextWriterBuilder.java#L498

vdauwera commented 7 years ago

Oh ok

ronlevine commented 7 years ago

In GATK's VariantContextWriterStorage, replace VariantContextWriterFactory with the appropriateVariantContextWriterBuilder logic. Note that the aforementioned HTSJDK modification in my previous comment is necessary for handling the .bcf.gz case.

ronlevine commented 7 years ago

On second thought, there should be no need to make a change in HTSJDK since any file ending in .gz, .gzip, .bgz, .bgzf is handled as compressed VCF

vdauwera commented 7 years ago

On second thought, there should be no need to make a change in HTSJDK since any file ending in .gz, .gzip, .bgz, .bgzf is handled as compressed VCF

Is that actually desired behavior? Doesn't that mean we can't write a bcf.gz, unless there is logic within the "compressed VCF" handling to allow the BCF case?

ronlevine commented 7 years ago

We can not write a bcf.gz. This is not the desired behavior, It should be a compressed BCF file. Instead bcf.gz is output as a bcf. There's a similar forum issue for reading bcf.gz files. Removing file.getName().endsWith(".bcf.gz") from https://github.com/broadinstitute/gsa-unstable/blob/master/public/gatk-engine/src/main/java/org/broadinstitute/gatk/engine/io/stubs/VariantContextWriterStub.java#L219 will not force the file to be a BCF and will produce a compressed VCF. To investigate this behavior, I created a test on the branch rhl_validate_bcf.gz that checks that the output file is a compressed BCF.

In order to output a bcf.gz, the outStreamFromFile in VariantContextWriter#build() must be converted to a BCF stream for BlockCompressedOutputStream.

ronlevine commented 7 years ago

@yfarjoun Do you have any comments on this one? If not, I think this should be closed.

yfarjoun commented 7 years ago

have we resolved the issue? I think that the issue is not an HTSJDK issue, but rather a GATK issue that was solved...if that's the case, I'm happy for this issue to be closed...but let's ask the OP: @jrandall ?