Open jrandall opened 7 years ago
Relevant code is here: https://github.com/samtools/htsjdk/blob/fbba5364e1809de071bc479f30e4e2c8b17f5bbe/src/main/java/htsjdk/variant/variantcontext/writer/VariantContextWriterFactory.java#L252
Possibly the solution could be changing .contains()
to .endsWith()
but I'm not sure why that wasn't done in the first place. Are there some BCF files that are expected to have a suffix beyond .bcf
?
Likely it was there to support .bcf.gz
, but the variant package is a bit mysterious to me.
That class is deprecated, and actually I did a PR to remove it (https://github.com/samtools/htsjdk/pull/756). Does this happen when using VariantContextWriterBuilder
, @jrandall?
Checking the no-deprecated class, this should not happen (see this: https://github.com/samtools/htsjdk/blob/fbba5364e1809de071bc479f30e4e2c8b17f5bbe/src/main/java/htsjdk/variant/variantcontext/writer/VariantContextWriterBuilder.java#L498).
@magicDGS I don't use htsjdk as a developer - I ran into this problem using the GATK and tracked it down to this class. It looks like the GATK is still using the deprecated class: https://github.com/broadgsa/gatk-protected/blob/f185a75e1c49fb4c039511e61254da0509833ee9/public/gatk-engine/src/main/java/org/broadinstitute/gatk/engine/io/storage/VariantContextWriterStorage.java#L139
@vdauwera given this ultimately shows up in GATK, you may want to take a look.
@ronlevine Could you please check if we're still using this in the dev snapshot? If it's an easy change I'd love to get this fixed for the 3.7 release, even if it means delaying a day or two.
It's in the dev snapshot. So, the logic should be changed from:
location.getName().endsWith(".bcf")
to
endsWith(".bcf.gz") || endsWith(".bcf")
to make it more flexible.
@ronlevine To be clear I think we would want to update to using the newer class mentioned above -- especially given @magicDGS 's PR to remove the old class entirely -- though perhaps the new class needs to be updated to allow bcf.gz.
@vdauwera The logic snippet pertains to the non-deprecated class https://github.com/samtools/htsjdk/blob/fbba5364e1809de071bc479f30e4e2c8b17f5bbe/src/main/java/htsjdk/variant/variantcontext/writer/VariantContextWriterBuilder.java#L498
Oh ok
In GATK's VariantContextWriterStorage, replace VariantContextWriterFactory
with the appropriateVariantContextWriterBuilder
logic. Note that the aforementioned HTSJDK modification in my previous comment is necessary for handling the .bcf.gz
case.
On second thought, there should be no need to make a change in HTSJDK since any file ending in .gz, .gzip, .bgz, .bgzf
is handled as compressed VCF
On second thought, there should be no need to make a change in HTSJDK since any file ending in .gz, .gzip, .bgz, .bgzf is handled as compressed VCF
Is that actually desired behavior? Doesn't that mean we can't write a bcf.gz, unless there is logic within the "compressed VCF" handling to allow the BCF case?
We can not write a bcf.gz. This is not the desired behavior, It should be a compressed BCF file. Instead bcf.gz
is output as a bcf
. There's a similar forum issue for reading bcf.gz files. Removing file.getName().endsWith(".bcf.gz")
from https://github.com/broadinstitute/gsa-unstable/blob/master/public/gatk-engine/src/main/java/org/broadinstitute/gatk/engine/io/stubs/VariantContextWriterStub.java#L219 will not force the file to be a BCF and will produce a compressed VCF.
To investigate this behavior, I created a test on the branch rhl_validate_bcf.gz
that checks that the output file is a compressed BCF.
In order to output a bcf.gz
, the outStreamFromFile
in VariantContextWriter#build() must be converted to a BCF stream for BlockCompressedOutputStream
.
@yfarjoun Do you have any comments on this one? If not, I think this should be closed.
have we resolved the issue? I think that the issue is not an HTSJDK issue, but rather a GATK issue that was solved...if that's the case, I'm happy for this issue to be closed...but let's ask the OP: @jrandall ?
Subject of the issue
Output filenames containing the string ".bcf" anywhere in the filename result in BCF output when VCF output is desired. For example, if one has filenames containing an md5 offset by dots, and an md5 happens to start with "bcf" then you can wind up with unexpected output format.
Expected behaviour
An output filename such as
library.bcfce04b6b0fba5bd286ed9a1c9ebfe1.g.vcf.gz
should produce VCF output.Actual behaviour
An output filename such as
library.bcfce04b6b0fba5bd286ed9a1c9ebfe1.g.vcf.gz
produces BCF output.