samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

SAMFileWriterFactory creates .bai file when writing .cram file #1672

Open rickymagner opened 1 year ago

rickymagner commented 1 year ago

Description of the issue:

When SAMFileWriterFactory writes a .cram file with the "create index" option toggled on, it creates a .bai index file rather than a .crai. As a result, running e.g. gatk MergeSamFiles --CREATE_INDEX… with a .cram output leaves you with output.cram.bai instead of output.cram.crai.

Your environment:

Steps to reproduce

Run gatk MergeSamFiles as described above.

Expected behaviour

You should get a .crai file.

Actual behaviour

You get a .bai file.
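
To check which index actually landed next to an output file, something like the following can help. It is only a sketch using the Java standard library: `CompanionIndexCheck` and `existingIndexes` are hypothetical names, not part of htsjdk or GATK, and it assumes the index is named by appending `.bai` or `.crai` to the full output file name, as in the `output.cram.bai` case above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an htsjdk class): lists index files found beside an output file.
final class CompanionIndexCheck {

    // Assumption: indexes are named <output file name> + ".crai" or ".bai".
    private static final String[] INDEX_EXTENSIONS = {".crai", ".bai"};

    static List<Path> existingIndexes(Path output) {
        List<Path> found = new ArrayList<>();
        for (String ext : INDEX_EXTENSIONS) {
            // e.g. output.cram -> output.cram.crai, output.cram.bai
            Path candidate = output.resolveSibling(output.getFileName() + ext);
            if (Files.exists(candidate)) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (String arg : args) {
            Path output = Paths.get(arg);
            System.out.println(output + " -> " + existingIndexes(output));
        }
    }
}
```

Running it with the path of the written .cram file prints whichever of the two candidate index files exist.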


There are a few very old issues surrounding .crai files in the repo. According to this issue it seems like support was added for this but kept off for reasons discussed here. Perhaps it's too much to resurrect the project of getting these indices sorted out, but at the moment it seems GATK just silently puts out .cram.bai files because of this, which can be pretty confusing. I don't know enough about CRAM vs BAM to know how bad it might be to use one index for the other, but at least GATK seems to work just fine doing random access on CRAMs with the .bai file produced as described above. I'm also not sure whether this issue should be pushed up to GATK or kept down here in htsjdk. At the very least it'd be nice if the library could be updated to use the proper file extension for the index.

lbergelson commented 11 months ago

@rickymagner It's actually producing a bai index, not a crai. So it would be equally wrong to rename it to crai. It would be great to fix it to make a crai index but I think it's a bit of a project.
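
One way to see that the file extension and the file content are separate questions is to look at the first bytes of the index. The sketch below relies on two format facts: a BAI file begins with the magic bytes `BAI\1`, and a CRAI file is gzip-compressed text, so it begins with the gzip magic `1f 8b`. The class `IndexSniffer` and its methods are hypothetical, standard-library-only helpers, not part of htsjdk. A gzip header alone does not prove a file is a CRAI, hence the cautious enum name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not an htsjdk class): guesses an index format from its leading bytes.
final class IndexSniffer {

    enum Kind { BAI, GZIP_POSSIBLY_CRAI, UNKNOWN }

    // Classifies the leading bytes of a file.
    static Kind classify(byte[] head) {
        // BAI magic: 'B', 'A', 'I', 0x01
        if (head.length >= 4 && head[0] == 'B' && head[1] == 'A' && head[2] == 'I' && head[3] == 1) {
            return Kind.BAI;
        }
        // gzip magic: 0x1f, 0x8b (CRAI is gzipped text, but so are many other files)
        if (head.length >= 2 && (head[0] & 0xFF) == 0x1F && (head[1] & 0xFF) == 0x8B) {
            return Kind.GZIP_POSSIBLY_CRAI;
        }
        return Kind.UNKNOWN;
    }

    // Reads up to four bytes from the file and classifies them.
    static Kind classify(Path indexFile) throws IOException {
        byte[] head = new byte[4];
        int filled = 0;
        try (InputStream in = Files.newInputStream(indexFile)) {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // file shorter than four bytes
                }
                filled += n;
            }
        }
        return classify(Arrays.copyOf(head, filled));
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + "\t" + classify(Paths.get(arg)));
        }
    }
}
```

Given the behaviour described in this thread, the `output.cram.bai` file would be expected to classify as `BAI`, and renaming it to `.crai` would not change that result.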