Open yfarjoun opened 7 years ago
What are you considering doing? Stating one thing as canonical and changing either htslib or htsjdk so new files are generated adhering to the updated specification, but reading old files by checking both paths?
If so it sounds sensible, but my vote would be to go with the names used by the original author ;-)
ouch.
I was hoping that we could use this issue to nail down the proposal and have the "original authors" chime in. Then I'll be happy to formalize it in a PR.
The next comment will include a table. Feel free to edit (and comment who changed last)
| File type | Main file extension | Index file extension | Last touch + comments |
|---|---|---|---|
| SAM | .sam | N/A | @yfarjoun |
| BAM | .bam | .bai | @yfarjoun |
| CRAM | .cram | .cram.crai | @yfarjoun (though really??? shouldn't it be .crai?) |
If we want to nail it down, then IMO we should nail it down to the original filenames supported by both early implementations and revert this picard commit which caused the disparity in the first place!
https://github.com/samtools/htsjdk/commit/7459fbacda9312b28eb4a22200ced530cb8a3297
Surely this horse bolted many years ago for .bai vs .bam.bai. OTOH there may be hope for a single canonical filename for a CRAI index.
I agree with @jmarshall - I think the only reasonable thing to do for BAM is to document that both `foo.bai` and `foo.bam.bai` are valid index names for `foo.bam`. And maybe state a preference going forward, though that may be contentious. I for one read ".bam.bai" as "bam bam index" and dislike it as much as "ATM machine".
Agreed we shouldn't be forcing anything and obviously existing software now has to check both filenames as either can exist, but specifying a preference isn't a bad idea. I wouldn't argue for making the same mistake with CRAM though. If both implementations right now look in .cram.crai then leave well alone!
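To make the "check both filenames" point concrete, here is a minimal Python sketch of such a lookup. The function names are hypothetical, and the order in which the two BAM names are tried is an assumption made purely for illustration; neither htslib nor htsjdk is claimed to behave this way.

```python
from pathlib import Path
from typing import List, Optional


def candidate_index_paths(data_path: str) -> List[Path]:
    """Return the index file names to try for a BAM or CRAM file.

    The order is an illustrative assumption, not a rule from the spec.
    """
    p = Path(data_path)
    suffix = p.suffix.lower()
    if suffix == ".bam":
        # foo.bam -> foo.bam.bai, then foo.bai (both names exist in practice)
        return [Path(str(p) + ".bai"), p.with_suffix(".bai")]
    if suffix == ".cram":
        # foo.cram -> foo.cram.crai only, as suggested in the comment above
        return [Path(str(p) + ".crai")]
    # SAM and unknown types have no index in this sketch
    return []


def find_index(data_path: str) -> Optional[Path]:
    """Return the first candidate index that exists on disk, or None."""
    for candidate in candidate_index_paths(data_path):
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_index("foo.bam")` returns whichever of the two names is found first, so a stated preference in the specification would amount to fixing the order of that candidate list.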
In the wild there are .bam.bai and .bai index files for .bam files, as well as .cram.crai and .crai index files for .cram files. On top of that, .bai (and possibly .cram.bai) files are valid index files for CRAM. Needless to say, this introduces significant overhead for programs that need to locate index files, and implementations may disagree when multiple valid index files are found, since each might search in a different order.
I suggest that we include in the specification a naming convention for the index files.
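As a rough illustration of the ambiguity described here, the following Python sketch lists every index file that exists for a given data file and flags the case where more than one is present. The suffix table is taken from the names mentioned in this issue (including the "possibly" valid .cram.bai); the function names are hypothetical and this is not how any particular implementation works.

```python
from pathlib import Path
from typing import Dict, List

# Index suffixes mentioned in this issue, keyed by data file extension.
# Each suffix is tried both appended (foo.bam.bai) and substituted (foo.bai).
INDEX_SUFFIXES: Dict[str, List[str]] = {
    ".bam": [".bai"],
    ".cram": [".crai", ".bai"],  # .bai for CRAM is reported above, not verified here
}


def existing_indexes(data_path: str) -> List[Path]:
    """Return every index file that exists for data_path."""
    p = Path(data_path)
    found: List[Path] = []
    for suffix in INDEX_SUFFIXES.get(p.suffix.lower(), []):
        for candidate in (Path(str(p) + suffix), p.with_suffix(suffix)):
            if candidate.is_file():
                found.append(candidate)
    return found


def is_ambiguous(data_path: str) -> bool:
    """True when more than one index exists, so search order decides the result."""
    return len(existing_indexes(data_path)) > 1
```

With up to four candidate names for a single CRAM file, two tools that probe in different orders can end up using different indexes, which is the disagreement a naming convention would remove.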