samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Spec needs to include the accepted file extensions for the index files #215

Open yfarjoun opened 7 years ago

yfarjoun commented 7 years ago

In the wild, both .bam.bai and .bai are used as index files for .bam, and both .cram.crai and .crai as index files for .cram. On top of that, .bai (and possibly .cram.bai) files are valid index files for CRAM. Needless to say, this introduces significant overhead for programs that need to look for the index files, and possible disagreement between implementations when multiple valid index files are found (different implementations might search in a different order).

I suggest that we include in the specification a naming convention for the index files.
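The lookup burden described above can be sketched roughly as follows. This is a minimal illustration, not how htslib or htsjdk actually do it: the function name `candidate_index_paths` and the suffix table are hypothetical, and the probe order is arbitrary, which is exactly the problem being raised.

```python
from pathlib import PurePosixPath

# Index suffixes mentioned in this issue, keyed by alignment file extension.
# Hypothetical table for illustration; the order within each list is arbitrary.
INDEX_SUFFIXES = {
    ".bam": [".bai"],
    ".cram": [".crai", ".bai"],
}


def candidate_index_paths(alignment_path):
    """Return every index filename a reader might have to probe."""
    path = PurePosixPath(alignment_path)
    candidates = []
    for idx in INDEX_SUFFIXES.get(path.suffix, []):
        candidates.append(str(path) + idx)             # appended: foo.bam.bai
        candidates.append(str(path.with_suffix(idx)))  # replaced: foo.bai
    return candidates


print(candidate_index_paths("foo.bam"))
print(candidate_index_paths("foo.cram"))
```

A .bam file yields two candidates and a .cram file yields four, so two tools that walk the list in different orders can pick different index files for the same input.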

jkbonfield commented 7 years ago

What are you considering doing? Stating one thing as canonical and changing either htslib or htsjdk so new files are generated adhering to the updated specification, but reading old files by checking both paths?

If so it sounds sensible, but my vote would be to go with the names used by the original author ;-)

yfarjoun commented 7 years ago

ouch.

I was hoping that we could use this issue to nail down the proposal and have the "original authors" chime in. Then I'll be happy to formalize it in a PR.

The next comment will include a table. Feel free to edit it (and note who changed it last).

yfarjoun commented 7 years ago

| File Type | Main File Extension | Index File Extension | Last touch + comments |
|-----------|---------------------|----------------------|-----------------------|
| Sam | .sam | N/A | @yfarjoun |
| Bam | .bam | .bai | @yfarjoun |
| Cram | .cram | .cram.crai | @yfarjoun (though really??? shouldn't it be .crai?) |

jkbonfield commented 7 years ago

If we want to nail it down, then IMO we should nail it down to the original filenames supported by both early implementations and revert this picard commit which caused the disparity in the first place!

https://github.com/samtools/htsjdk/commit/7459fbacda9312b28eb4a22200ced530cb8a3297

jmarshall commented 7 years ago

Surely this horse bolted many years ago for .bai vs .bam.bai. OTOH there may be hope for a single canonical filename for a CRAI index.

tfenne commented 7 years ago

I agree with @jmarshall - I think the only reasonable thing to do for BAM is to document that both foo.bai and foo.bam.bai are valid index names for foo.bam. And maybe state a preference going forward, though that may be contentious. I for one read ".bam.bai" as "bam bam index" and dislike it as much as "ATM machine".

jkbonfield commented 7 years ago

Agreed we shouldn't be forcing anything and obviously existing software now has to check both filenames as either can exist, but specifying a preference isn't a bad idea. I wouldn't argue for making the same mistake with CRAM though. If both implementations right now look in .cram.crai then leave well alone!
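As a rough sketch of what "check both filenames, with a stated preference" could look like in a reader: the preference order below is an assumption made purely for illustration (nothing in this thread settles one), and `find_index` is a hypothetical helper, not an htslib or htsjdk function.

```python
import os

# Assumed preference order, for illustration only: the appended form first,
# then the replaced form. {path} is the full alignment path, {stem} is the
# path without its final extension.
PREFERRED = {
    ".bam": ["{path}.bai", "{stem}.bai"],
    ".cram": ["{path}.crai", "{stem}.crai"],
}


def find_index(alignment_path, exists=os.path.exists):
    """Return (chosen_index_or_None, other_indexes_that_also_exist)."""
    stem, ext = os.path.splitext(alignment_path)
    templates = PREFERRED.get(ext.lower(), [])
    candidates = [t.format(path=alignment_path, stem=stem) for t in templates]
    found = [p for p in candidates if exists(p)]
    chosen = found[0] if found else None
    return chosen, found[1:]
```

Returning the non-chosen matches lets a caller warn when both foo.bam.bai and foo.bai are present, which is the case where implementations with different search orders would otherwise silently disagree. The `exists` parameter is only there so the lookup can be exercised without touching the filesystem.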