samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

MIME types for all file formats formally registered with IANA #407

Open MattOates opened 5 years ago

MattOates commented 5 years ago

It would be amazing if all of the file types specified here, such as VCF v4.2, had a formally defined MIME type registered with IANA for use in interchange. HTSGet JSON payloads already have something like application/vnd.ga4gh.htsget.v1.1.1+json, for example. I am unsure who should be the formal vendor for VCF etc. Given that the de facto spec comes from samtools, it could be Genome Research Ltd, but is there really a reason not to have GA4GH be the holder of these?

I have found this issue frustrating enough, over easily the last 10 years, that I am happy to prepare an application following all the RFCs for someone else to submit more formally. It's worth noting that even the HTSGet MIME type doesn't appear to be registered with IANA.

jmarshall commented 5 years ago

Yes indeed, this would be good.

I got started on preparing the applications a while back. Perhaps I can dig out those drafts and we can collaborate on getting them done. Do you have experience of doing this previously?

MattOates commented 5 years ago

@jmarshall zero experience :D But what to put together is at least written down in all the RFCs. I'm tired of having VCFs open as vCards in Windows, and all the other horror. But even just for writing formal APIs, it's kind of important to be able to say that one expects, and will send, formally a VCF 4.2 in UTF-8 encoding and nothing else. Entirely happy to collaborate on this. I've considered doing it several times before, but just assumed it would get done by someone else.

jmarshall commented 5 years ago

vCard's use of the .vcf extension pre-dates genomics' VCF format by about 15 years, so I think you're mostly out of luck opening them in Windows unless you change the extension association yourself locally…

MattOates commented 5 years ago

Sure, but if an email or website gave the correct MIME Content-Type header then it would actually trigger the right application, or at least have a better chance. That there is no universally agreed MIME type means that user applications cannot register for it in things like Windows. That's the nice part of seeing this done. If you just respond with text/plain and a file name ending in .vcf then we can't avoid the vCard scenario.

Worse, it's quite rare that someone wants their browser to render a 100MB whole-genome VCF (the default for text/subtype); they probably want it as a file. In the wild this has led to lots of people naively using application/octet-stream rather than specifying Content-Disposition headers. That is in many ways even worse, since most of the time it leaves the character encoding undefined. I've seen plenty of VCFs with mixed 8-bit and wide character encodings etc. It's basically a giant mess, but we could at least sort out the core problem and write up best practices for how bio data should be formally interchanged over the web.
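To make the header point concrete, here is a minimal sketch of the response headers a server could send so that a VCF is saved as a file with a declared character encoding. The media type `text/x-vcf` is a placeholder I made up for illustration, since nothing is registered with IANA yet, and the function name is hypothetical too.

```python
from __future__ import annotations

# ASSUMPTION: placeholder media type, NOT registered with IANA.
PLACEHOLDER_VCF_TYPE = "text/x-vcf"


def vcf_download_headers(filename: str, charset: str = "utf-8") -> dict[str, str]:
    """Build headers that declare the type and charset and ask for a file download."""
    # Replace characters that would break the quoted filename parameter.
    safe_name = filename.replace("\\", "_").replace('"', "_")
    return {
        # The charset parameter states the text encoding explicitly.
        "Content-Type": f"{PLACEHOLDER_VCF_TYPE}; charset={charset}",
        # "attachment" asks the browser to save the body rather than render it.
        "Content-Disposition": f'attachment; filename="{safe_name}"',
    }


headers = vcf_download_headers("sample.vcf")
```

With something like this, the text type keeps its charset and the Content-Disposition header, not application/octet-stream, is what makes the browser download the file.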

Perhaps I'm crazy and am the only person who's had these issues, but I somewhat doubt it. It's just that very few people, especially those behind one-shot academic resources, understand what happens outside of their software as the user/consuming experience. But right now, even if they did take great care, it's surprisingly hard to provide a good experience with existing web technologies/standards without some MIME type. For example, there is no reason to have bam/vcf/whatever-specific HTSGet endpoints via format= query strings; it should all be MIME in the headers of your requests and responses. That the spec isn't like this is an endemic issue in the field, where we all choose to subvert established best practices the web has defined for a real long time.
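As a rough sketch of what header-based selection could look like instead of a format= query string, the snippet below picks a response format from an Accept header. The media types are placeholders (none are registered), the function names are hypothetical, and the parsing is deliberately simplified: it honours q-values and `*/*` but ignores `type/*` ranges and specificity ordering.

```python
from __future__ import annotations

# ASSUMPTION: placeholder media types, NOT registered with IANA.
SUPPORTED = {
    "text/x-vcf": "VCF",
    "application/x-bam": "BAM",
}


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (media_type, q) pairs, highest q first; ties keep header order."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        ranges.append((media_type, q))
    # sorted() is stable, so equal weights stay in the order the client sent.
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def negotiate(header: str) -> str | None:
    """Pick the first supported media type the client accepts, or None."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return next(iter(SUPPORTED))  # fall back to the server's first type
    return None


choice = negotiate("application/x-bam;q=0.5, text/x-vcf")
```

A server returning None here would normally answer 406 Not Acceptable; the chosen type would then be echoed back in the response's Content-Type header.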

jfuerth commented 4 years ago

It would be great to see this happen. You are not alone in wishing there were official media types for these formats.

Do you think it would help to start with VCF only, since it has the greatest need to be differentiated from something else? The other formats could be registered later, benefitting from the lessons learned in the VCF registration process.

SamStudio8 commented 9 months ago

@jmarshall I don't suppose there is still any appetite for this?

jmarshall commented 9 months ago

There is indeed. This is being tracked as ga4gh/TASC#36 but as usual the limiting factor is someone finding the time to actually do the leg work.

photocyte commented 6 months ago

Hi @jmarshall, I seemingly got down the road of doing this on a whim (see https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/646), and only just now found this issue.

I've attached IANA Media Type submission templates that I drafted for a variety of plain-text formats used in bioinformatics, based on the existing gff3 IANA Media Type: https://www.iana.org/assignments/media-types/text/gff3

https://github.com/ga4gh/TASC/issues/36 seems to be considering mostly the binary file formats, i.e. .bam, .bcf. I hadn't considered those yet.

My reading of RFC 6838 (https://www.rfc-editor.org/rfc/rfc6838.html) wasn't that ga4gh (or samtools/hts-specs) had the sole power to submit these applications to IANA, even if they were the current "owner" or governor of the specifications. Here was my boilerplate to deal with this:

General Comments:
   Sequence Ontology is serving as the standards-related organization (per RFC 6838)
   for submission of this candidate Media Type into the Standards Tree.
   Sequence Ontology does not claim to be the "owner" of the file format.
   This submission is being performed as a public service, with the understanding
   that the true file format owner may assert ownership and change controller
   over the type at any time (per RFC 6838).
   To the best of our knowledge, the originator(s) of the file format is/are
   the contributors at https://github.com/samtools/hts-specs

Thoughts? I already have fasta and fastq done, and can perhaps submit those through Sequence Ontology and leave vcf, sam, etc. plus their binary variants to you folks?

Attachments: bed.txt, sam.txt, vcf.txt