samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

register bgzip in IANA media types? #734

Closed vidboda closed 10 months ago

vidboda commented 1 year ago

Hi,

if you check these issues:

you will see that some frameworks, such as Flask, treat bgzipped files as plain gzip files (in Flask's case based on Python's mimetypes module) and set an improper header as a result.
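As a minimal illustration of that behaviour, the standard-library `mimetypes` module reports a `gzip` encoding for any name ending in `.gz`, whether the file is plain gzip or bgzip. The file name below is made up for the example, and how a given framework turns this guess into a response header is not shown here.

```python
import mimetypes


def guessed_encoding(filename):
    """Return only the content encoding that mimetypes infers from the name."""
    _media_type, encoding = mimetypes.guess_type(filename)
    return encoding


# Hypothetical bgzip-compressed VCF; the guess is based on the suffix alone,
# so the module cannot tell bgzip from ordinary gzip.
encoding = guessed_encoding("sample.vcf.gz")
print(encoding)
```

The guess depends only on the file name, which is why a bgzipped file ends up labelled like any other gzip file.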

This is annoying when using a popular tool such as igv.js, because the bgzip file coming from Flask is not handled correctly by the browser.

To summarize, a bgzip file served by Flask (through its own dev server or through a WSGI server, e.g. using mod_wsgi) reaches the web browser with a 'content-encoding': gzip header. This causes the browser to incorrectly decompress the file, and igv.js is then unable to handle it.

Indeed, bgzip is not listed in IANA media types.

Shouldn't it be, so that bgzip files are no longer treated as regular gzip files?

jrobinso commented 12 months ago

Just to add I've seen this problem with BAM files on some Apache servers, including data servers at the Broad Institute. So it is not exclusively a Flask issue.

jmarshall commented 12 months ago

GA4GH has been working towards registering media types for various bioinformatics file formats. This is tracked in ga4gh/TASC#36. The focus has been on actual file formats such as BAM (which is compressed as BGZF, so is a particular kind of BGZF) and CRAM (which has its own compression and does not present as a BGZF file or anything else that would be currently recognised by a general-purpose mimetype sniffer), and also the SAM and VCF flavours of text, etc.
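To make "a particular kind of BGZF" concrete, here is a rough sketch of how a sniffer might tell a BGZF block from ordinary gzip. It relies on the BGZF layout from the SAM/BAM specification: a gzip member with the FEXTRA flag set and a "BC" extra subfield of length 2 holding the block size minus one. The function names `make_bgzf_block` and `is_bgzf` are invented for this example and are not part of htslib or any other library.

```python
import gzip
import struct
import zlib


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block by hand (payload must be well under 64 KiB)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = compressor.compress(payload) + compressor.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1  # total block size minus 1
    # ID1, ID2, CM=8, FLG=4 (FEXTRA), MTIME, XFL, OS, XLEN=6
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)  # 'B', 'C', SLEN, BSIZE
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def is_bgzf(data: bytes) -> bool:
    """Return True if data starts with a gzip header carrying a 'BC' subfield."""
    if len(data) < 12 or data[:3] != b"\x1f\x8b\x08":
        return False
    if not data[3] & 0x04:  # FEXTRA bit not set: plain gzip at best
        return False
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False


block = make_bgzf_block(b"chr1\t100\n")
print(is_bgzf(block), is_bgzf(gzip.compress(b"chr1\t100\n")))
print(gzip.decompress(block))  # a generic gzip reader still accepts the block
```

The last line shows why a generic sniffer only sees gzip: the block is a valid gzip member, and only the extra field marks it as BGZF.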

We have not considered BGZF for registration, as it is not really a concrete bioinformatics format itself but simply a compression method for use by other formats. But it is something that should be considered in some way, so thanks for highlighting this.

BGZF is a compatible flavour of GZIP (files use the same GZIP headers with the FEXTRA bit set and associated optional fields populated), so it is correctly decompressed by gunzip. So in the absence of other information, web servers are not incorrect to serve it with headers indicating that it is gzip compressed. However, applications usually do not want the underlying web client to decompress it for them, as they wish to jump around the compressed data stream themselves. IMHO the correct way to prevent general-purpose web clients from doing this is probably for bioinformatics data servers to prevent their underlying server infrastructure from adding non-specific content encoding headers, e.g. as suggested by the OP in the first linked issue.
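One possible shape for such a server-side fix, sketched as plain WSGI middleware using only the standard library. This is an assumption about how it could be done, not the approach from the linked issue: the suffix list, `strip_gzip_encoding`, `demo_app`, and `response_headers` are all hypothetical names made up for the example.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical suffixes that a data server might want passed through as-is.
BGZF_SUFFIXES = (".gz", ".bgz", ".bam")


def strip_gzip_encoding(app):
    """Wrap a WSGI app and drop Content-Encoding for BGZF-like paths."""
    def wrapper(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def filtered_start(status, headers, exc_info=None):
            if path.endswith(BGZF_SUFFIXES):
                headers = [(k, v) for k, v in headers
                           if k.lower() != "content-encoding"]
            return start_response(status, headers, exc_info)

        return app(environ, filtered_start)
    return wrapper


def demo_app(environ, start_response):
    """Stand-in for a framework that labels every response as gzip-encoded."""
    start_response("200 OK", [("Content-Type", "application/octet-stream"),
                              ("Content-Encoding", "gzip")])
    return [b""]


def response_headers(app, path):
    """Call a WSGI app for the given path and return its headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    b"".join(app(environ, start_response))
    return captured["headers"]


wrapped = strip_gzip_encoding(demo_app)
print(response_headers(wrapped, "/data/sample.vcf.gz"))
print(response_headers(wrapped, "/index.html"))
```

Matching on suffix is a blunt choice: it would also affect ordinary `.gz` files that a site does want the browser to decompress, so a real deployment would probably scope it to the data directories.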

jrobinso commented 12 months ago

Thanks for the explanation @jmarshall. It is not common for servers to add this header; if it were, web-served tabix files would cease to work in most or possibly all bioinformatics web applications. It's not practical to rely on the browser to correctly decompress a slice of a bgzipped file. One could argue "tabix" is a format with a flexible column structure.
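A small sketch of why the application needs the compressed bytes untouched. Plain gzip members produced by `gzip.compress` stand in for BGZF blocks here (an assumption made for brevity), and the records and the `read_member` helper are invented for the example. The point is that an index of compressed offsets is only useful if the client receives the stream exactly as stored.

```python
import gzip
import zlib

# Three made-up records, each compressed as its own gzip member.
records = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr2\t50\tG\n"]
members = [gzip.compress(r) for r in records]
blob = b"".join(members)

# Compressed start offset of each member, playing the role of an index.
offsets = []
position = 0
for member in members:
    offsets.append(position)
    position += len(member)


def read_member(data: bytes, offset: int) -> bytes:
    """Decompress the single gzip member that starts at a compressed offset."""
    decompressor = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header
    return decompressor.decompress(data[offset:])


# Seeking straight to the second member works on the stored bytes...
print(read_member(blob, offsets[1]))

# ...but after transparent decompression the same offset means something else.
inflated = gzip.decompress(blob)
print(inflated[offsets[1]:offsets[1] + 12])
```

Once a client has inflated the stream on the application's behalf, the offsets from the index no longer point at block boundaries, which is the failure described above.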

jkbonfield commented 12 months ago

Moved to hts-specs as this isn't a samtools or htslib issue.

jmarshall commented 10 months ago

Thanks for the suggestion. These registrations are being considered by TASC, so I have linked to this discussion on TASC's issue. We'll close this issue here, but expect that TASC will refer to this discussion.