mitre / fusera

A FUSE interface to the NCBI Sequence Read Archive (SRA)
Apache License 2.0

fusera and sracp need to both understand and handle gzipped data appropriately #81

Closed mattrbianchi closed 6 years ago

mattrbianchi commented 6 years ago

When using the ftp-ncbi region, the Name Resolver API gives FTP URLs. fusera and sracp can still use the HTTP protocol with these URLs, but the server handling the requests gzips the data, making it unreadable for tools that expect it in raw form.

Keeping it simple

The first solution is a bit primitive, but it seems like the server should respect an HTTP request that specifies an Accept-Encoding of identity, which would prevent it from sending gzipped data in the first place. We would lose gzip over the wire, but it is still up in the air whether gzipped data would even be beneficial in our use case. With fusera, reads are likely to be small, which translates to small requests over the wire. With small byte ranges, it may take longer to gzip and decompress the data than to send it raw. With sracp, though, compression is more clearly advantageous: sracp wants the whole file, so gzip transport of large chunks of generally large files would be beneficial.

A more sophisticated approach

Another, more sophisticated and more difficult, solution would be to recognize that the data is gzipped (through the response headers, presumably Content-Encoding rather than Content-Type) and decompress it on the fly, so that sracp and fusera are none the wiser about the form the data takes over the wire. I do see some challenges in implementing this solution, though:

  1. When something reads a file through fusera, it gives the byte range it wants, and an HTTP request is then made for that range. If the data is gzipped, does the range refer to the raw data, which is then gzipped, or do we get a byte range of the gzipped data, which will expand to something larger than the range the read requested? In the latter case, what do we do with the extra bytes? Drop them? Hopefully the server gzips on the fly, so that the expanded bytes are exactly what was requested and this challenge goes away.
  2. Can we even decompress gzipped data on the fly? I imagine so, but I believe some compression formats need all the compressed data before they can decompress. This concern is born more of my ignorance of how gzip is implemented, so it might be a non-issue.