samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

Enable subreading of bam file for htsget #1602

Closed nanjiangshu closed 1 year ago

nanjiangshu commented 1 year ago

This PR fixes the problem to subread the BAM file (with index file) for htsget

Description of the problem When trying to subread the BAM file with htsget protocol, e.g.

samtools view  http://localhost:3000/reads/s3/NA12878_20k_b37.bam 1:10000-12000

where http://localhost:3000/reads/s3/NA12878_20k_b37.bam is the url for the BAM file according to htsget-refserver While one is expecting reads covering the region 10000 to 12000 of reference 1, however, the following error will occur for the above command

[E::idx_test_and_fetch] Format of index file 'http://localhost:3000/reads/s3/NA12878_20k_b37.bam.bai' is not supported
[E::idx_find_and_load] Could not retrieve index file for 'http://localhost:3000/reads/s3/NA12878_20k_b37.bam'
samtools view: Random alignment retrieval only works for indexed SAM.gz, BAM or CRAM files.

Changes made When the provided filename is an htsget url, such as http://localhost:3000/reads/s3/NA12878_20k_b37.bam, the returned value of HTTP GET is a JSON body of the format

{"htsget":{"format":"BAM","urls":[{"url":"http://s3-archive:9000/archive/NA12878_20k_b37.bam","headers":{"Range":"bytes=0-8820569"}}]}}

With the change of this PR, the code will extract the actual url of the BAM file, i.e. http://s3-archive:9000/archive/NA12878_20k_b37.bam in this example and use this url for further processing. If the provided filename is not an htsget url, no changes will be made.

daviesrob commented 1 year ago

Unfortunately htsget doesn't work this way. In particular, there's no guarantee that the htsget response will list a single url, or that the URL will be a complete BAM or CRAM file, and there should be no expectation that the file has an index associated with it.

If you want to make htsget work correctly with iterators (which has been on our to-do list for rather too long now), you need to create a new htsget request with additional referenceName, start and end parameters to select the region. The new request may also be a POST one instead of GET, if the server supports POST and you want to retrieve multiple regions. Unfortunately getting this to work will require some tricky changes to the iterator code, which is why it hasn't happened yet...