pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
774 stars 274 forks

pysam.fetch on S3 Bucket #1215

Closed StephanHolgerD closed 1 year ago

StephanHolgerD commented 1 year ago

Hi, I want to report potentially problematic behaviour when using pysam.fetch on AWS S3 bucket infrastructure. Running the following pseudocode on a BAM file in an S3 bucket creates requests without a defined end of range.

Code

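import pysam

# bamfile_S3 / baifile_S3 are the S3 locations of the BAM file and its index
with pysam.AlignmentFile(bamfile_S3, filepath_index=baifile_S3) as f:
    for r in f.fetch(chrom, start, end):
        ...  # process read r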

Request

image

This kind of 'open' request results in high egress costs because AWS logs the whole file after the start byte as delivered, even if you stop reading at the end of your fetch coordinates.


For comparison, here are the requests IGV makes on S3 data (low egress costs; only the exact byte range is logged):

Request image
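
To make the difference between the two request shapes concrete, here is a minimal sketch of how many bytes each one asks the server for. The file size and offsets are made-up illustrative numbers, and `bytes_requested` is a hypothetical helper, not part of pysam or htslib. Whether S3 really accounts the full remainder of an aborted open-ended request as delivered is taken from the report above and is not verified here.

```python
from typing import Optional


def bytes_requested(file_size: int, start: int, end: Optional[int] = None) -> int:
    """Number of bytes covered by an HTTP Range request.

    `end` is inclusive, as in the HTTP Range header. end=None models an
    open-ended request ("bytes=START-"), which covers everything from
    `start` to the end of the object.
    """
    if not 0 <= start < file_size:
        raise ValueError("start must lie inside the file")
    if end is None:
        return file_size - start
    if end < start:
        raise ValueError("end must not be smaller than start")
    # A range running past the end of the object is clamped to the last byte.
    return min(end, file_size - 1) - start + 1


# Illustrative numbers only: a 50 GiB BAM and a 64 KiB block at offset 1 GiB.
FILE_SIZE = 50 * 1024**3
START = 1 * 1024**3
END = START + 64 * 1024 - 1

open_ended = bytes_requested(FILE_SIZE, START)  # no end offset
bounded = bytes_requested(FILE_SIZE, START, END)  # exact byte range

print(f"open-ended request covers {open_ended} bytes")
print(f"bounded request covers    {bounded} bytes")
```

With these assumed numbers the open-ended request covers 49 GiB while the bounded one covers 64 KiB, which is the gap that would show up in egress accounting if the whole open range is counted.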

jmarshall commented 1 year ago

As the User-Agent header suggests, pysam is simply using the wrapped htslib code to implement S3 access. That htslib code (in hfile_libcurl.c) still implements seeks using CURLOPT_RESUME_FROM_LARGE rather than CURLOPT_RANGE, which would be better placed to include an ending offset.

Please report this issue to htslib directly.
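
As a rough illustration of the distinction: for HTTP transfers, a libcurl resume offset goes out as an open-ended `Range: bytes=N-` header, whereas a range string of the form `N-M` also carries an end offset. The sketch below models only the resulting header values in plain Python. It does not call htslib or libcurl, and `resume_from_header`, `range_header`, and the URL are hypothetical names used for illustration.

```python
import urllib.request


def resume_from_header(offset: int) -> str:
    """Range header value equivalent to resuming from `offset` (no end offset)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return f"bytes={offset}-"


def range_header(first: int, last: int) -> str:
    """Range header value for an explicit byte range; `last` is inclusive."""
    if first < 0 or last < first:
        raise ValueError("need 0 <= first <= last")
    return f"bytes={first}-{last}"


# Hypothetical URL; the requests are only constructed here, never sent.
URL = "https://example-bucket.s3.amazonaws.com/sample.bam"

open_req = urllib.request.Request(
    URL, headers={"Range": resume_from_header(1_000_000)}
)
bounded_req = urllib.request.Request(
    URL, headers={"Range": range_header(1_000_000, 1_065_535)}
)

print(open_req.get_header("Range"))
print(bounded_req.get_header("Range"))
```

The second form is the kind of request the IGV comparison above shows: the server knows up front exactly which bytes are wanted.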

StephanHolgerD commented 1 year ago

OK, so does it make sense that pysam.view creates clean range requests?

I thought pysam uses the same lib as samtools (htslib).

jmarshall commented 1 year ago

Pysam (both fetch and view) uses the same library as samtools. This is why you should report this issue to the library where the problem is, namely htslib.

StephanHolgerD commented 1 year ago

OK. I was wondering because pysam.view produces a clean range request.

jmarshall commented 1 year ago

Please report this problem to htslib.