samtools / htslib

C library for high-throughput sequencing data formats
Other
803 stars 446 forks source link

slow access to s3 resources (initial 3-4s lag) #504

Open blajoie opened 7 years ago

blajoie commented 7 years ago

Hi,

We've noticed that samtools/htslib seems to have a small time penalty when making an initial connection to any s3 resource (local s3 or amazonaws s3).

e.g. 1-2 second delay samtools view "https://s3-us-west-1.amazonaws.com/ilmn.htslib-test/test.bam samtools view "https://s3-us-west-1.amazonaws.com/ilmn.htslib-test/test.bam" chr3:49000000-50000000

However a direct curl seems to begin streaming in < 1 second curl -XGET "https://s3-us-west-1.amazonaws.com/ilmn.htslib-test/test.bam" | samtools view

This time delay does not appear to be due to the initial download of the BAI, as it still happens after the index is cached, nor does it appear to be the result of the s3 auth (above is a public object)

Are there any sort of wait/timeout(s) happening during the initial call? Or something with libcurl?

Access times for local BAMs are close to instant (as expected). We (@nhjohnson) expect a small network related time delay, but is it possible something else is causing the lag?

Thanks !

jkbonfield commented 7 years ago

You can try strace -f -tT samtools view ... to show the time taken for each system call. That may help to show where any delays are happening.

I'm not aware of any timeouts in the code, but it has many layers involved so I may have missed something.

jkbonfield commented 7 years ago

I just did this myself and see ~2sec in select timeouts.

$ strace -o tr -ftT samtools view -H "https://s3-us-west-1.amazonaws.com/ilmn.htslib-test/test.bam" > /dev/null
$ fgrep 'select(4, [3]' tr
29431 15:21:48 select(4, [3], [], [], {0, 1000}) = 0 (Timeout) <0.001089>
29431 15:21:48 select(4, [3], [], [], {149, 991000}) = 1 (in [3], left {149, 843643}) <0.147418>
29431 15:21:48 select(4, [3], [], [], {149, 840000}) = 1 (in [3], left {149, 693125}) <0.146930>
29431 15:21:48 select(4, [3], [], [], {149, 686000}) = 1 (in [3], left {149, 538901}) <0.147188>
29431 15:21:48 select(4, [3], [], [], {149, 536000}) = 1 (in [3], left {149, 378097}) <0.157968>
29431 15:21:48 select(4, [3], [], [], {149, 377000}) = 1 (in [3], left {149, 376996}) <0.000107>
29431 15:21:48 select(4, [3], [], [], {149, 376000}) = 1 (in [3], left {149, 231022}) <0.145041>
29431 15:21:48 select(4, [3], [], [], {149, 230000}) = 1 (in [3], left {149, 229996}) <0.000121>
29431 15:21:48 select(4, [3], [], [], {149, 229000}) = 1 (in [3], left {149, 228996}) <0.000024>
29431 15:21:48 select(4, [3], [], [], {149, 228000}) = 1 (in [3], left {149, 83445}) <0.144622>
29431 15:21:49 select(4, [3], [], [], {0, 1000}) = 0 (Timeout) <0.001076>
29431 15:21:49 select(4, [3], [], [], {149, 998000}) = 1 (in [3], left {149, 850148}) <0.147911>
29431 15:21:49 select(4, [3], [], [], {149, 848000}) = 1 (in [3], left {149, 700669}) <0.147389>
29431 15:21:49 select(4, [3], [], [], {149, 695000}) = 1 (in [3], left {149, 546471}) <0.148588>
29431 15:21:49 select(4, [3], [], [], {149, 544000}) = 1 (in [3], left {149, 372074}) <0.171989>
29431 15:21:49 select(4, [3], [], [], {149, 370000}) = 1 (in [3], left {149, 369996}) <0.000110>
29431 15:21:49 select(4, [3], [], [], {0, 1000}) = 0 (Timeout) <0.001190>
29431 15:21:49 select(4, [3], [], [], {10, 0}) = 1 (in [3], left {9, 847421}) <0.152651>
29431 15:21:50 select(4, [3], [], [], {10, 0}) = 1 (in [3], left {9, 999996}) <0.000099>
29431 15:21:50 select(4, [3], [], [], {10, 0}) = 1 (in [3], left {9, 854190}) <0.145862>
29431 15:21:50 select(4, [3], [], [], {10, 0}) = 1 (in [3], left {9, 999997}) <0.000072>
29431 15:21:50 select(4, [3], [], [], {10, 0}) = 1 (in [3], left {9, 999998}) <0.000011>
29431 15:21:50 select(4, [3], [], [], {10, 0}) = 1 (in [3], left {9, 853681}) <0.146383>

This isn't actually using the s3 interface, but https instead - hence libcurl. There are timeouts mentioned in the htslib curl interface, eg in wait_perform, but they're all powers of 10 milliseconds and not the ~145ms we see here. I expect this is just some round-trip time to AWS and the aggregate delay comes through many round-trips. I'm not familier with the https protocols to understand why so many round-trips are used.

I see htslib is doing sendto "CONNECT" twice, presumably once for file format detection and once to download, but that doesn't explain the large number of subsequent select calls. Maybe it doesn't have a large enough buffer size. @jmarshall Do you remember what's going on here?