samtools / htslib

C library for high-throughput sequencing data formats

wishlist s3 cache #1748

Closed cariaso closed 4 months ago

cariaso commented 4 months ago

htslib is awesome, and I feel embarrassed and greedy wishing for more, but I have a workflow where I see potential for a big win, and I'm sure I'm not the only one who would benefit.

Consider https://requests-cache.readthedocs.io/en/stable/, a transparent drop-in replacement for requests that does local caching.

Now imagine how many of us are pulling the same 0.01% of the same S3 file over and over again during development, and never need the other 99.99% of the BAM/CRAM.

If there were an environment variable, or some other config option, that allowed htslib-s3-plugin to write to a local cache and to read from it when available, it seems like that would pay off quite well for a presumably common use case.

Even if it isn't used in every code path, a partial win would still pay off handsomely.

It seems sufficiently adjacent to https://github.com/samtools/htslib/issues/1670 that perhaps, during that work, it would be possible to leave some notes about the relevant code paths.
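
To make the request concrete, here is a minimal sketch of the kind of cache I have in mind, written in Python rather than against htslib itself. Everything in it is an assumption for illustration: the `BlockCache` class, the `fetch_range` callback (standing in for whatever actually issues the ranged S3 request), and the 64 KiB block size are all hypothetical, and none of it reflects how the htslib S3 plugin is implemented. It caches fixed-size blocks on disk, keyed by URL and block index, so repeated reads of the same small region never go back to the network.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 64 * 1024  # hypothetical cache granularity, not an htslib setting


class BlockCache:
    """Cache fixed-size blocks of a remote object in a local directory."""

    def __init__(self, cache_dir, fetch_range, block_size=BLOCK_SIZE):
        # fetch_range(url, start, end) is a hypothetical callback that
        # returns the bytes in [start, end) of the remote object. It may
        # return fewer bytes (or none) at the end of the object.
        self.cache_dir = cache_dir
        self.fetch_range = fetch_range
        self.block_size = block_size
        os.makedirs(cache_dir, exist_ok=True)

    def _block_path(self, url, index):
        # One file per (url, block size, block index) triple.
        key = f"{url}:{self.block_size}:{index}".encode()
        return os.path.join(self.cache_dir, hashlib.sha256(key).hexdigest())

    def _get_block(self, url, index):
        path = self._block_path(url, index)
        try:
            with open(path, "rb") as fh:
                return fh.read()  # cache hit
        except FileNotFoundError:
            pass  # cache miss, fall through to the remote fetch
        start = index * self.block_size
        data = self.fetch_range(url, start, start + self.block_size)
        # Write to a temporary file and rename it into place, so another
        # process reading the cache never sees a half-written block.
        fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        os.replace(tmp, path)
        return data

    def read(self, url, offset, length):
        """Return up to `length` bytes starting at `offset`."""
        if length <= 0:
            return b""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        buf = b"".join(self._get_block(url, i) for i in range(first, last + 1))
        skip = offset - first * self.block_size
        return buf[skip:skip + length]
```

This sketch never evicts or invalidates anything, so it assumes the remote object does not change, and the rename trick only guards against torn writes on a single local filesystem. It does nothing for caches shared between hosts.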

daviesrob commented 4 months ago

While a nice idea, I fear this could be quite complicated in practice. In particular, managing a cache between multiple processes that may even be running on different hosts is likely to lead to a number of low-level difficulties.

For the use case you mention, it may be better to set up a separate process to act as a caching proxy. Having only one process to manage the files makes things much easier. A quick web search brings up https://github.com/rhelmer/caching-s3-proxy although it seems not to have been touched for a while, and I don't know how well it would work on very large files.

cariaso commented 4 months ago

I'll give some thought to the MITM-style solution you've suggested.

alexpreynolds commented 4 months ago

You might also take a look at CloudWatch, which offers some caching of S3 requests.

cariaso commented 4 months ago

I have to imagine you didn't mean CloudWatch; that seems very unrelated.

However https://github.com/nginxinc/nginx-s3-gateway is another possibility.

alexpreynolds commented 4 months ago

CloudFront offers the ability to cache requests to S3-sourced files. I'm mentioning it here as it may be another option for those using S3 for sharing BAM files. Hope this might be useful. Apologies if you're talking about something else.

edit: I meant CloudFront, not CloudWatch. Sorry!

cariaso commented 4 months ago

Can you point to a URL that gives a little more documentation on what you're thinking of? The CloudWatch that I know serves a very different purpose, as near as I understand.