samtools / htslib

C library for high-throughput sequencing data formats

Google Cloud Storage hFILE not using multipart / resumable uploads or downloads for reliability #1814

Open pettyalex opened 1 month ago

pettyalex commented 1 month ago

Summary: GCS reads and writes could use resumable uploads and downloads to recover better from network failures and other transient issues.

I've noticed that htslib's Google Cloud Storage support makes a single request for each download and upload, without doing resumable, chunked, or multipart transfers. I believe that using resumable uploads and Range headers on downloads could significantly increase reliability when working with GCS, potentially fixing most of the GCS read/write problems reported here. I've personally had a pretty bad time reading and writing large files in GCS: it works intermittently, but we hit failures every few hours that make dealing with large files infeasible.

My group is trying to work in Google Cloud via terra.bio, and we're hoping to stream input and output from Google Cloud Storage so that we avoid having to copy around >1TB vcf.gz and bcf files: https://github.com/samtools/bcftools/issues/2235

Google's recommendations for streaming uploads and downloads are here: https://cloud.google.com/storage/docs/streaming-uploads https://cloud.google.com/storage/docs/streaming-downloads

I see two main ways to approach this. One would be for hfile_gcs to keep wrapping hfile_libcurl as it does now, make a request to start a resumable upload before it starts sending data, and then create a new hFILE for each large chunk. hfile_gcs could also handle retrying, although robust retry logic would require keeping each chunk in memory until we know it has been sent successfully.
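To make that first option concrete, here is a rough standalone sketch (plain libcurl, not the hfile plumbing) of the HTTP exchange I mean: open a resumable session with a POST to the JSON API, then PUT 256 KiB-aligned chunks against the session URI returned in the Location header. The bucket, object name, and the bearer token taken from the environment are placeholders/assumptions, not a proposed htslib API:

```c
/* Hypothetical standalone sketch, not htslib's hfile API: open a GCS
 * resumable upload session, then send one chunk against the session URI.
 * my-bucket / my-file.bcf and the GCS_OAUTH_TOKEN env var are assumptions. */
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

#define CHUNK_SIZE (256 * 1024)  /* non-final chunks must be 256 KiB multiples */

/* Capture the Location response header, which carries the session URI */
static size_t header_cb(char *buf, size_t size, size_t nitems, void *userdata)
{
    size_t len = size * nitems;
    if (len > 9 && strncasecmp(buf, "Location:", 9) == 0) {
        char *dst = userdata;
        const char *p = buf + 9;
        size_t n = len - 9;
        while (n > 0 && (*p == ' ' || *p == '\t')) { p++; n--; }
        while (n > 0 && (p[n-1] == '\r' || p[n-1] == '\n')) n--;
        if (n > 2047) n = 2047;
        memcpy(dst, p, n);
        dst[n] = '\0';
    }
    return len;
}

int main(void)
{
    const char *token = getenv("GCS_OAUTH_TOKEN");  /* assumption: bearer token in env */
    if (!token) { fprintf(stderr, "no token\n"); return 1; }
    char auth[1024], session_uri[2048] = "";
    snprintf(auth, sizeof auth, "Authorization: Bearer %s", token);

    CURL *c = curl_easy_init();
    if (!c) return 1;

    /* Step 1: POST ...?uploadType=resumable opens a session; the object only
     * appears in the bucket once the final chunk has been accepted. */
    struct curl_slist *hdrs = curl_slist_append(NULL, auth);
    curl_easy_setopt(c, CURLOPT_URL,
        "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o"
        "?uploadType=resumable&name=my-file.bcf");
    curl_easy_setopt(c, CURLOPT_POSTFIELDS, "");    /* empty POST body */
    curl_easy_setopt(c, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(c, CURLOPT_HEADERFUNCTION, header_cb);
    curl_easy_setopt(c, CURLOPT_HEADERDATA, session_uri);
    if (curl_easy_perform(c) != CURLE_OK || !session_uri[0]) return 1;

    /* Step 2: PUT one chunk.  "*" means total size not yet known; HTTP 308
     * means "resume incomplete, send the next chunk", 200/201 means done.
     * Keeping the chunk buffered until that response arrives is what makes
     * retrying a failed chunk possible. */
    static char chunk[CHUNK_SIZE];                  /* data from the writer's buffer */
    char range[80];
    snprintf(range, sizeof range, "Content-Range: bytes 0-%d/*", CHUNK_SIZE - 1);
    struct curl_slist *hdrs2 = curl_slist_append(NULL, auth);
    hdrs2 = curl_slist_append(hdrs2, range);

    curl_easy_reset(c);
    curl_easy_setopt(c, CURLOPT_URL, session_uri);
    curl_easy_setopt(c, CURLOPT_CUSTOMREQUEST, "PUT");
    curl_easy_setopt(c, CURLOPT_HTTPHEADER, hdrs2);
    curl_easy_setopt(c, CURLOPT_POSTFIELDS, chunk);
    curl_easy_setopt(c, CURLOPT_POSTFIELDSIZE, (long)CHUNK_SIZE);
    CURLcode rc = curl_easy_perform(c);

    long status = 0;
    curl_easy_getinfo(c, CURLINFO_RESPONSE_CODE, &status);
    printf("chunk 1: rc=%d http=%ld\n", (int)rc, status);

    curl_slist_free_all(hdrs);
    curl_slist_free_all(hdrs2);
    curl_easy_cleanup(c);
    return 0;
}
```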

The other approach would be to rework, extend, or wrap hfile_s3_write, because Google Cloud Storage also supports XML multipart uploads matching the S3 API: https://cloud.google.com/storage/docs/multipart-uploads
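For comparison, the XML multipart flow (the same sequence hfile_s3_write already drives against S3) is: POST ...?uploads to get an UploadId, one PUT ...?partNumber=N&uploadId=... per part (each returning an ETag), then a completion POST listing the parts. Below is a hedged sketch of just that final completion call against GCS's XML API; the bucket/object, UPLOAD_ID, and ETag values are placeholders that would come from the earlier calls:

```c
/* Hypothetical sketch of the "complete" step of an XML multipart upload to
 * GCS.  Bucket/object, UPLOAD_ID and the ETags are placeholders from the
 * earlier initiate (?uploads) and per-part PUT requests. */
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *token = getenv("GCS_OAUTH_TOKEN");  /* assumption: bearer token in env */
    if (!token) return 1;
    char auth[1024];
    snprintf(auth, sizeof auth, "Authorization: Bearer %s", token);

    /* One <Part> entry per uploaded part, using the ETag each PUT returned. */
    static const char body[] =
        "<CompleteMultipartUpload>"
        "<Part><PartNumber>1</PartNumber><ETag>\"etag-of-part-1\"</ETag></Part>"
        "<Part><PartNumber>2</PartNumber><ETag>\"etag-of-part-2\"</ETag></Part>"
        "</CompleteMultipartUpload>";

    CURL *c = curl_easy_init();
    if (!c) return 1;
    struct curl_slist *hdrs = curl_slist_append(NULL, auth);
    hdrs = curl_slist_append(hdrs, "Content-Type: application/xml");
    curl_easy_setopt(c, CURLOPT_URL,
        "https://storage.googleapis.com/my-bucket/my-file.bcf?uploadId=UPLOAD_ID");
    curl_easy_setopt(c, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(c, CURLOPT_POSTFIELDS, body);  /* POST with the XML body */
    CURLcode rc = curl_easy_perform(c);

    long status = 0;
    curl_easy_getinfo(c, CURLINFO_RESPONSE_CODE, &status);
    printf("complete multipart: rc=%d http=%ld\n", (int)rc, status);

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(c);
    return 0;
}
```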

It's also possible to work around this in some situations by using the gcloud CLI to do the reads and writes, but that won't work everywhere. For example, one can:

gcloud storage cat gs://my-bucket/my-file.bcf | bcftools view | gcloud storage cp - gs://my-bucket/the-output.bcf

It would also be really nice to use Range requests for reading, as it would be possible to request just one BGZF block at a time when doing random I/O.
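As an illustration, here is a minimal libcurl sketch that fetches a single byte range covering one BGZF block; the bucket, object, and compressed offset are placeholders (a real reader would take the offset from a .csi/.tbi/.bai index and pass the bytes to BGZF decompression):

```c
/* Hypothetical sketch: fetch one byte range from a GCS object with libcurl.
 * Bucket, object and the offset `coffset` are placeholders. */
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>

/* Receive the body; here it is just counted and discarded. */
static size_t body_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)ptr;
    *(size_t *)userdata += size * nmemb;
    return size * nmemb;
}

int main(void)
{
    const char *token = getenv("GCS_OAUTH_TOKEN");  /* assumption: bearer token in env */
    if (!token) return 1;
    char auth[1024], range[64];
    snprintf(auth, sizeof auth, "Authorization: Bearer %s", token);

    /* A BGZF block never exceeds 64 KiB, so a 64 KiB window starting at the
     * block's compressed offset is enough to cover one whole block. */
    unsigned long long coffset = 123456789ULL;      /* placeholder offset */
    snprintf(range, sizeof range, "%llu-%llu", coffset, coffset + 65535ULL);

    size_t got = 0;
    CURL *c = curl_easy_init();
    if (!c) return 1;
    struct curl_slist *hdrs = curl_slist_append(NULL, auth);
    curl_easy_setopt(c, CURLOPT_URL,
        "https://storage.googleapis.com/my-bucket/my-file.bcf");
    curl_easy_setopt(c, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(c, CURLOPT_RANGE, range);      /* sends "Range: bytes=..." */
    curl_easy_setopt(c, CURLOPT_WRITEFUNCTION, body_cb);
    curl_easy_setopt(c, CURLOPT_WRITEDATA, &got);
    CURLcode rc = curl_easy_perform(c);

    long status = 0;
    curl_easy_getinfo(c, CURLINFO_RESPONSE_CODE, &status);
    printf("rc=%d http=%ld bytes=%zu (206 = partial content)\n", (int)rc, status, got);

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(c);
    return 0;
}
```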

whitwham commented 1 month ago

These are all good things that we should add. Our entire cloud storage code needs looking at to see what we can do better. At the moment we are spread a bit thin due to other projects, but hopefully we can get around to this in the not too distant future.