lilab-bcb / cumulus

Cloud-based scalable and efficient single-cell genomics workflows
https://cumulus.readthedocs.io
BSD 3-Clause "New" or "Revised" License

move away from multi-regional #410

Open bgorissen opened 2 months ago

bgorissen commented 2 months ago

Since October 2022, reading data from a bucket in us-multi is no longer free:

Reading data in a Cloud Storage bucket located in a multi-region from a Google Cloud service located in a region on the same continent will no longer be free; instead, such moves will be priced the same as general data moves between different locations on the same continent.

It seems like the cumulus workflows have not been updated after this change:

  1. The zones argument includes us-central, us-east, and us-west by default.
  2. The pipeline pulls resources from gs://regev-lab which is requester-pays and us-multi.
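As a rough illustration of the first point, a zones string can be checked against the region of one's own bucket. This is only a sketch: the function name and the example zone list are made up for illustration and are not taken from the cumulus code. It relies solely on the GCP naming convention that a zone is its region plus a letter suffix (for example, `us-central1-a` is in `us-central1`).

```python
from typing import List


def zones_outside_region(zones: str, region: str) -> List[str]:
    """Return the zones in a space-separated string that are not in `region`."""
    # A GCP zone name is "<region>-<letter>", so dropping the last
    # dash-separated piece yields the region the zone belongs to.
    return [z for z in zones.split() if z.rsplit("-", 1)[0] != region]


# Hypothetical zones value spanning three regions (illustration only).
example_zones = "us-central1-a us-central1-b us-east1-d us-west1-a"
print(zones_outside_region(example_zones, "us-central1"))
# prints ['us-east1-d', 'us-west1-a']
```

An empty result means every listed zone is in the chosen region, so jobs would not be scheduled in another region.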

I just used the workflow for cellranger count, and the costs were 50% higher than necessary because of data transfer charges. An individual user can avoid transfer fees by mirroring the resources into a bucket in the same region as the compute and setting the genome_file parameter to the mirrored URL. Perhaps the choice should be explicit by making zones a required parameter, and perhaps the resources could be made available from a bucket in us-central.
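The workaround could look roughly like the sketch below, which overrides two entries in a workflow inputs dictionary. Everything here is an assumption for illustration: the helper name, the bucket URL, and the zone list are placeholders, and the real inputs JSON may need the keys prefixed with the workflow name.

```python
import json


def single_region_inputs(base_inputs: dict, genome_url: str, zones: str) -> dict:
    """Return a copy of `base_inputs` with `genome_file` and `zones` overridden."""
    updated = dict(base_inputs)  # shallow copy so the caller's dict is untouched
    updated["genome_file"] = genome_url  # mirrored reference in your own bucket
    updated["zones"] = zones  # zones from a single region only
    return updated


# Placeholder values (hypothetical bucket and key names).
base = {"some_other_input": "unchanged"}
inputs = single_region_inputs(
    base,
    genome_url="gs://my-mirror-bucket/refs/genome.tar.gz",
    zones="us-central1-a us-central1-b",
)
print(json.dumps(inputs, indent=2))
```

The mirror bucket should sit in the same region as the listed zones; otherwise the transfer charge simply moves to the mirror.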

yihming commented 1 week ago

Hi @bgorissen ,

Thanks a lot for reporting this price issue!

For a general-purpose workflow, it's hard to guarantee that the zones it runs in always match the location of the Google bucket holding the input data. I believe this should be handled by Cromwell, the underlying workflow execution engine; only at that level can such consistency be enforced.

What I can do on our side is add a dedicated section to our docs page highlighting this pricing issue, so that users can adjust their settings themselves.

The gs://regev-lab bucket is maintained by the Broad Institute, and our team doesn't have management permission for it. I'll let them know about this pricing issue.

I don't know how you run your workflows on GCP, but I'd like to mention GCP Batch, which will replace the Google Life Sciences API in July 2025: if you deploy it within one region, you no longer need to specify zones in your workflow input, and your jobs will be executed only within that region.