malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Add automatic selection of regional GCS storage #523

Closed alimanfoo closed 1 month ago

alimanfoo commented 2 months ago

This PR adds automatic detection of the client's GCP region, and selects a storage URL in the same region if one is available.

This should reduce some network usage costs where we are accessing data from us-central1.
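For readers following along, here is a minimal sketch of the general idea, not the code in this PR. It assumes the client runs on a GCP VM where the standard metadata server is reachable. The function names and the `DEFAULT_URL` placeholder are hypothetical; only the `vo_afun_release_master_us_central1` bucket name comes from this thread.

```python
import urllib.request
from typing import Mapping, Optional

# Standard GCP metadata server endpoint for the instance zone.
METADATA_ZONE_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/zone"
)

# Hypothetical placeholder for the default (multi-region) storage URL.
DEFAULT_URL = "gs://example-multi-region-bucket/"

# Regional bucket named in this thread; the mapping itself is an assumption.
REGIONAL_URLS = {
    "us-central1": "gs://vo_afun_release_master_us_central1/",
}


def region_from_zone(zone_path: str) -> str:
    """Turn 'projects/<id>/zones/us-central1-a' into 'us-central1'."""
    zone = zone_path.rsplit("/", 1)[-1]  # e.g. "us-central1-a"
    return zone.rsplit("-", 1)[0]  # drop the trailing zone letter


def detect_gcp_region(timeout: float = 1.0) -> Optional[str]:
    """Return the client's GCP region, or None when not running on GCP."""
    request = urllib.request.Request(
        METADATA_ZONE_URL, headers={"Metadata-Flavor": "Google"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            zone_path = response.read().decode("utf-8").strip()
    except OSError:
        # Covers DNS failures, timeouts and HTTP errors off GCP.
        return None
    return region_from_zone(zone_path)


def select_storage_url(
    region: Optional[str],
    default_url: str,
    regional_urls: Mapping[str, str],
) -> str:
    """Prefer a bucket in the client's region, else fall back to the default."""
    if region is None:
        return default_url
    return regional_urls.get(region, default_url)
```

With this shape, a caller would do something like `select_storage_url(detect_gcp_region(), DEFAULT_URL, REGIONAL_URLS)` once at start-up, so a client outside GCP or in a region without its own bucket simply falls back to the default URL.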

alimanfoo commented 2 months ago

Thanks @ahernank and @leehart. Just to check, is all the data available in the vo_afun_release_master_us_central1 bucket now? I.e., OK to merge this PR?

ahernank commented 2 months ago

Yes, @alimanfoo. The data from Af1.x has been copied to vo_afun_release_master_us_central1.

Unrelated to this PR, we still need to delete the non GT data from the release bucket.

leehart commented 2 months ago

> we still need to delete the non GT data from the release bucket.

@ahernank I think we've now decided to not delete the non-GT data from the Af release bucket, and hence the Zarr metadata and this package will not need updating. With that change in mind, I've actually restored the Af1.4 non-GT data to the release bucket. Obviously, this is inconsistent with our current Ag3.x release bucket and our SNP Data Release process. See https://github.com/malariagen/vector-data-processing/issues/36

alimanfoo commented 1 month ago

Thanks both. Surfacing discussion from this morning: we may decide to deprecate the multi-region buckets, so I will leave this PR open for now.

alimanfoo commented 1 month ago

Looking at this PR again, I think I'd be in favour of merging this anyway, as it includes some improvements to the logic around when we check the client location and how the colab location check works.

If we then deprecate the multi-region buckets we could make a subsequent PR to simplify back down.