hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

[batch] expose job cloud location to input, main, and output containers #14189

Open · danking opened this issue 9 months ago

danking commented 9 months ago

What happened?

Batch should expose a job's cloud location to the job itself. In particular, now that multi-regional buckets charge for egress, users who need large numbers of cores will need to manually duplicate their data across regions and then choose the correct data source based on the region in which the job is scheduled.

The implementor should consider other options, but here is an initial proposal:

  1. Input and output files become dictionaries mapping location to an input/output path. (If the job's location is not a key in the dictionary, the job fails.)
  2. The main container's file system and environment are populated with information about the job's location.

The implementor should consider whether the region, the zone, or both should be exposed in GCP; likewise for Azure regions and availability zones (AZs).
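
A minimal sketch, from the Python client's perspective, of what proposal point 1 might look like. The dictionary-valued `read_input`/`write_output` overloads are hypothetical and do not exist in `hailtop.batch` today; only the surrounding calls mirror the existing client API.

```python
# Hypothetical sketch of proposal point 1. The dict-valued overloads of
# read_input/write_output do NOT exist in hailtop.batch today; everything
# else mirrors the current Python client.
import hailtop.batch as hb

b = hb.Batch(name='region-aware-io')

# The service would resolve this mapping against the region the job is
# scheduled in; if that region is not a key, the job fails.
data = b.read_input({
    'us-central1': 'gs://my-bucket-us-central1/cohort.vcf.gz',
    'europe-west1': 'gs://my-bucket-europe-west1/cohort.vcf.gz',
})

j = b.new_job(name='count-records')
j.command(f'zcat {data} | wc -l > {j.out}')

# Outputs keyed the same way: region -> destination path.
b.write_output(j.out, {
    'us-central1': 'gs://my-bucket-us-central1/record-count.txt',
    'europe-west1': 'gs://my-bucket-europe-west1/record-count.txt',
})

b.run()
```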

References

Version

0.2.127

Relevant log output

No response

daniel-goldstein commented 9 months ago

Note that Hail Batch currently does set the HAIL_REGION environment variable, but this does not include information about the zone in GCP. If the cloud provides a metadata server implementation, exposing such an endpoint could be a reasonable approach rather than adding more environment variables.
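
For reference, a minimal sketch of how a job could discover its location on GCP today, assuming the main container can reach the standard GCE metadata server. `HAIL_REGION` is the variable mentioned above; the zone lookup uses the documented GCE metadata endpoint.

```python
# Sketch: discovering a job's location on GCP from inside the main
# container. HAIL_REGION is set by Hail Batch today (see above); the zone
# comes from the standard GCE metadata server, assuming the container is
# allowed to reach it.
import os
import urllib.request

region = os.environ.get('HAIL_REGION')  # e.g. 'us-central1'

req = urllib.request.Request(
    'http://metadata.google.internal/computeMetadata/v1/instance/zone',
    headers={'Metadata-Flavor': 'Google'},
)
with urllib.request.urlopen(req, timeout=2) as resp:
    # Body looks like 'projects/<project-number>/zones/us-central1-a'.
    zone = resp.read().decode().rsplit('/', 1)[-1]

print(f'region={region} zone={zone}')
```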