Closed MattWellie closed 1 year ago
Hoping to hear from Greg/Miah (MSFT) about their progress on tweaking AIP to work in their Azure instance of analysis-runner. Their findings will hopefully help find a solution to this issue
Note that it's already possible to avoid our strict naming scheme by specifying a dataset_path
in the config: https://github.com/populationgenomics/cpg-utils/blob/8f805b492b6dcd28a7153506db96bd13b9e75784/cpg_utils/hail_batch.py#L267
@lgruen Ahhh of course, I read that block of code and completely missed the implications.
So far I've not touched the environment configuration, just passing a settings JSON to be used at runtime. If a user wants to change pathing etc. then they would only have to provide a different config file.
This has me thinking that I should replace the runtime JSON with a TOML version, and use the get/set_config() methods to interact with the contents, instead of passing around the dictionary
Yes, I think that would be nice. Note that we also allow specifying multiple config files in the CPG_CONFIG_PATH
environment variable. They get evaluated one after the other and subsequent configs can override previously set values. So you can share common templates and "specialize" for particular use cases.
My opinion is the config should report something like:
[buckets]
category = "bucket"
default = "cpg-{dataset}-{main/test}"
tmp = "cpg-{dataset}-{main/test}-tmp"
web = "cpg-{dataset}-{main/test}-web"
The analysis runner should fill these values in - and the function dataset_path / output_path
+ category
(default="default") should find the key listed in here.
Yes, that makes sense. One clear advantage of this method is that it allows you to specify a different tmp
bucket. That's really important as many pipelines assume that anything written to tmp
will disappear automatically and therefore not accrue storage cost for long periods of time.
The "bucket" terminology is somewhat specific to S3 / GCS. Maybe we can call this [paths]
instead? Any other ideas?
We probably also need to include the gs://
/ https://mystorageaccount.blob.core.windows.net
prefix in the value, right?
@lgruen @illusional
AIP has now been updated to use the cpg-utils config management as standard, which I think solves the AIP part of this discussion (I'll add something to the readme specifically indicating that dataset_path
is now configurable if a user requires that)
Are you happy for this ^^ discussion to be migrated to cpg-utils as an issue?
Closed due to cpg-utils updates that will allow users to avoid analysis-runner setup and add in their own complete paths
Currently all output paths are developed using
output_path
https://github.com/populationgenomics/cpg-utils/blob/main/cpg_utils/hail_batch.py#L297-L337
This includes writes to the main output folder, -temp, and -web
As we are splitting runs across Azure and GCP (and across CPG and non-CPG infrastructure), this forces bucket naming conventions on users. Users will have to adopt the suffix'd naming structure, or this will not be compatible with their deployments.
Possible solution:
Add specific paths to the configuration file, and build paths relative to those?
Pros:
Cons:
Other
cpg-utils
to accommodate that doesn't make much sense