Shift output pathing to config file?

MattWellie commented 1 year ago

Currently all output paths are developed using output_path

https://github.com/populationgenomics/cpg-utils/blob/main/cpg_utils/hail_batch.py#L297-L337

This includes writes to the main output folder, -temp, and -web

As we are splitting runs across Azure and GCP (and across CPG and non-CPG infrastructure), this forces bucket naming conventions on users. Users will have to adopt the suffix'd naming structure, or this will not be compatible with their deployments.

Possible solution:

Add specific paths to the configuration file, and build paths relative to those?

Pros:

any user can quickly adapt the code to run on their own infrastructure (local/gcp/az) and fit their own naming conventions
we would only need 3 entries (main/test/web), and if required these could all point to the same folder

Cons:

more maintenance of the config file required, e.g. different files for test/main

Other

there are surely better ways of addressing this
AIP is intended to be run outside of the CPG, so altering cpg-utils to accommodate that doesn't make much sense

MattWellie commented 1 year ago

Hoping to hear from Greg/Miah (MSFT) about their progress on tweaking AIP to work in their Azure instance of analysis-runner. Their findings will hopefully help find a solution to this issue

lgruen commented 1 year ago

Note that it's already possible to avoid our strict naming scheme by specifying a dataset_path in the config: https://github.com/populationgenomics/cpg-utils/blob/8f805b492b6dcd28a7153506db96bd13b9e75784/cpg_utils/hail_batch.py#L267

MattWellie commented 1 year ago

@lgruen Ahhh of course, I read that block of code and completely missed the implications.

So far I've not touched the environment configuration, just passing a settings JSON to be used at runtime. If a user wants to change pathing etc. then they would only have to provide a different config file.

This has me thinking that I should replace the runtime JSON with a TOML version, and use the get/set_config() methods to interact with the contents, instead of passing around the dictionary

lgruen commented 1 year ago

Yes, I think that would be nice. Note that we also allow specifying multiple config files in the CPG_CONFIG_PATH environment variable. They get evaluated one after the other and subsequent configs can override previously set values. So you can share common templates and "specialize" for particular use cases.

illusional commented 1 year ago

My opinion is the config should report something like:

[buckets]
category = "bucket"
default = "cpg-{dataset}-{main/test}"
tmp = "cpg-{dataset}-{main/test}-tmp"
web = "cpg-{dataset}-{main/test}-web"

The analysis runner should fill these values in - and the function dataset_path / output_path + category (default="default") should find the key listed in here.

lgruen commented 1 year ago

Yes, that makes sense. One clear advantage of this method is that it allows you to specify a different tmp bucket. That's really important as many pipelines assume that anything written to tmp will disappear automatically and therefore not accrue storage cost for long periods of time.

The "bucket" terminology is somewhat specific to S3 / GCS. Maybe we can call this [paths] instead? Any other ideas?

We probably also need to include the gs:// / https://mystorageaccount.blob.core.windows.net prefix in the value, right?

MattWellie commented 1 year ago

@lgruen @illusional AIP has now been updated to use the cpg-utils config management as standard, which I think solves the AIP part of this discussion (I'll add something to the readme specifically indicating that dataset_path is now configurable if a user requires that)

Are you happy for this ^^ discussion to be migrated to cpg-utils as an issue?

MattWellie commented 1 year ago

Closed due to cpg-utils updates that will allow users to avoid analysis-runner setup and add in their own complete paths

populationgenomics / automated-interpretation-pipeline

Shift output pathing to config file? #122

Possible solution: