GnomAD steps configuration extraction and versioning

project-defiant commented 1 month ago

As a developer I want to extract the hardcoded gnomAD configuration object storage keys to increase the readability and maintainability of the codebase and to enable staging storage versioning that reflects the gnomAD updates.

Background

GnomAD provides updates to their database that need to be reflected in the Open Targets gentropy inputs. gnomAD is currently at version 4.1, which introduced a bugfix for an issue with lower AN and AF numbers in a part of UK Biobank [UKB] exomes or non-UKB exomes - more details

Data that comes from gnomAD is used in the preprocess_gnomad DAG. The DAG sets up two tasks:

- ot_ld_index - Airflow task that uses the ld_index Hydra command to run the LDIndexStep step. The step implements the interface to run the GnomADLDMatrix class method and writes the result table to the path given by the ld_index_out CLI variable set up in the Hydra config.
- ot_variant_annotation - Airflow task that uses the variant_annotation gentropy CLI command to run the VariantAnnotationStep step. The step implements the interface to run the GnomADVariants class method and writes the result table to the path given by the variant_annotation_path CLI variable set up in the Hydra config.

Both classes, GnomADVariants and GnomADLDMatrix, expose class attributes that are used to read raw data from the gnomAD public registry.

These class attributes define the gnomAD data source paths, which point to v2.1.1 (LD data) and v4.0 (variant data). The LD data is not present in the new release, but the variant data is available under v4.1. If we want to use the v4.1 variant data, we currently have to change the paths that live in the library, which is not considered best practice.

With the above in mind, the strategy for handling new gnomAD (provider) data releases is to add additional flags to the CLI while retaining meaningful defaults.

The available gnomAD versions can be listed with the gsutil command:

gsutil ls gs://gcp-public-data--gnomad/release/
gs://gcp-public-data--gnomad/release/
gs://gcp-public-data--gnomad/release/2.1.1/
gs://gcp-public-data--gnomad/release/2.1/
gs://gcp-public-data--gnomad/release/3.0.1/
gs://gcp-public-data--gnomad/release/3.0/
gs://gcp-public-data--gnomad/release/3.1.1/
gs://gcp-public-data--gnomad/release/3.1.2/
gs://gcp-public-data--gnomad/release/3.1.3/
gs://gcp-public-data--gnomad/release/3.1/
gs://gcp-public-data--gnomad/release/4.0/
gs://gcp-public-data--gnomad/release/4.1/
gs://gcp-public-data--gnomad/release/v4.0/

Versions use a major.minor.patch scheme. The meaningful default should be the latest version available. This should result in new CLI flags that are passed downstream to the GnomADLDMatrix and GnomADVariants dataclasses.
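One possible way to pick the latest release from a listing like the one above is sketched below. The function names `parse_release` and `latest_release` are hypothetical and not part of gentropy; the sketch assumes the listing has already been fetched as a list of strings and that a `v` prefix (as in `v4.0/`) can be ignored when comparing versions.

```python
import re
from typing import List, Optional, Tuple

# Listing copied from the gsutil output above.
LISTING = [
    "gs://gcp-public-data--gnomad/release/",
    "gs://gcp-public-data--gnomad/release/2.1.1/",
    "gs://gcp-public-data--gnomad/release/2.1/",
    "gs://gcp-public-data--gnomad/release/3.0.1/",
    "gs://gcp-public-data--gnomad/release/3.0/",
    "gs://gcp-public-data--gnomad/release/3.1.1/",
    "gs://gcp-public-data--gnomad/release/3.1.2/",
    "gs://gcp-public-data--gnomad/release/3.1.3/",
    "gs://gcp-public-data--gnomad/release/3.1/",
    "gs://gcp-public-data--gnomad/release/4.0/",
    "gs://gcp-public-data--gnomad/release/4.1/",
    "gs://gcp-public-data--gnomad/release/v4.0/",
]


def parse_release(path: str) -> Optional[Tuple[int, ...]]:
    """Return the version in a release path as a tuple of ints, or None."""
    # Assumption: an optional leading "v" is not significant for ordering.
    match = re.search(r"/release/v?(\d+(?:\.\d+)*)/$", path)
    if match is None:
        return None  # e.g. the bare ".../release/" prefix itself
    return tuple(int(part) for part in match.group(1).split("."))


def latest_release(listing: List[str]) -> str:
    """Return the highest version found in the listing as a dotted string."""
    versions = {v for v in map(parse_release, listing) if v is not None}
    if not versions:
        raise ValueError("no release directories found in listing")
    # Tuples compare element by element, so (4, 1) > (4, 0) > (3, 1, 3).
    return ".".join(str(part) for part in max(versions))


print(latest_release(LISTING))
```

Comparing integer tuples rather than strings avoids lexicographic surprises if a component ever reaches two digits.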

The following paths have to be taken into account:

- gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm
- gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht
- gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.ht
- gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht
- gs://hail-common/references/grch38_to_grch37.over.chain.gz
- gs://hail-common/references/grch37_to_grch38.over.chain.gz

and populations

Variants:
- "afr", # African-American
- "amr", # American Admixed/Latino
- "ami", # Amish ancestry
- "asj", # Ashkenazi Jewish
- "eas", # East Asian
- "fin", # Finnish
- "nfe", # Non-Finnish European
- "mid", # Middle Eastern
- "sas", # South Asian
- "remaining", # Other

LD index:
- "afr", # African-American
- "amr", # American Admixed/Latino
- "asj", # Ashkenazi Jewish
- "eas", # East Asian
- "fin", # Finnish
- "nfe", # Non-Finnish European
- "nwe", # Northwestern European
- "seu", # Southeastern European
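To illustrate how the `{POP}` placeholder in the LD paths could be combined with the LD index populations, here is a minimal sketch. The helper `resolve_ld_paths` and the constant names are hypothetical, not existing gentropy code; the templates and population codes are the ones listed above.

```python
from typing import Dict, Iterable, Tuple

# Templates taken from the paths listed above; {POP} is the population code.
LD_MATRIX_TEMPLATE = (
    "gs://gcp-public-data--gnomad/release/2.1.1/ld/"
    "gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm"
)
LD_INDEX_TEMPLATE = (
    "gs://gcp-public-data--gnomad/release/2.1.1/ld/"
    "gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht"
)
LD_POPULATIONS = ["afr", "amr", "asj", "eas", "fin", "nfe", "nwe", "seu"]


def resolve_ld_paths(
    populations: Iterable[str],
    matrix_template: str = LD_MATRIX_TEMPLATE,
    index_template: str = LD_INDEX_TEMPLATE,
) -> Dict[str, Tuple[str, str]]:
    """Map each population to its (matrix path, variant index path) pair."""
    return {
        pop: (matrix_template.format(POP=pop), index_template.format(POP=pop))
        for pop in populations
    }


for pop, (matrix_path, index_path) in resolve_ld_paths(LD_POPULATIONS).items():
    print(pop, matrix_path, index_path)
```

Keeping the templates as function defaults means a caller can pass a different template or a subset of populations without touching the library code.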

The new CLI flags should capture:

gnomAD public bucket (--gnomad_public_bucket) that defaults to gs://gcp-public-data--gnomad/release/ - used to retrieve the information about the gnomAD dataset version
liftover_chain_file (--liftover reference) that defaults to existing chain path appropriate to the step
populations (--populations) with defaults as above
customised paths to the resources
- GnomADVariants
  - variants hail table (--genome_variants_table)
- GnomADLDMatrix
  - matrix table (--matrix_table_path) - these will require template variable for populations
  - variant indices (--variant_indices_path) - these will require template variable for populations

Eventually, the variables used as dataclass inputs should be set as defaults, while preserving the possibility to overwrite them via the CLI.
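The "defaults that can still be overwritten" idea could look roughly like the sketch below. It uses a plain standard-library dataclass rather than the actual gentropy Config or Hydra machinery, and the class name `GnomadVariantConfigSketch` is hypothetical. The field names mirror the proposed flags; the liftover chain field is left out because the issue does not say which chain file belongs to which step.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class GnomadVariantConfigSketch:
    """Hypothetical stand-in for the variant annotation step configuration."""

    # Defaults are the values quoted in this issue.
    gnomad_public_bucket: str = "gs://gcp-public-data--gnomad/release/"
    genome_variants_table: str = (
        "gs://gcp-public-data--gnomad/release/4.0/ht/genomes/"
        "gnomad.genomes.v4.0.sites.ht"
    )
    populations: Tuple[str, ...] = (
        "afr", "amr", "ami", "asj", "eas",
        "fin", "nfe", "mid", "sas", "remaining",
    )


# With no overrides, the step would run on the built-in defaults.
default_cfg = GnomadVariantConfigSketch()

# A CLI override replaces only the named field; the others keep their defaults.
# dataclasses.replace raises TypeError for a field name that does not exist.
custom_cfg = replace(default_cfg, populations=("nfe", "fin"))

print(default_cfg.populations)
print(custom_cfg.populations)
```

Making the dataclass frozen is a design choice for the sketch: overrides produce a new config object, so the library defaults are never mutated in place.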

Tasks

[x] Add new initialization arguments to the LDIndexStep and VariantAnnotationStep
[x] Add new configuration flags to the Config class
[x] Add new configuration flags to the hydra config in config directory
[x] Update v2.1.1 populations by adding Estonian population (currently missing)
[x] Add functionality to infer the version of the gnomAD release
[ ] ~~Update output path for ot_ld_index and ot_variant_annotation steps to use the gnomAD release~~
[x] Unify lift over chain flags

Acceptance tests

How do we know the task is complete?

When one can provide additional configuration to the preprocess_gnomad DAG and to the gentropy variant_annotation and ld_index steps.
When running the preprocess_gnomad DAG does not fail due to a non-empty output bucket.

project-defiant commented 1 month ago

@d0choa @Daniel-Considine FYI

d0choa commented 1 month ago

cc @DSuveges

opentargets / issues

GnomAD steps configuration extraction and versioning #3324
