opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

GnomAD steps configuration extraction and versioning #3324

Closed project-defiant closed 1 month ago

project-defiant commented 1 month ago

As a developer, I want to extract the hardcoded gnomAD object storage keys into configuration, to increase the readability and maintainability of the codebase and to enable staging storage versioning that reflects gnomAD updates.

Background

gnomAD provides updates to its database that need to be reflected in the Open Targets gentropy inputs. The current gnomAD version is 4.1, which introduced a bugfix for an issue with lower AN and AF numbers in a part of UK Biobank [UKB] exomes or non-UKB exomes (more details).

Data that comes from gnomAD is used in the preprocess_gnomad DAG. The DAG sets up two tasks:

Both classes, GnomADVariants and GnomADLDMatrix, expose class attributes that are used to read raw data from the gnomAD public registry.

These class attributes define the gnomAD data source paths, which point to v2.1.1 (LD data) and v4.0 (variant data). While the LD data is not present in the new release, the variant data is available under v4.1. If we want to use the v4.1 variant data, we need to change the paths that live in the library, which is not considered best practice.

With the above in mind, the strategy for handling new gnomAD (provider) data releases is to add additional flags to the CLI while retaining meaningful defaults.
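A minimal sketch of how a flag with a meaningful default could replace a hardcoded path. Everything here is an assumption made for illustration: the class name GnomADVariantsConfig, the flags --gnomad-version and --variants-path, and the use of argparse are hypothetical and are not the gentropy CLI. The sketch also assumes that the v4.1 variant table follows the same bucket layout as the v4.0 path quoted in this issue.

```python
from __future__ import annotations

import argparse
from dataclasses import dataclass

# Assumption: newer releases keep the layout of the v4.0 path quoted in this issue.
VARIANTS_TEMPLATE = (
    "gs://gcp-public-data--gnomad/release/{version}/ht/genomes/"
    "gnomad.genomes.v{version}.sites.ht"
)


@dataclass
class GnomADVariantsConfig:
    """Hypothetical stand-in for the path settings currently hardcoded in the library."""

    gnomad_version: str = "4.1"  # assumed default: the latest release named in this issue
    variants_path: str = ""  # an explicit path wins over the version-derived one

    def __post_init__(self) -> None:
        if not self.variants_path:
            self.variants_path = VARIANTS_TEMPLATE.format(version=self.gnomad_version)


def parse_config(argv: list[str]) -> GnomADVariantsConfig:
    """Map hypothetical CLI flags onto the config object."""
    parser = argparse.ArgumentParser(description="gnomAD variant step (sketch)")
    parser.add_argument("--gnomad-version", default="4.1")
    parser.add_argument("--variants-path", default="")
    args = parser.parse_args(argv)
    return GnomADVariantsConfig(args.gnomad_version, args.variants_path)
```

With this shape, parse_config([]) keeps the default release, parse_config(["--gnomad-version", "4.0"]) pins an older one, and passing --variants-path bypasses the template entirely, so a new gnomAD release does not require a change in the library.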

The available gnomAD versions can be listed with the gsutil command:

gsutil ls gs://gcp-public-data--gnomad/release/
gs://gcp-public-data--gnomad/release/
gs://gcp-public-data--gnomad/release/2.1.1/
gs://gcp-public-data--gnomad/release/2.1/
gs://gcp-public-data--gnomad/release/3.0.1/
gs://gcp-public-data--gnomad/release/3.0/
gs://gcp-public-data--gnomad/release/3.1.1/
gs://gcp-public-data--gnomad/release/3.1.2/
gs://gcp-public-data--gnomad/release/3.1.3/
gs://gcp-public-data--gnomad/release/3.1/
gs://gcp-public-data--gnomad/release/4.0/
gs://gcp-public-data--gnomad/release/4.1/
gs://gcp-public-data--gnomad/release/v4.0/

Versions use the major.minor.patch system. The meaningful default should be the latest version available. This should result in the addition of new flags to the CLI that will be passed downstream to the GnomADLDMatrix and GnomADVariants dataclasses.
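One way the latest version could be derived from the listing above, as a rough sketch. The function names parse_version and latest_version are hypothetical, and the listing is pasted in as a string rather than fetched, so this does not call gsutil. Two details of the listing matter: the bare release/ prefix is not a version, and v4.0/ duplicates 4.0/ with a leading "v", so both are normalised before comparison.

```python
from __future__ import annotations

import re

# Output of `gsutil ls gs://gcp-public-data--gnomad/release/` as quoted in this issue.
LISTING = """
gs://gcp-public-data--gnomad/release/
gs://gcp-public-data--gnomad/release/2.1.1/
gs://gcp-public-data--gnomad/release/2.1/
gs://gcp-public-data--gnomad/release/3.0.1/
gs://gcp-public-data--gnomad/release/3.0/
gs://gcp-public-data--gnomad/release/3.1.1/
gs://gcp-public-data--gnomad/release/3.1.2/
gs://gcp-public-data--gnomad/release/3.1.3/
gs://gcp-public-data--gnomad/release/3.1/
gs://gcp-public-data--gnomad/release/4.0/
gs://gcp-public-data--gnomad/release/4.1/
gs://gcp-public-data--gnomad/release/v4.0/
"""


def parse_version(uri: str) -> tuple[int, ...] | None:
    """Return the version in the last path component as a tuple of ints, or None."""
    name = uri.rstrip("/").rsplit("/", 1)[-1]
    match = re.fullmatch(r"v?(\d+(?:\.\d+){1,2})", name)
    if match is None:
        return None  # e.g. the bare `release/` prefix
    return tuple(int(part) for part in match.group(1).split("."))


def latest_version(listing: str) -> str:
    """Pick the highest version by numeric comparison, not string order."""
    versions = {v for line in listing.split() if (v := parse_version(line)) is not None}
    return ".".join(str(part) for part in max(versions))
```

For the listing quoted in this issue, latest_version(LISTING) returns "4.1". Comparing integer tuples avoids the pitfalls of string sorting, where a release such as "3.10" would sort before "3.9".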

The following paths have to be taken into account:
gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm
gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht
gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.ht
gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht
gs://hail-common/references/grch38_to_grch37.over.chain.gz
gs://hail-common/references/grch37_to_grch38.over.chain.gz

and populations
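The {POP} placeholder in the two LD paths above can be expanded per population, for example as in the sketch below. The helper name ld_paths is hypothetical, and the population labels in the usage note are illustrative assumptions, not the list intended by this issue.

```python
from __future__ import annotations

# Templates copied from the LD paths listed in this issue.
LD_MATRIX_TEMPLATE = (
    "gs://gcp-public-data--gnomad/release/2.1.1/ld/"
    "gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm"
)
LD_INDEX_TEMPLATE = (
    "gs://gcp-public-data--gnomad/release/2.1.1/ld/"
    "gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht"
)


def ld_paths(populations: list[str]) -> dict[str, tuple[str, str]]:
    """Return (LD matrix path, variant index path) for each population label."""
    return {
        pop: (LD_MATRIX_TEMPLATE.format(POP=pop), LD_INDEX_TEMPLATE.format(POP=pop))
        for pop in populations
    }
```

Calling ld_paths(["nfe", "afr"]) yields one pair of paths per label. Keeping the templates and the population list as separate inputs means either can be overridden from the CLI without touching the other.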

The new CLI flags should capture:

Acceptance tests

How do we know the task is complete?

  1. When one can provide additional configuration to the preprocess_gnomad DAG and to the gentropy variant_annotation and ld_index steps.
  2. When running the preprocess_gnomad DAG does not fail due to a non-empty output bucket.

project-defiant commented 1 month ago

@d0choa @Daniel-Considine FYI

d0choa commented 1 month ago

cc @DSuveges