As a developer I want to extract hardcoded gnomAD configuration object storage keys to increase readability and maintainability of the codebase and enable staging storage versioning that will reflect the gnomAD updates.
Background
gnomAD provides updates to its database that need to be reflected in the Open Targets gentropy inputs. The current gnomAD release is v4.1, which introduced a bugfix for an issue with lower AN and AF numbers in a part of the UK Biobank (UKB) exomes or non-UKB exomes.
Data from gnomAD is used in the preprocess_gnomad DAG. The DAG sets up two tasks:
ot_ld_index - Airflow task that uses the ld_index hydra command to run the LDIndexStep step. The step implements the interface to run the GnomADLDMatrix class method and writes the result table to the path given by the ld_index_out CLI variable set in the hydra config.
ot_variant_annotation - Airflow task that uses the variant_annotation gentropy CLI command to run the VariantAnnotationStep step. The step implements the interface to run the GnomADVariants class method and writes the result table to the path given by the variant_annotation_path CLI variable set in the hydra config.
Both classes, GnomADVariants and GnomADLDMatrix, expose class attributes that are used to read raw data from the gnomAD public registry.
These class attributes define the gnomAD data source paths, which point to v2.1.1 (LD data) and v4.0 (variant data). While the LD data is not present in the new release, the variant data is available under v4.1. To use the v4.1 variant data, we currently need to change the paths that live in the library, which is not considered best practice.
With the above in mind, the strategy for handling new gnomAD (provider) data releases is to add additional flags to the CLI, while retaining meaningful defaults.
The available gnomAD release versions can be listed with the gsutil command.
Versions follow the major.minor.patch scheme. The meaningful default should be the latest version available. This should result in the addition of new flags to the CLI that will be passed downstream to the GnomADLDMatrix and GnomADVariants dataclasses.
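As a rough illustration of how the latest version could be inferred, the sketch below parses a list of release directory URIs (such as a bucket listing might return) and picks the highest version. The helper names `parse_version` and `latest_release` are hypothetical and are not part of gentropy. How the listing is obtained (gsutil, a storage client, etc.) is deliberately left out.

```python
from __future__ import annotations

import re

# Matches a release directory such as ".../release/4.1/" or ".../release/2.1.1/".
_VERSION_RE = re.compile(r"/release/(\d+(?:\.\d+){0,2})/?$")


def parse_version(uri: str) -> tuple[int, ...] | None:
    """Return the version encoded in a release directory URI, or None."""
    match = _VERSION_RE.search(uri)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def latest_release(listing: list[str]) -> str:
    """Pick the highest version found in a listing (hypothetical helper)."""
    versions = [v for v in map(parse_version, listing) if v is not None]
    if not versions:
        raise ValueError("no versioned release directories found in listing")
    # Tuples compare element by element, so (4, 1) > (4, 0) > (2, 1, 1).
    return ".".join(str(part) for part in max(versions))


# Illustrative input only, not the actual bucket contents.
example_listing = [
    "gs://gcp-public-data--gnomad/release/2.1.1/",
    "gs://gcp-public-data--gnomad/release/4.0/",
    "gs://gcp-public-data--gnomad/release/4.1/",
]
print(latest_release(example_listing))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "10.0" sorting before "4.1". Since the LD data is only published under v2.1.1, a default derived this way would presumably apply to the variant data only.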
The following paths have to be taken into account:
gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm
gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht
gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.ht
gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht
gs://hail-common/references/grch38_to_grch37.over.chain.gz
gs://hail-common/references/grch37_to_grch38.over.chain.gz
The following populations also have to be taken into account:
Variants:
"afr", # African-American
"amr", # American Admixed/Latino
"ami", # Amish ancestry
"asj", # Ashkenazi Jewish
"eas", # East Asian
"fin", # Finnish
"nfe", # Non-Finnish European
"mid", # Middle Eastern
"sas", # South Asian
"remaining", # Other
LD index:
"afr", # African-American
"amr", # American Admixed/Latino
"asj", # Ashkenazi Jewish
"eas", # East Asian
"fin", # Finnish
"nfe", # Non-Finnish European
"nwe", # Northwestern European
"seu", # Southeastern European
The new CLI flags should capture:
gnomAD public bucket (--gnomad_public_bucket) that defaults to gs://gcp-public-data--gnomad/release/ - used to retrieve the information about the gnomAD dataset version
liftover_chain_file (--liftover reference) that defaults to the existing chain path appropriate to the step
populations (--populations) with defaults as above
customised paths to the resources:
    GnomADVariants:
        variants hail table (--genome_variants_table)
    GnomADLDMatrix:
        matrix table (--matrix_table_path) - requires a template variable for populations
        variant indices (--variant_indices_path) - requires a template variable for populations
Ultimately, the values used as dataclass inputs should be set as defaults, while remaining overridable from the CLI.
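The idea of defaults that stay overridable, combined with a population template variable, can be sketched as below. This is a minimal sketch and not the gentropy implementation. The class name `LDIndexConfig` and the method `resolved_paths` are hypothetical. The field names mirror the proposed flags, and the default values are the v2.1.1 paths and LD populations listed above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LDIndexConfig:
    """Hypothetical config object; field names mirror the proposed CLI flags."""

    # Templates keep a {POP} placeholder that is filled in per population.
    matrix_table_path: str = (
        "gs://gcp-public-data--gnomad/release/2.1.1/ld/"
        "gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm"
    )
    variant_indices_path: str = (
        "gs://gcp-public-data--gnomad/release/2.1.1/ld/"
        "gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht"
    )
    populations: list[str] = field(
        default_factory=lambda: [
            "afr", "amr", "asj", "eas", "fin", "nfe", "nwe", "seu",
        ]
    )

    def resolved_paths(self) -> dict[str, tuple[str, str]]:
        """Map each population to its (matrix table, variant indices) paths."""
        return {
            pop: (
                self.matrix_table_path.format(POP=pop),
                self.variant_indices_path.format(POP=pop),
            )
            for pop in self.populations
        }


# Defaults are used unless a value is overridden, e.g. from CLI arguments.
default_config = LDIndexConfig()
finnish_only = LDIndexConfig(populations=["fin"])
print(finnish_only.resolved_paths()["fin"][0])
```

Keeping the `{POP}` placeholder in the configured string means one flag covers all populations, and a user pointing at a different bucket only has to preserve the placeholder.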
Tasks
[x] Add new initialization arguments to the LDIndexStep and VariantAnnotationStep
[x] Add new configuration flags to the Config class
[x] Add new configuration flags to the hydra config in the config directory
[x] Update v2.1.1 populations by adding the Estonian population (currently missing)
[x] Add functionality to infer the version of the gnomAD release
[ ] Update output path for ot_ld_index and ot_variant_annotation steps to use the gnomAD release
[x] Unify liftover chain flags
Acceptance tests
How do we know the task is complete?
When one can provide additional configuration to the preprocess_gnomad DAG and to the gentropy variant_annotation and ld_index steps.
When running the preprocess_gnomad DAG does not fail due to a non-empty output bucket.
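One way to approach the remaining task (output paths that reflect the gnomAD release) is to include the release version in the output prefix, so that a new release writes to a fresh location instead of colliding with existing output. The helper `versioned_output_path`, the `gnomad_<version>` directory layout and the bucket name below are all assumptions made for illustration only.

```python
def versioned_output_path(base_path: str, gnomad_version: str, dataset: str) -> str:
    """Build an output path that embeds the gnomAD release (hypothetical layout)."""
    if not gnomad_version:
        raise ValueError("gnomad_version must not be empty")
    return "/".join(
        [
            base_path.rstrip("/"),           # avoid a doubled slash after the base
            f"gnomad_{gnomad_version}",      # one prefix per gnomAD release
            dataset.strip("/"),
        ]
    )


# The bucket name is a placeholder, not a real staging bucket.
print(versioned_output_path("gs://example-staging-bucket/", "4.1", "variant_annotation"))
```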