opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Explore impact of ambiguous usage of Ensembl- and gnomAD-based coordinates for indels #3274

Closed DSuveges closed 1 month ago

DSuveges commented 3 months ago

It has been noticed in the past (when mapping GWAS Catalog curated associations) that the coordinates of indels in Ensembl and gnomAD differ by one position (more info here: https://www.biostars.org/p/84686/). This affects indels only. Because of this discrepancy, indels from GWAS Catalog cannot be resolved against the gnomAD variant reference, so there are no indels in the production genetics portal.
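To illustrate why a one-base shift is enough to break the lookup, here is a minimal Python sketch. It assumes variants are resolved by exact match on chromosome, position, reference allele and alternate allele. The `gnomad_reference` set and the `resolve` helper are hypothetical stand-ins, not gentropy code, and the coordinates are borrowed from the records quoted further down in this issue.

```python
# Illustrative only: a tiny stand-in for a gnomAD-based variant reference,
# keyed by (chromosome, position, ref, alt).
gnomad_reference = {
    ("10", 41776411, "C", "CT"),  # insertion, gnomAD-style coordinate
    ("6", 100124715, "C", "A"),   # SNV
}


def resolve(chromosome: str, position: int, ref: str, alt: str) -> bool:
    """Return True if the variant is found by exact match in the reference."""
    return (chromosome, position, ref, alt) in gnomad_reference


# An SNV has the same coordinate under both conventions, so it resolves.
snv_found = resolve("6", 100124715, "C", "A")

# The same insertion reported one base downstream is not found.
indel_found = resolve("10", 41776412, "C", "CT")

print(snv_found, indel_found)  # True False
```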

Tasks:

Based on what we find, we need to work out how to unify the indel representation in gentropy.

DSuveges commented 2 months ago

Based on a discussion with @addramir and @d0choa, we agreed that changing the coordinates of the variant annotation table would be too disruptive, with wide-ranging negative consequences for variant data harmonisation and ingestion. So it has been decided that we revert:

The datasets that need to be re-generated:

DSuveges commented 2 months ago

QC Variant annotation

Link to QC notebook: here

Conclusions:

New data:

-RECORD 0---------------------------
 variantId       | 10-41776411-C-CT 
 chromosome      | 10               
 position        | 41776411         
 referenceAllele | C                
 alternateAllele | CT  

Old data:

-RECORD 0----------------------------
 variantId        | 10_41776412_C_CT 
 chromosome       | 10               
 position         | 41776412         
 referenceAllele  | C                
 alternateAllele  | CT               
 gnomad3VariantId | 10-41776411-C-CT 
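
The difference between the two identifiers in the old record can be checked with a short Python sketch. The `Variant` tuple and `parse_variant_id` helper are hypothetical names introduced here for illustration. The parser assumes a four-field ID separated by either underscores or hyphens, and would not handle chromosome or contig names that themselves contain those characters.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split an ID such as 10_41776412_C_CT or 10-41776411-C-CT into fields."""
    chromosome, position, ref, alt = re.split(r"[_-]", variant_id)
    return Variant(chromosome, int(position), ref, alt)


old = parse_variant_id("10_41776412_C_CT")     # variantId in the old data
gnomad = parse_variant_id("10-41776411-C-CT")  # gnomad3VariantId in the old data

offset = old.position - gnomad.position
same_alleles = (old.ref, old.alt) == (gnomad.ref, gnomad.alt)

print(offset, same_alleles)  # 1 True
```

For this record the alleles are identical and only the position differs, by one base.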

QC LD-index

Old ld-index:

+-----------------+
|variantId        |
+-----------------+
|6_100124716_CA_C |
|6_100124716_C_CA |
|6_100124715_C_A  |
|6_100124716_C_CAA|
+-----------------+

New ld-index:

+-----------------+
|variantId        |
+-----------------+
|6_100124715_C_CA |
|6_100124715_C_CAA|
|6_100124715_C_A  |
|6_100124715_CA_C |
+-----------------+
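
The relationship between the two ld-index samples above can be sketched in Python. This is a consistency check on the four pasted IDs, not the code that produced the new index. It assumes an indel is any variant whose reference and alternate alleles differ in length, and that the old indel positions were exactly one base higher than the gnomAD ones. `is_indel` and `realign` are hypothetical helper names.

```python
def is_indel(ref: str, alt: str) -> bool:
    """Treat any length difference between the alleles as an indel (assumption)."""
    return len(ref) != len(alt)


def realign(variant_id: str) -> str:
    """Shift an old-style indel ID back by one base; leave SNVs untouched."""
    chromosome, position, ref, alt = variant_id.split("_")
    if is_indel(ref, alt):
        position = str(int(position) - 1)
    return "_".join([chromosome, position, ref, alt])


old_ld_index = [
    "6_100124716_CA_C",
    "6_100124716_C_CA",
    "6_100124715_C_A",
    "6_100124716_C_CAA",
]
new_ld_index = {
    "6_100124715_C_CA",
    "6_100124715_C_CAA",
    "6_100124715_C_A",
    "6_100124715_CA_C",
}

realigned = {realign(variant_id) for variant_id in old_ld_index}
print(realigned == new_ld_index)  # True
```

The SNV `6_100124715_C_A` is unchanged, while the three indels move from 100124716 to 100124715, which matches the new sample.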

d0choa commented 2 months ago

It might be my misunderstanding but mentioning just in case. The variantId in the new data pasted above has hyphens instead of underscores. That would not play well with everything else.

DSuveges commented 2 months ago

> It might be my misunderstanding but mentioning just in case. The variantId in the new data pasted above has hyphens instead of underscores. That would not play well with everything else.

Yes, you are right! I just renamed the column from gnomad3VariantId to variantId. Applying the fix. Done.
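
For reference, converting a hyphen-separated gnomAD-style ID into the underscore-separated form used elsewhere could look like the sketch below. `to_underscore_id` is a hypothetical helper in plain Python, not the fix that was pushed, and it assumes the ID has exactly four hyphen-separated fields.

```python
def to_underscore_id(gnomad_id: str) -> str:
    """Convert e.g. 10-41776411-C-CT to 10_41776411_C_CT."""
    # Unpacking into four names raises ValueError on malformed IDs
    # instead of silently passing them through.
    chromosome, position, ref, alt = gnomad_id.split("-")
    return "_".join([chromosome, position, ref, alt])


print(to_underscore_id("10-41776411-C-CT"))  # 10_41776411_C_CT
```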

DSuveges commented 2 months ago

Fix applied and code pushed to the branch. The DAG has been re-run and the output saved here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/variant_annotation

Sample:

-RECORD 0----------------------------------
 variantId          | 10_41776397_C_G      
 chromosome         | 10                   
 position           | 41776397             
 chromosomeB37      | 10                   
 positionB37        | 42463812             
 referenceAllele    | C                    
 alternateAllele    | G                    
 rsIds              | [rs1427850406]       
 alleleType         | snv                  
 alleleFrequencies  | [{afr_adj, 0.0}, ... 
 vep                | {intergenic_varia... 
 inSilicoPredictors | {{1.028, -0.05497... 
only showing top 1 row