Closed DSuveges closed 1 month ago
Based on a discussion with @addramir and @d0choa, we agreed that changing the coordinates of the variant annotation table is too disruptive, with wide-ranging negative consequences for variant data harmonisation and ingestion. We have therefore decided to revert:
Link to QC notebook: here
Conclusions:
gnomad3VariantId column is dropped. Good.
New data:
-RECORD 0---------------------------
variantId | 10-41776411-C-CT
chromosome | 10
position | 41776411
referenceAllele | C
alternateAllele | CT
Old data:
-RECORD 0----------------------------
variantId | 10_41776412_C_CT
chromosome | 10
position | 41776412
referenceAllele | C
alternateAllele | CT
gnomad3VariantId | 10-41776411-C-CT
Old ld-index:
+-----------------+
|variantId |
+-----------------+
|6_100124716_CA_C |
|6_100124716_C_CA |
|6_100124715_C_A |
|6_100124716_C_CAA|
+-----------------+
New ld-index:
+-----------------+
|variantId |
+-----------------+
|6_100124715_C_CA |
|6_100124715_C_CAA|
|6_100124715_C_A |
|6_100124715_CA_C |
+-----------------+
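The shift between the two ld-index samples above can be sketched in plain Python. This is only an illustration, assuming (as the four sample IDs suggest) that the old index stored indels one position to the right of the new coordinate and left SNVs untouched. `is_indel` and `old_to_new_variant_id` are hypothetical helpers, not gentropy functions.

```python
def is_indel(ref: str, alt: str) -> bool:
    """Treat any variant whose alleles differ in length as an indel."""
    return len(ref) != len(alt)


def old_to_new_variant_id(variant_id: str) -> str:
    """Shift an old-style indel ID back by one position (assumed offset).

    Assumes the chromosome name itself contains no underscore.
    """
    chrom, pos, ref, alt = variant_id.split("_")
    if is_indel(ref, alt):
        pos = str(int(pos) - 1)
    return "_".join([chrom, pos, ref, alt])


# The four IDs from the old ld-index sample.
old_ids = [
    "6_100124716_CA_C",
    "6_100124716_C_CA",
    "6_100124715_C_A",
    "6_100124716_C_CAA",
]
new_ids = [old_to_new_variant_id(v) for v in old_ids]
print(new_ids)
```

Under that assumption the three indels move to position 100124715 while the SNV keeps its ID, which matches the new ld-index sample.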
It might be my misunderstanding but mentioning just in case. The variantId in the new data pasted above has hyphens instead of underscores. That would not play well with everything else.
Yes, you are right! I just renamed the column from gnomad3VariantId to variantId. Applying the fix. Done.
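For illustration, the separator fix amounts to something like the sketch below. It is not the code that was pushed to the branch. `gnomad_to_variant_id` is a hypothetical name, and the function assumes the gnomAD-style ID always has exactly four hyphen-separated fields (chromosome, position, reference allele, alternate allele).

```python
def gnomad_to_variant_id(gnomad_id: str) -> str:
    """Convert a hyphen-separated gnomAD-style ID to the underscore form."""
    # Split into at most four fields: chromosome, position, ref, alt.
    chrom, pos, ref, alt = gnomad_id.split("-", 3)
    return "_".join([chrom, pos, ref, alt])


# The ID from the "New data" record above.
print(gnomad_to_variant_id("10-41776411-C-CT"))  # 10_41776411_C_CT
```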
Fix applied, code pushed to branch. The DAG was re-run and the output saved here: gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/variant_annotation
Sample:
-RECORD 0----------------------------------
variantId | 10_41776397_C_G
chromosome | 10
position | 41776397
chromosomeB37 | 10
positionB37 | 42463812
referenceAllele | C
alternateAllele | G
rsIds | [rs1427850406]
alleleType | snv
alleleFrequencies | [{afr_adj, 0.0}, ...
vep | {intergenic_varia...
inSilicoPredictors | {{1.028, -0.05497...
only showing top 1 row
It has been noticed in the past (when mapping GWAS Catalog curated associations) that the coordinates of indels in Ensembl and GnomAD differ by one position (more info here: https://www.biostars.org/p/84686/). This affects indels only. Because of this discrepancy, indels from GWAS Catalog could not be resolved against the GnomAD variant reference, so there are no indels in the production genetics portal.
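A toy example of why the lookup fails: with an exact match on variantId, an indel whose position is off by one never resolves, while an SNV does. The reference set and the `resolve` helper are hypothetical. The IDs are borrowed from the sample records in this thread, and treating the shifted ID as the GWAS Catalog side is an assumption made purely for illustration.

```python
# Hypothetical reference set keyed on GnomAD-style coordinates.
reference_ids = {"10_41776411_C_CT", "10_41776397_C_G"}


def resolve(variant_id: str, reference: set) -> bool:
    """Exact-match lookup of a variant ID against the reference set."""
    return variant_id in reference


gwas_indel = "10_41776412_C_CT"  # same insertion, one position to the right
gwas_snv = "10_41776397_C_G"     # SNV, coordinates agree

print(resolve(gwas_indel, reference_ids))  # False
print(resolve(gwas_snv, reference_ids))    # True
```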
Tasks:
Based on what we find, we need to work out how to unify the indel representation in gentropy.