opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Removing tag variant with strange ids from LD index #3622

Open DSuveges opened 1 week ago

DSuveges commented 1 week ago

We have noticed a number of credible sets with strange tag variant ids:

+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|variantId                   |locusIds                                                                                                                                                                                                                                                                                                                                                                                                                |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|14_KI270846v1_alt_496715_C_A|[396d09cef37be9c225bc823e08318533, 2b5037a1dd35a30114156238310b5ea2]                                                                                                                                                                                                                                                                                                                                                    |
|15_KI270850v1_alt_48777_C_T |[9b5f0d5a0b673e285f2e9c42567479f6]                                                                                                                                                                                                                                                                                                                                                                                      |
|15_KI270850v1_alt_82284_G_A |[9b5f0d5a0b673e285f2e9c42567479f6]                                                                                                                                                                                                                                                                                                                                                                                      |
|17_KI270857v1_alt_811563_C_G|[52bd85988ffb4a423c4ddde8f2acd541, 6582e22232b857e7f6c1bb2b78c11588, 331cd6dad2da78cdbee9342b5da00c92, ed3f1c1eca3fa2076147440cae35aecc, 3cae5c79deb1ce0b9467e3f6664a476d, 1f3f9d215d05dcf7e46aa4023c15bee4, ee90704f29898619ca15dd56d2901ade, 3236ea29879564583ef0b58da1e20024, 413f88b1d4f8c2a9cda7e7b134cb29b8, b9d971601c8f5c060a2c135c426f6cdd, 3d395d04dd296ad3bdb4ab9fb5d0d82f, 07c4af8446672e15b56c4223ac86e905]|
|22_KI270928v1_alt_52888_C_T |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
|22_KI270928v1_alt_52900_G_T |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
|22_KI270928v1_alt_52901_G_C |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
|22_KI270928v1_alt_52906_G_A |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
|22_KI270928v1_alt_52910_C_G |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
|22_KI270928v1_alt_52912_T_G |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
|22_KI270928v1_alt_52919_G_C |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
|22_KI270928v1_alt_60390_A_G |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
|22_KI270928v1_alt_60414_G_A |[c8cfa456583b2d3f992684d91d6159be]                                                                                                                                                                                                                                                                                                                                                                                      |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 20 rows

These strange variant identifiers (e.g. 22_KI270928v1_alt_60414_G_A) come from the LD index (gs://genetics_etl_python_playground/static_assets/ld_index). A test identified ~2.3M such variants in the LD index. If a lead variant is in LD with any of them, the resulting credible set page will be broken, because these variants sit on alternate contigs (the KI27...v1_alt part of the identifier) rather than on canonical chromosomes and are therefore not in the variant index.

Recommended course of action:

DSuveges commented 1 week ago

Spark expression to flag tag variants that have a canonical identifier:

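# Assumes: import pyspark.sql.functions as f
.withColumn(
    'isIdOK',
    # Chromosome must be 1-22, X or Y; a plain [1-9XY]{1,2} class would reject 10 and 20.
    f.col("tagVariantId").rlike(r'^([1-9]|1[0-9]|2[0-2]|X|Y)_\d+_[ATGC]+_[ATGC]+$')
)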
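
To sanity-check this kind of pattern outside Spark, here is a minimal plain-Python sketch. It assumes canonical identifiers have the form chromosome_position_ref_alt with chromosomes 1-22, X and Y only (whether MT should also be accepted is not stated in this issue). The helper name is_canonical and the two non-alt identifiers are hypothetical and purely illustrative; the two alt-contig identifiers are taken from the table above.

```python
import re

# Assumed canonical form: <chrom>_<pos>_<ref>_<alt>, with chrom in 1-22, X, Y.
# This is an illustrative stand-in for the Spark rlike() pattern, not the
# pipeline's actual implementation.
CANONICAL_ID = re.compile(r"([1-9]|1[0-9]|2[0-2]|X|Y)_\d+_[ACGT]+_[ACGT]+")


def is_canonical(variant_id: str) -> bool:
    """Return True if the whole identifier matches the assumed canonical form."""
    return CANONICAL_ID.fullmatch(variant_id) is not None


examples = [
    "22_KI270928v1_alt_60414_G_A",   # from the issue: alternate contig
    "14_KI270846v1_alt_496715_C_A",  # from the issue: alternate contig
    "10_1000_A_G",                   # hypothetical identifier on chromosome 10
    "X_5000_AT_A",                   # hypothetical deletion on chromosome X
]

# Map each identifier to its flag, mirroring the isIdOK column.
flags = {variant_id: is_canonical(variant_id) for variant_id in examples}

for variant_id, ok in flags.items():
    print(f"{variant_id}\t{ok}")
```

The chromosome part is spelled out as an alternation rather than a character class so that two-digit chromosomes containing a zero (10, 20) pass, while the alt-contig identifiers fail on the non-numeric contig name.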