opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

StudyLocus dataset contains multiple rows with the same ID #3151

Closed ireneisdoomed closed 7 months ago

ireneisdoomed commented 7 months ago

Describe the bug studyLocusId is not a unique ID in the studyLocus dataset.

Observed behaviour An example:

# Assumes an active SparkSession named `spark`.
from pyspark.sql import functions as f

(
    spark.read.parquet("gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/credible_set/catalog_curated")
    .filter(f.col("studyId") == "GCST001160_1")
    .filter(f.col("variantId").isNull())
    .select("studyLocusId", "variantId", "studyId", "pValueMantissa", "pValueExponent", "qualityControls")
    .show(truncate=False)
)
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+
|studyLocusId       |variantId|studyId     |pValueMantissa|pValueExponent|qualityControls                                                                  |
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+
|6444058520803978058|null     |GCST001160_1|2.0           |-12           |[No mapping in GnomAd, Variant not found in LD reference]                        |
|6444058520803978058|null     |GCST001160_1|2.0           |-10           |[No mapping in GnomAd, Variant not found in LD reference]                        |
|6444058520803978058|null     |GCST001160_1|9.0           |-8            |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null     |GCST001160_1|1.0           |-6            |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null     |GCST001160_1|3.0           |-6            |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null     |GCST001160_1|5.0           |-6            |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+

As can be seen, the root cause is that these are associations for which we don't have a valid variantId. studyLocusId is the result of hashing the values in the study and variant ID columns. When variantId is missing, the hash depends only on studyId, so every such association in the same study gets the same ID.
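To make the collision concrete, here is a minimal pure-Python sketch. It is not the pipeline's code: toy_study_locus_id is a hypothetical stand-in that hashes whichever ID parts are present, sha256 is an arbitrary choice of hash, and the variant ID "1_1000_A_T" is made up for comparison.

```python
import hashlib
from typing import Optional


def toy_study_locus_id(study_id: str, variant_id: Optional[str]) -> str:
    """Hypothetical stand-in: hash the ID parts that are present, skipping missing ones."""
    parts = [p for p in (study_id, variant_id) if p is not None]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]


# Two distinct associations from the same study, both lacking a variantId.
id_a = toy_study_locus_id("GCST001160_1", None)
id_b = toy_study_locus_id("GCST001160_1", None)

# An association with a (made-up) variantId, for comparison.
id_c = toy_study_locus_id("GCST001160_1", "1_1000_A_T")

# Only the study ID contributes when variantId is missing, so the IDs collide.
ids_collide = id_a == id_b
differs_with_variant = id_a != id_c
```

Under these assumptions ids_collide is true for any two associations in the same study that lack a variantId, which matches the six rows shown above.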

Expected behaviour This dataset shouldn't have duplicated IDs. A simple solution is to adapt the assign_study_locus_id method to use a random value as variantId when it is missing. As a drawback, this makes the IDs of those rows non-deterministic, which can make debugging difficult. However, I think the risks are very low, as these associations are just carried over without producing insights down the pipeline.
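A rough sketch of the proposed fallback, again in plain Python rather than the actual method: toy_id_with_random_fallback is a hypothetical name, uuid4 is just one way to get a random value, and sha256 stands in for whatever hash the pipeline uses.

```python
import hashlib
import uuid
from typing import Optional


def toy_id_with_random_fallback(study_id: str, variant_id: Optional[str]) -> str:
    """Hypothetical sketch: use a random token in place of a missing variantId."""
    # Assumption: a fresh random value per row is acceptable when variantId is missing.
    variant_part = variant_id if variant_id is not None else uuid.uuid4().hex
    return hashlib.sha256(f"{study_id}|{variant_part}".encode()).hexdigest()[:16]


# Six associations without a variantId, as in the example above.
null_variant_ids = [toy_id_with_random_fallback("GCST001160_1", None) for _ in range(6)]
all_unique = len(set(null_variant_ids)) == len(null_variant_ids)

# Rows that do have a variantId (made up here) keep a deterministic ID.
stable_a = toy_id_with_random_fallback("GCST001160_1", "1_1000_A_T")
stable_b = toy_id_with_random_fallback("GCST001160_1", "1_1000_A_T")
still_deterministic = stable_a == stable_b
```

With this approach only the rows missing a variantId get IDs that change between runs. Everything else stays reproducible, which is the trade-off described above.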