Describe the bug
studyLocusId is not a unique ID in the studyLocus dataset.
Observed behaviour
An example:
spark.read.parquet("gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/credible_set/catalog_curated").filter(f.col("studyId") == "GCST001160_1").filter(f.col("variantId").isNull()).select("studyLocusId", "variantId", "studyId", "pValueMantissa", "pValueExponent", "qualityControls").show(truncate=False)
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+
|studyLocusId |variantId|studyId |pValueMantissa|pValueExponent|qualityControls |
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+
|6444058520803978058|null |GCST001160_1|2.0 |-12 |[No mapping in GnomAd, Variant not found in LD reference] |
|6444058520803978058|null |GCST001160_1|2.0 |-10 |[No mapping in GnomAd, Variant not found in LD reference] |
|6444058520803978058|null |GCST001160_1|9.0 |-8 |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null |GCST001160_1|1.0 |-6 |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null |GCST001160_1|3.0 |-6 |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null |GCST001160_1|5.0 |-6 |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+
As can be seen, the root cause is that these are associations for which we don't have a valid variantId. studyLocusId is the result of hashing the values in the study and variant ID columns. When the variant ID is missing, all such associations from the same study end up with the same hash.
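To make the collision concrete, here is a minimal pure-Python sketch. It is not the pipeline's code: toy_study_locus_id is a hypothetical stand-in, SHA-256 replaces whatever hash the real method uses, and dropping the missing value from the key is an assumption about how a null is handled. It only illustrates that when the variant is missing, the study ID is the sole input to the hash.

```python
import hashlib
from typing import Optional


def toy_study_locus_id(study_id: str, variant_id: Optional[str]) -> str:
    """Hypothetical stand-in: hash the study and variant IDs into one ID."""
    # Assumption: a missing variant contributes nothing to the hashed key.
    parts = [part for part in (study_id, variant_id) if part is not None]
    key = "|".join(parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# The six (mantissa, exponent) pairs from the table above: same study,
# no mapped variant for any of them.
p_values = [(2.0, -12), (2.0, -10), (9.0, -8), (1.0, -6), (3.0, -6), (5.0, -6)]
ids = [toy_study_locus_id("GCST001160_1", None) for _ in p_values]

distinct_ids = set(ids)
print(len(ids), len(distinct_ids))  # prints: 6 1
```

The p-value never enters the key, so nothing distinguishes the six rows once the variant is gone.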
Expected behaviour
This dataset shouldn't have duplicated IDs.
A simple solution is to adapt the assign_study_locus_id method to use a random value as variantId when this is missing.
As a drawback, this makes the IDs of these associations non-deterministic, which can make debugging difficult. However, I think the risk is very low, as these associations are just carried over without producing insights down the pipeline.
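A rough sketch of the proposed fallback, again in plain Python rather than the actual Spark method. The function name, the use of uuid4 as the random value, and SHA-256 as the hash are all assumptions made for illustration. It shows both sides of the trade-off: IDs stay stable for mapped variants but change between runs for unmapped ones.

```python
import hashlib
import uuid
from typing import Optional


def toy_study_locus_id_with_fallback(study_id: str, variant_id: Optional[str]) -> str:
    """Hypothetical sketch: hash study and variant, with a random fallback."""
    # Assumption: a random placeholder replaces the missing variant ID.
    variant_key = variant_id if variant_id is not None else uuid.uuid4().hex
    key = f"{study_id}|{variant_key}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# Six unmapped associations from the same study now get distinct IDs
# (a collision is possible in principle but vanishingly unlikely).
fallback_ids = [toy_study_locus_id_with_fallback("GCST001160_1", None) for _ in range(6)]
all_unique = len(set(fallback_ids)) == len(fallback_ids)

# Mapped variants are unaffected: the same inputs still give the same ID.
# "1_100_A_T" is a made-up variant ID used only for this illustration.
stable = (
    toy_study_locus_id_with_fallback("GCST001160_1", "1_100_A_T")
    == toy_study_locus_id_with_fallback("GCST001160_1", "1_100_A_T")
)
print(all_unique, stable)
```

If reproducibility turns out to matter, a deterministic alternative would be to include other columns in the key for these rows, but that is outside what is proposed here.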