StudyLocus dataset contains multiple rows with the same ID

Describe the bug studyLocusId is not a unique ID in the studyLocus dataset.

Observed behaviour An example:

import pyspark.sql.functions as f

# Associations of one study that have no variantId
(
    spark.read.parquet("gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/credible_set/catalog_curated")
    .filter(f.col("studyId") == "GCST001160_1")
    .filter(f.col("variantId").isNull())
    .select("studyLocusId", "variantId", "studyId", "pValueMantissa", "pValueExponent", "qualityControls")
    .show(truncate=False)
)
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+
|studyLocusId       |variantId|studyId     |pValueMantissa|pValueExponent|qualityControls                                                                  |
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+
|6444058520803978058|null     |GCST001160_1|2.0           |-12           |[No mapping in GnomAd, Variant not found in LD reference]                        |
|6444058520803978058|null     |GCST001160_1|2.0           |-10           |[No mapping in GnomAd, Variant not found in LD reference]                        |
|6444058520803978058|null     |GCST001160_1|9.0           |-8            |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null     |GCST001160_1|1.0           |-6            |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null     |GCST001160_1|3.0           |-6            |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
|6444058520803978058|null     |GCST001160_1|5.0           |-6            |[Subsignificant p-value, No mapping in GnomAd, Variant not found in LD reference]|
+-------------------+---------+------------+--------------+--------------+---------------------------------------------------------------------------------+

As the output shows, the root cause is that these associations have no valid variantId. studyLocusId is the result of hashing the values in the study and variant ID columns, so when variantId is missing, every such association in the same study gets the same hash.
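
The collision can be reproduced outside Spark with a minimal sketch. The helper `toy_study_locus_id` and its MD5-based hash are assumptions made for illustration only; the hash function the pipeline actually uses is not shown in this issue.

```python
import hashlib
from typing import Optional


def toy_study_locus_id(study_id: str, variant_id: Optional[str]) -> str:
    """Hypothetical stand-in for the real ID: a hash of study ID and variant ID."""
    # A missing variant ID contributes nothing to the hashed string.
    key = f"{study_id}|{variant_id or ''}"
    return hashlib.md5(key.encode()).hexdigest()


# Three associations from the same study, none with a variant ID
# (study ID and p-value exponents taken from the table above).
rows = [
    ("GCST001160_1", None, -12),
    ("GCST001160_1", None, -10),
    ("GCST001160_1", None, -8),
]
ids = [toy_study_locus_id(study_id, variant_id) for study_id, variant_id, _ in rows]
distinct_ids = set(ids)  # collapses to a single value
```

Because only the study ID varies the input, distinct associations within one study cannot be told apart by this ID.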

Expected behaviour This dataset shouldn't have duplicated IDs. A simple solution is to adapt the assign_study_locus_id method to use a random value as variantId when it is missing. As a drawback, this makes our IDs partly non-deterministic, which can make debugging difficult. However, I think the risk is very low, as these associations are just carried over without producing insights further down the pipeline.
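
The proposed fallback could look roughly like the sketch below. It is not the pipeline's implementation: the function name, the MD5 hash, and the use of a UUID4 as the random value are all assumptions chosen to keep the example self-contained.

```python
import hashlib
import uuid
from typing import Optional


def toy_study_locus_id_with_fallback(study_id: str, variant_id: Optional[str]) -> str:
    """Hypothetical sketch: hash study and variant ID, randomising a missing variant."""
    # Assumption: a UUID4 string stands in for the missing variant ID, so the
    # resulting ID differs between rows and between runs.
    effective_variant = variant_id if variant_id is not None else uuid.uuid4().hex
    key = f"{study_id}|{effective_variant}"
    return hashlib.md5(key.encode()).hexdigest()


# Six associations without a variant ID now receive six different IDs.
fallback_ids = [
    toy_study_locus_id_with_fallback("GCST001160_1", None) for _ in range(6)
]
```

In this sketch, rows that do have a variantId keep a deterministic ID; only the rows without one get run-dependent IDs, which is the debugging trade-off described above. In Spark, the same idea would need a per-row random column expression rather than a Python-side call.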

StudyLocus dataset contains multiple rows with the same ID #3151