opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

All GWAS Catalog studies are marked as invalid #3559

Closed ireneisdoomed closed 1 month ago

ireneisdoomed commented 1 month ago

Describe the bug: All GCST studies (102,780) have been marked as invalid. This affects every downstream dataset.

Observed behaviour: The QC flags explain the drop. Every one of the 102,780 GCST studies carries the biosample flag:

from pyspark.sql import functions as f
# Assumes an active SparkSession named `spark`.
invalid_studies = spark.read.parquet("gs://ot_orchestration/releases/27.09/invalid_study_index/")
(
    invalid_studies.filter(f.col("studyId").startswith("GCST"))
    .select(f.explode("qualityControls").alias("qc"))
    .groupBy("qc")
    .count()
    .orderBy(f.col("count").desc())
    .show(truncate=False)
)
+----------------------------------------------------+------+
|qc                                                  |count |
+----------------------------------------------------+------+
|Biosample identifier was not found in the reference.|102780|
|No valid disease identifier found.                  |7968  |
|The identifier of this study is not unique.         |142   |
+----------------------------------------------------+------+

Expected behaviour: We shouldn't drop any GWAS Catalog study based on the biosample index.
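
One way to get this behaviour would be to raise the biosample flag only when a study carries a biosample identifier that is missing from the reference, and to leave studies without an identifier alone. The sketch below is plain Python rather than Spark so it runs anywhere. It assumes, hypothetically, that the flag was raised because GWAS Catalog studies have no biosample identifier (the issue does not state the root cause). The function name, the `biosampleId` field, and the toy study records are illustrative and not the pipeline's actual API or data.

```python
BIOSAMPLE_FLAG = "Biosample identifier was not found in the reference."


def biosample_qc(study, reference_ids):
    """Return the QC flags a study would get from the biosample check.

    Hypothetical rule: a missing (None) identifier is not an error. Only an
    identifier that is present but absent from the reference is flagged.
    """
    biosample_id = study.get("biosampleId")
    if biosample_id is not None and biosample_id not in reference_ids:
        return [BIOSAMPLE_FLAG]
    return []


# Toy data: identifiers and study IDs are made up for illustration.
reference = {"UBERON_0000178"}
studies = [
    {"studyId": "GCST000001", "biosampleId": None},
    {"studyId": "QTL_1", "biosampleId": "UBERON_0000178"},
    {"studyId": "QTL_2", "biosampleId": "UBERON_9999999"},
]
flags = {s["studyId"]: biosample_qc(s, reference) for s in studies}
```

With this rule the study without a biosample identifier passes, the study with a known identifier passes, and only the study with an unknown identifier is flagged.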

Tobi1kenobi commented 1 month ago

This is fixed here. I'll open a PR soon, once I've added a test to make sure this won't happen again.
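
A test along those lines could assert that no single QC flag is attached to every study from a given source, since a flag with 100% coverage usually points to a systematic bug rather than bad data. This is a rough plain-Python sketch, not the test from the actual fix. `flags_covering_all` and the toy records are hypothetical.

```python
from collections import Counter


def flags_covering_all(studies, prefix):
    """Return the QC flags attached to every study whose ID starts with prefix."""
    selected = [s for s in studies if s["studyId"].startswith(prefix)]
    # Count each flag at most once per study.
    counts = Counter(qc for s in selected for qc in set(s["qualityControls"]))
    return sorted(qc for qc, n in counts.items() if n == len(selected))


# Toy records mimicking the situation reported in this issue.
broken = [
    {"studyId": "GCST1", "qualityControls": [
        "Biosample identifier was not found in the reference."]},
    {"studyId": "GCST2", "qualityControls": [
        "Biosample identifier was not found in the reference.",
        "No valid disease identifier found."]},
]
# Toy records after a hypothetical fix: the biosample flag is gone.
fixed = [
    {"studyId": "GCST1", "qualityControls": []},
    {"studyId": "GCST2", "qualityControls": ["No valid disease identifier found."]},
]

suspicious_before = flags_covering_all(broken, "GCST")
suspicious_after = flags_covering_all(fixed, "GCST")
```

A regression test would then fail whenever the returned list is non-empty for the GCST prefix.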