Describe the bug
All GCST studies (102,780) have been marked as invalid. This affects every downstream dataset.
Observed behaviour
The drop is explained by the QC flags:
import pyspark.sql.functions as f
invalid_studies = spark.read.parquet("gs://ot_orchestration/releases/27.09/invalid_study_index/")
invalid_studies.filter(f.col("studyId").startswith("GCST")).select(f.explode("qualityControls").alias("qc")).groupBy("qc").count().orderBy(f.col("count").desc()).show(truncate=False)
+----------------------------------------------------+------+
|qc                                                  |count |
+----------------------------------------------------+------+
|Biosample identifier was not found in the reference.|102780|
|No valid disease identifier found.                  |7968  |
|The identifier of this study is not unique.         |142   |
+----------------------------------------------------+------+
Expected behaviour
We shouldn't drop any GWAS Catalog study based on the biosample index.