GWAS Catalog harmonised sumstats are incompatible with `SummaryStatistics` schema

ireneisdoomed commented 7 months ago

**Describe the bug**
Loading the parquet files of the GWAS Catalog harmonised summary statistics into a `SummaryStatistics` object creates an empty dataframe.

**Observed behaviour**

Loading the content of the directory directly raises a `ValueError`:


from otg.dataset.summary_statistics import SummaryStatistics
from otg.common.session import Session

ss_path = "gs://open-targets-gwas-summary-stats/studies"

session = Session(spark_uri="yarn")
ss = SummaryStatistics.from_parquet(session, ss_path)

ValueError: Parquet file is empty: gs://open-targets-gwas-summary-stats/studies

Reading the parquet files without providing a schema raises "Unable to infer schema":

all_ss = session.spark.read.parquet(ss_path, recursiveFileLookUp=True)
>>> AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

Reading the parquet files and providing the summary stats schema returns an empty df.

session.read_parquet(ss_path, schema=SummaryStatistics.get_schema(), recursiveFileLookUp=True).show()
+-------+---------+----------+--------+----+----------+--------------+--------------+-------------------------------+-------------+
|studyId|variantId|chromosome|position|beta|sampleSize|pValueMantissa|pValueExponent|effectAlleleFrequencyFromSource|standardError|
+-------+---------+----------+--------+----+----------+--------------+--------------+-------------------------------+-------------+
+-------+---------+----------+--------+----+----------+--------------+--------------+-------------------------------+-------------+

Reading each study independently and unioning them works but it is very inefficient.


import gcsfs
from functools import reduce
ss_all_studies = [f"gs://{path}" for path in gcsfs.GCSFileSystem().ls(ss_path)]
assert len(ss_all_studies) == 11551

all_dfs = []
for study_path in ss_all_studies:
    df = session.spark.read.parquet(study_path)
    all_dfs.append(df)
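
The snippet above stops after collecting the per-study dataframes; the union step itself is not shown. A minimal sketch of how it might look, assuming every study shares the same columns. The helper name `union_all` is hypothetical and not part of `otg`; `unionByName` is the standard PySpark `DataFrame` method.

```python
from functools import reduce


def union_all(dfs):
    """Union a non-empty list of Spark DataFrames by column name.

    Hypothetical helper, not part of otg. Assumes all dataframes
    share the same set of columns.
    """
    if not dfs:
        raise ValueError("No dataframes to union")
    # Fold pairwise: ((df1 U df2) U df3) ...
    return reduce(lambda left, right: left.unionByName(right), dfs)


# Usage with the list built above (requires a live Spark session):
# all_ss = union_all(all_dfs)
```

With thousands of studies this builds a very deep query plan, which is likely part of why the approach is slow.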



**Expected behaviour**
If the harmonised summary statistics have more fields than those specified in our schema, `from_parquet` should only load the fields defined in the schema.
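
One way `from_parquet` could behave like this is to project the loaded dataframe onto the schema's field names before validating it. This is a rough sketch, not the actual `otg` implementation: `select_schema_fields` is a hypothetical helper, and the field list is assumed to come from something like `SummaryStatistics.get_schema().fieldNames()`.

```python
def select_schema_fields(df, schema_fields):
    """Keep only the columns of `df` that are defined in the schema.

    Hypothetical helper, not the otg implementation. `df` is expected to
    be a Spark DataFrame (anything exposing `.columns` and `.select`);
    `schema_fields` is a list of field names.
    """
    available = set(df.columns)
    # Preserve schema order; skip fields the data does not have.
    keep = [name for name in schema_fields if name in available]
    return df.select(*keep)


# Usage (requires a live Spark session):
# df = session.spark.read.parquet(study_path)
# df = select_schema_fields(df, SummaryStatistics.get_schema().fieldNames())
```

Note that this only drops extra columns; fields that the schema requires but the data lacks would still need separate handling.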

**More context**
The `SummaryStatistics` schema was recently changed (see https://github.com/opentargets/genetics_etl_python/pull/266). The data in the above path has 2 incompatibilities:
- `betaConfidenceIntervalLower` and `betaConfidenceIntervalUpper` are no longer part of the schema
- `sampleSize` is not part of the data
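
A quick way to surface mismatches like these is to diff the column names of one study's parquet file against the schema's field names. The sketch below is plain Python; `diff_columns` is a hypothetical helper, and the example lists are an illustrative subset of the columns mentioned in this issue, not the full dataset or schema.

```python
def diff_columns(data_columns, schema_fields):
    """Return (extra, missing) column names relative to the schema.

    extra:   present in the data but not in the schema
    missing: defined in the schema but absent from the data
    """
    data, schema = set(data_columns), set(schema_fields)
    return sorted(data - schema), sorted(schema - data)


# Illustrative subset of columns only.
data_columns = ["studyId", "variantId", "beta",
                "betaConfidenceIntervalLower", "betaConfidenceIntervalUpper"]
schema_fields = ["studyId", "variantId", "beta", "sampleSize"]

extra, missing = diff_columns(data_columns, schema_fields)
print(extra)    # ['betaConfidenceIntervalLower', 'betaConfidenceIntervalUpper']
print(missing)  # ['sampleSize']
```

In practice the two inputs would be `df.columns` for a single study and the schema's field names.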

DSuveges commented 7 months ago

The dataset at `gs://open-targets-gwas-summary-stats/studies` is obsolete; it was generated from an old schema. @d0choa is ingesting it again following the modifications.

d0choa commented 7 months ago

yesterday's news 😂 (literally)

ireneisdoomed commented 7 months ago

Closing. I've tried with a sample of the data that is about to be ingested and it is compatible with the schema. Reading it with from_parquet worked.


GWAS Catalog harmonised sumstats are incompatible with `SummaryStatistics` schema #3159