opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

GWAS Catalog harmonised sumstats are incompatible with `SummaryStatistics` schema #3159

Closed ireneisdoomed closed 7 months ago

ireneisdoomed commented 7 months ago

**Describe the bug**
Loading the parquet files of the GWAS Catalog harmonised summary stats into a `SummaryStatistics` object creates an empty dataframe.

**Observed behaviour**

  1. Loading the content of the directory directly raises a `ValueError`.

    from otg.dataset.summary_statistics import SummaryStatistics
    from otg.common.session import Session

    ss_path = "gs://open-targets-gwas-summary-stats/studies"

    session = Session(spark_uri="yarn")
    ss = SummaryStatistics.from_parquet(session, ss_path)
    >>> ValueError: Parquet file is empty: gs://open-targets-gwas-summary-stats/studies

  2. Just reading the parquet files without providing a schema raises "unable to infer schema".

    all_ss = session.spark.read.parquet(ss_path, recursiveFileLookUp=True)
    >>> AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

  3. Reading the parquet files and providing the summary stats schema returns an empty df.

    session.read_parquet(ss_path, schema=SummaryStatistics.get_schema(), recursiveFileLookUp=True).show()
    +-------+---------+----------+--------+----+----------+--------------+--------------+-------------------------------+-------------+
    |studyId|variantId|chromosome|position|beta|sampleSize|pValueMantissa|pValueExponent|effectAlleleFrequencyFromSource|standardError|
    +-------+---------+----------+--------+----+----------+--------------+--------------+-------------------------------+-------------+
    +-------+---------+----------+--------+----+----------+--------------+--------------+-------------------------------+-------------+

  4. Reading each study independently and unioning them works, but it is very inefficient.

    import gcsfs
    from functools import reduce

    # One path per study directory under ss_path
    ss_all_studies = [f"gs://{path}" for path in gcsfs.GCSFileSystem().ls(ss_path)]
    assert len(ss_all_studies) == 11551

    # Read every study separately, then union them by column name
    all_dfs = []
    for study_path in ss_all_studies:
        df = session.spark.read.parquet(study_path)
        all_dfs.append(df)

    all_dfs = reduce(lambda a, b: a.unionByName(b), all_dfs)
    -RECORD 0--------------------------------------
     studyId                         | GCST000028
     variantId                       | 2_34049_G_A
     chromosome                      | 2
     position                        | 34049
     pValueMantissa                  | 5.122
     pValueExponent                  | -1
     beta                            | null
     standardError                   | null
     effectAlleleFrequencyFromSource | 0.04955
     betaConfidenceIntervalLower     | null
     betaConfidenceIntervalUpper     | null



**Expected behaviour**
If the harmonised summary statistics have more fields than those specified in our schema, `from_parquet` should only load the fields defined in the schema.
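
To make the expected behaviour concrete, here is a minimal plain-Python sketch of "only load the fields defined in the schema", applied to the record shown above. It is not the `otg` implementation and it does not use Spark: `project_to_schema` is a hypothetical helper, the field list is copied from the empty dataframe header in step 3, and the Python types of the record values are assumed for illustration.

```python
# Field names copied from the dataframe header printed in the issue.
SCHEMA_FIELDS = [
    "studyId", "variantId", "chromosome", "position", "beta", "sampleSize",
    "pValueMantissa", "pValueExponent", "effectAlleleFrequencyFromSource",
    "standardError",
]


def project_to_schema(record, schema_fields=SCHEMA_FIELDS):
    """Keep only the fields named in the schema, in schema order.

    Hypothetical helper for illustration; fields missing from the
    record are skipped rather than filled in.
    """
    return {name: record[name] for name in schema_fields if name in record}


# RECORD 0 from the issue (value types are assumed).
record = {
    "studyId": "GCST000028",
    "variantId": "2_34049_G_A",
    "chromosome": "2",
    "position": 34049,
    "pValueMantissa": 5.122,
    "pValueExponent": -1,
    "beta": None,
    "standardError": None,
    "effectAlleleFrequencyFromSource": 0.04955,
    "betaConfidenceIntervalLower": None,
    "betaConfidenceIntervalUpper": None,
}

projected = project_to_schema(record)

# Fields present in the data but dropped by the projection.
print(sorted(set(record) - set(projected)))
```

The sketch only drops extra fields. How a field that the schema expects but the data lacks (here `sampleSize`) should be handled is a separate decision that this issue does not settle.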

**More context**
The `SummaryStatistics` schema was recently changed (see https://github.com/opentargets/genetics_etl_python/pull/266). The data in the above path has two incompatibilities with the new schema:
- `betaConfidenceIntervalLower` and `betaConfidenceIntervalUpper` are no longer part of the schema
- `sampleSize` is not part of the data
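
The two incompatibilities can be reproduced by comparing column names alone. The sketch below is plain Python with a hypothetical `schema_diff` helper (not part of `otg`). The schema columns are taken from the empty dataframe header and the data columns from the `RECORD 0` output above. With a real Spark session the same lists would presumably come from the dataframe and schema objects.

```python
def schema_diff(data_columns, schema_columns):
    """Report columns the schema expects but the data lacks, and vice versa.

    Hypothetical helper for illustration only.
    """
    data, schema = set(data_columns), set(schema_columns)
    return {
        "missing_from_data": sorted(schema - data),
        "not_in_schema": sorted(data - schema),
    }


# Columns of the current schema, as printed in the issue.
schema_columns = [
    "studyId", "variantId", "chromosome", "position", "beta", "sampleSize",
    "pValueMantissa", "pValueExponent", "effectAlleleFrequencyFromSource",
    "standardError",
]

# Columns found in the obsolete dataset, as printed in the issue.
data_columns = [
    "studyId", "variantId", "chromosome", "position", "pValueMantissa",
    "pValueExponent", "beta", "standardError",
    "effectAlleleFrequencyFromSource", "betaConfidenceIntervalLower",
    "betaConfidenceIntervalUpper",
]

diff = schema_diff(data_columns, schema_columns)
print(diff["missing_from_data"])  # ['sampleSize']
print(diff["not_in_schema"])
# ['betaConfidenceIntervalLower', 'betaConfidenceIntervalUpper']
```

A check of this kind could run before loading and give a clearer error than an empty dataframe.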
DSuveges commented 7 months ago

The dataset at gs://open-targets-gwas-summary-stats/studies is obsolete: it was generated with an old schema. @d0choa is ingesting it again following the schema modifications.

d0choa commented 7 months ago

Yesterday's news 😂 (literally)

ireneisdoomed commented 7 months ago

Closing. I've tried with a sample of the data that is about to be ingested and it is compatible with the schema. Reading it with from_parquet worked.