Closed ireneisdoomed closed 7 months ago
The gs://open-targets-gwas-summary-stats/studies dataset is obsolete: it was generated from an old schema. @d0choa is ingesting it again following the modifications.
yesterday's news 😂 (literally)
Closing. I've tried with a sample of the data that is about to be ingested and it is compatible with the schema. Reading it with from_parquet worked.
Describe the bug
Loading the parquet files of the GWASCatalog harmonised summary stats into a SummaryStatistics object creates an empty dataframe.
Observed behaviour
ss_path = "gs://open-targets-gwas-summary-stats/studies"
session = Session(spark_uri="yarn")
ss = SummaryStatistics.from_parquet(session, ss_path)
Just reading the parquet files without providing a schema raises "unable to infer schema"
Reading the parquet files and providing the summary stats schema returns an empty df.
Reading each study independently and unioning them works but it is very inefficient.
from functools import reduce

all_dfs = []
for study_path in ss_all_studies:
    df = session.spark.read.parquet(study_path)
    all_dfs.append(df)
all_dfs = reduce(lambda a, b: a.unionByName(b), all_dfs)

-RECORD 0--------------------------------------
studyId | GCST000028
variantId | 2_34049_G_A
chromosome | 2
position | 34049
pValueMantissa | 5.122
pValueExponent | -1
beta | null
standardError | null
effectAlleleFrequencyFromSource | 0.04955
betaConfidenceIntervalLower | null
betaConfidenceIntervalUpper | null
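The per-study loop above issues one read per study and then chains unions. A possible alternative, sketched here under assumptions and not verified against this bucket, is to hand all per-study paths to a single `DataFrameReader.parquet` call, which accepts multiple paths. The helper name `read_studies_in_one_call` is hypothetical, and whether Spark can infer a schema this way still depends on the files on disk sharing a compatible schema.

```python
from typing import Any, Iterable


def read_studies_in_one_call(spark: Any, study_paths: Iterable[str]) -> Any:
    """Read several parquet directories with one DataFrameReader call.

    `spark` is assumed to be a pyspark SparkSession (e.g. `session.spark`);
    `study_paths` is assumed to hold one path per study directory.
    """
    paths = list(study_paths)
    if not paths:
        # Fail early instead of letting Spark raise a less specific error.
        raise ValueError("no study paths provided")
    # DataFrameReader.parquet takes any number of paths as positional arguments.
    return spark.read.parquet(*paths)
```

Usage would look like `df = read_studies_in_one_call(session.spark, ss_all_studies)`, replacing the loop and the `reduce` over `unionByName`.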