Recover EFO mappings for FINNGEN studies

Background

The most recent release (24.03) of the study index does not contain EFO mappings for 2,408 studies from FINNGEN. Currently, reading the entire study index with the following command returns a schema with no EFO column:

study_index = session.spark.read.parquet(study_index_path, recursiveFileLookup=True)
study_index.printSchema()

root
 |-- studyId: string (nullable = true)
 |-- projectId: string (nullable = true)
 |-- studyType: string (nullable = true)
 |-- traitFromSource: string (nullable = true)   * This is just the trait name, e.g. Depressed affect, mood disorder
 |-- geneId: string (nullable = true)
 |-- tissueFromSourceId: string (nullable = true)
 |-- nSamples: integer (nullable = true)
 |-- summarystatsLocation: string (nullable = true)
 |-- hasSumstats: boolean (nullable = true)

This is despite the study index containing 79,861 GWAS Catalog studies (out of 79,872) with EFO mappings.
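One possible explanation, offered as an assumption rather than a confirmed diagnosis: when Spark reads many Parquet files without schema merging, it infers the schema from a subset of the files, so a column that exists only in some files (here, possibly the GWAS Catalog ones) can be dropped from the result. The sketch below enables Spark's `mergeSchema` Parquet option alongside `recursiveFileLookup`. The function names `read_study_index_merged` and `has_efo_column` are hypothetical helpers, and `spark` stands for the `SparkSession` exposed as `session.spark` in the snippet above.

```python
EFO_COLUMN = "traitFromSourceMappedIds"


def read_study_index_merged(spark, study_index_path):
    """Read every Parquet file under the path and merge their schemas.

    `spark` is expected to be a pyspark.sql.SparkSession.
    Assumption: the EFO column is missing because schema inference
    picked files that lack it; mergeSchema takes the union of all
    file schemas instead.
    """
    return (
        spark.read
        .option("mergeSchema", "true")
        .option("recursiveFileLookup", "true")
        .parquet(study_index_path)
    )


def has_efo_column(df):
    """Return True if the DataFrame exposes the EFO column."""
    return EFO_COLUMN in df.columns
```

If the column still does not appear after merging schemas, the mappings are absent from the files themselves rather than lost at read time. Usage would look like `study_index = read_study_index_merged(session.spark, study_index_path)` followed by `has_efo_column(study_index)`.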

We have manually curated EFO mappings for 2,841 FINNGEN studies from the old genetics pipeline.

Changing the studyIds from FINNGENR6... to FINNGENR10... allows direct recovery of 1,858 (~75%) of the EFO mappings for this release.
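The prefix swap described above could be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's implementation: the function names are hypothetical, the prefixes are passed in as parameters because the full studyId format is elided here, and the curated mappings are assumed to be available as a dictionary from old studyId to a list of EFO IDs.

```python
def remap_study_id(study_id, old_prefix, new_prefix):
    """Swap the release prefix of a studyId; leave other IDs untouched."""
    if study_id.startswith(old_prefix):
        return new_prefix + study_id[len(old_prefix):]
    return study_id


def recover_mappings(curated, current_ids, old_prefix, new_prefix):
    """Match curated EFO mappings to the current release by remapped studyId.

    curated:     dict of old studyId -> list of EFO IDs (assumed shape)
    current_ids: studyIds present in the current study index
    Returns (recovered, unmatched): a dict of current studyId -> EFO IDs,
    and a sorted list of current studyIds with no curated mapping.
    """
    remapped = {
        remap_study_id(study_id, old_prefix, new_prefix): efo_ids
        for study_id, efo_ids in curated.items()
    }
    recovered = {sid: remapped[sid] for sid in current_ids if sid in remapped}
    unmatched = sorted(set(current_ids) - set(recovered))
    return recovered, unmatched
```

The `unmatched` list is the part that would still need manual curation or another matching strategy, since those studies have no counterpart among the old curated IDs.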

Tasks

[ ] How do we read in the EFO column ("traitFromSourceMappedIds") when reading the whole study index?
[ ] Should we include our existing EFO terms in the study index for the next release? If so, where should the logic go?
[ ] Do we want to continue manual curation of EFO terms for FINNGEN studies?

opentargets / issues

Recover EFO mappings for FINNGEN studies #3280
