Open xyg123 opened 2 months ago
Background
The most recent release (24.03) of the study index does not contain EFO mappings for 2,408 FINNGEN studies. Currently, reading the entire study index with the following commands yields a schema with no EFO column:
study_index = session.spark.read.parquet(study_index_path, recursiveFileLookup=True)
study_index.printSchema()
root
 |-- studyId: string (nullable = true)
 |-- projectId: string (nullable = true)
 |-- studyType: string (nullable = true)
 |-- traitFromSource: string (nullable = true)   *This is just the trait name, e.g. "Depressed affect", "mood disorder"
 |-- geneId: string (nullable = true)
 |-- tissueFromSourceId: string (nullable = true)
 |-- nSamples: integer (nullable = true)
 |-- summarystatsLocation: string (nullable = true)
 |-- hasSumstats: boolean (nullable = true)
This is despite 79,861 of the 79,872 GWAS Catalog studies in the study index having EFO mappings.
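One possible explanation, not confirmed in this issue, is that the parquet files under the study index path do not all share one schema, so Spark takes the schema from a file that lacks "traitFromSourceMappedIds". Spark's Parquet reader has a documented `mergeSchema` option that builds the union of columns across files. A minimal sketch, assuming that is the cause; the function name `read_study_index_merged` is hypothetical, and `spark` stands for the `session.spark` object used above:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; pyspark must be installed to run a real read
    from pyspark.sql import DataFrame, SparkSession

# Column name taken from the Tasks section of this issue.
EFO_COLUMN = "traitFromSourceMappedIds"


def read_study_index_merged(spark: SparkSession, path: str) -> DataFrame:
    """Read every parquet file under `path` and merge their schemas.

    Assumption: the EFO column is missing only because schema merging is off.
    Rows from files without the column will hold null in it.
    """
    return (
        spark.read
        .option("mergeSchema", "true")          # union of columns across all files
        .option("recursiveFileLookup", "true")  # same recursive lookup as the original call
        .parquet(path)
    )
```

Calling `read_study_index_merged(session.spark, study_index_path).printSchema()` should then show whether the EFO column exists in any of the underlying files. If it still does not appear, the column is absent from the data itself rather than dropped at read time.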
We have manually curated EFO mappings for 2,841 FINNGEN studies from the old genetics pipeline.
Changing the studyId prefix from FINNGENR6... to FINNGENR10... directly recovers 1,858 (~75%) of the EFO mappings for this release.
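The prefix change described above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's actual logic: the function names are hypothetical, the prefixes are copied as written in this issue, and the study IDs and EFO IDs in the example are placeholders rather than real records.

```python
# Prefixes exactly as written in this issue; the rest of each ID is left untouched.
OLD_PREFIX = "FINNGENR6"
NEW_PREFIX = "FINNGENR10"


def remap_study_id(study_id, old_prefix=OLD_PREFIX, new_prefix=NEW_PREFIX):
    """Swap the old release prefix for the new one; other IDs pass through unchanged."""
    if study_id.startswith(old_prefix):
        return new_prefix + study_id[len(old_prefix):]
    return study_id


def recover_efo_mappings(curated, current_study_ids):
    """Match curated mappings (old studyId -> list of EFO IDs) to current studyIds.

    Returns (recovered, missing): a dict of current studyId -> EFO IDs, and a
    list of current studyIds that still have no curated mapping.
    """
    remapped = {remap_study_id(old_id): efos for old_id, efos in curated.items()}
    recovered = {sid: remapped[sid] for sid in current_study_ids if sid in remapped}
    missing = [sid for sid in current_study_ids if sid not in remapped]
    return recovered, missing


# Placeholder data for illustration only (hypothetical IDs).
curated_example = {
    "FINNGENR6_TRAIT_A": ["EFO_0000001"],
    "FINNGENR6_TRAIT_B": ["EFO_0000002"],
}
current_example = ["FINNGENR10_TRAIT_A", "FINNGENR10_TRAIT_C"]
recovered_example, missing_example = recover_efo_mappings(curated_example, current_example)
print(len(recovered_example), "recovered;", len(missing_example), "still unmapped")
```

The `missing` list is what would remain for manual curation, which bears on the third task below.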
Tasks
[ ] How do we read in the EFO column ("traitFromSourceMappedIds") when reading the whole study index?
[ ] Should we include our existing EFO terms in the study index for the next release? If so, where should the logic go?
[ ] Do we want to continue manual curation of EFO terms for FINNGEN studies?