Similar to the work on variant index (#3350), we would like to serve a gwas_study_index dataset through OS + API.
This dataset is created by different upstream ETL processes (gentropy), but they all write to the same location and share a schema that is validated within the ETL. Effectively, they can be considered a single dataset that we would like to load:
❯ gsutil ls gs://genetics_etl_python_playground/releases/24.06/study_index/
gs://genetics_etl_python_playground/releases/24.06/study_index/eqtl_catalogue/
gs://genetics_etl_python_playground/releases/24.06/study_index/finngen/
gs://genetics_etl_python_playground/releases/24.06/study_index/gwas_catalog/
Some stats:
Parquet size: 114.5 MiB
Rows (studies): 1_971_058
Study type breakdown:
+---------+-------+
|studyType| count|
+---------+-------+
| sqtl| 214987| -> No traitFromSourceMappedIds or backgroundTraitFromSourceMappedIds
| pqtl| 802| -> No traitFromSourceMappedIds or backgroundTraitFromSourceMappedIds
| tuqtl| 364493| -> No traitFromSourceMappedIds or backgroundTraitFromSourceMappedIds
| eqtl|1299310| -> No traitFromSourceMappedIds or backgroundTraitFromSourceMappedIds
| gwas| 91466| -> Null geneId, Null biosampleFromSourceId, 13_112 Not null backgroundTraitFromSourceMappedIds
+---------+-------+
Resolvable entities
Some of these columns are nullable, as described in the table above:
traitFromSourceMappedIds -> diseases. traitFromSourceMappedIds should be converted to a diseases array column containing a list of resolvable disease objects.
backgroundTraitFromSourceMappedIds -> backgroundTraits. Exactly the same as the above.
geneId -> target. All values are Ensembl gene IDs and should be converted into resolvable target objects.
There is a special case for `biosampleFromSourceId`. In the future, we might want to resolve this object, but it has some extra complexities that we would like to postpone to a later time.
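The column mapping above could be sketched roughly as follows. This is only an illustration in plain Python, not the actual loader: the function name `to_resolvable`, the `{"id": ...}` reference shape, the choice to turn null ID lists into empty lists, and all ID values are assumptions made for the example.

```python
from typing import Any, Dict

# Source ID column -> resolvable array column, as listed above.
ARRAY_MAPPINGS = {
    "traitFromSourceMappedIds": "diseases",
    "backgroundTraitFromSourceMappedIds": "backgroundTraits",
}


def to_resolvable(study: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a study row with ID columns replaced by entity references."""
    # Keep every other column untouched, including biosampleFromSourceId.
    out = {
        key: value
        for key, value in study.items()
        if key not in ARRAY_MAPPINGS and key != "geneId"
    }
    for source, target in ARRAY_MAPPINGS.items():
        # Assumption: a null or missing ID list becomes an empty list.
        out[target] = [{"id": item} for item in (study.get(source) or [])]
    gene_id = study.get("geneId")
    # Assumption: a null geneId (as in gwas studies) yields a null target.
    out["target"] = {"id": gene_id} if gene_id is not None else None
    return out


# Placeholder rows; the IDs are made up for illustration.
gwas_study = {
    "studyId": "STUDY_1",
    "studyType": "gwas",
    "traitFromSourceMappedIds": ["DISEASE_ID_1", "DISEASE_ID_2"],
    "backgroundTraitFromSourceMappedIds": None,
    "geneId": None,
    "biosampleFromSourceId": None,
}
qtl_study = {
    "studyId": "STUDY_2",
    "studyType": "eqtl",
    "geneId": "GENE_ID_1",
    "biosampleFromSourceId": "BIOSAMPLE_ID_1",
}

resolved_gwas = to_resolvable(gwas_study)
resolved_qtl = to_resolvable(qtl_study)
```

In this sketch `biosampleFromSourceId` is passed through unchanged, in line with postponing its resolution.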
The latest stable version of the study index aligned with the schema provided above can be found here: gs://genetics_etl_python_playground/releases/24.06/study_index/
Some of the sub-datasets do not contain every column, so I had to use the following option to read everything: spark.read.option('mergeSchema', 'true').parquet(... ). In the future, this dataset might come from a single parquet instead of a directory of parquets with compatible schemas.
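For anyone loading the data without Spark, the effect of reading with a merged schema can be pictured like this: the result has the union of all columns, and a column that a sub-dataset does not provide comes back as null. The plain-Python sketch below only mimics that behaviour on in-memory rows; `merge_schema` and the sample rows are hypothetical and are not part of gentropy or Spark.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_schema(*datasets: List[Row]) -> List[Row]:
    """Concatenate datasets, giving every row the union of all columns."""
    columns: List[str] = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    # Columns absent from a row are filled with None (null).
    return [
        {name: row.get(name) for name in columns}
        for rows in datasets
        for row in rows
    ]


# Placeholder rows standing in for two sub-datasets with different columns.
qtl_rows = [{"studyId": "STUDY_2", "studyType": "eqtl", "geneId": "GENE_ID_1"}]
gwas_rows = [
    {
        "studyId": "STUDY_1",
        "studyType": "gwas",
        "traitFromSourceMappedIds": ["DISEASE_ID_1"],
    }
]

merged = merge_schema(qtl_rows, gwas_rows)
```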