opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

`gwas_study_index` data for backend integration #3357

Open d0choa opened 3 days ago

d0choa commented 3 days ago

Similar to the work on variant index (#3350), we would like to serve a gwas_study_index dataset through OS + API.

This dataset is produced by several upstream ETL processes (gentropy), but they all write to the same location and share a schema that is validated within the ETL. Effectively, they can be treated as a single dataset that we would like to load:

❯ gsutil ls gs://genetics_etl_python_playground/releases/24.06/study_index/
gs://genetics_etl_python_playground/releases/24.06/study_index/eqtl_catalogue/
gs://genetics_etl_python_playground/releases/24.06/study_index/finngen/
gs://genetics_etl_python_playground/releases/24.06/study_index/gwas_catalog/

Some stats:

Resolvable entities

Some of these columns are nullable, as described in the table above:

There is a special case for `biosampleFromSourceId`. In the future, we might want to resolve this object, but it carries some extra complexity that we would like to postpone.

The latest stable version of the study index aligned with the schema provided above can be found here: gs://genetics_etl_python_playground/releases/24.06/study_index/

Some of the sub-datasets do not contain every column, so I had to use the following option to read everything: `spark.read.option('mergeSchema', 'true').parquet(... )`. In the future, this dataset might come from a single parquet instead of a directory of parquets with compatible schemas.
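For anyone loading this outside Spark, here is a rough plain-Python sketch of what the merged read is expected to produce: the union of the columns across sub-datasets, with nulls wherever a sub-dataset lacks a column. It is not Spark code. The helper `merge_schemas`, the `studyId` column and all row values are made up for illustration; only the three sub-dataset names and `biosampleFromSourceId` come from this issue.

```python
from __future__ import annotations

from typing import Any


def merge_schemas(
    parts: dict[str, list[dict[str, Any]]],
) -> tuple[list[str], list[dict[str, Any]]]:
    """Union the columns of all parts and fill absent columns with None."""
    # Collect column names in first-seen order across every sub-dataset.
    columns: list[str] = []
    for rows in parts.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)

    # Re-emit every row against the merged column list; absent values become None.
    merged = [
        {name: row.get(name) for name in columns}
        for rows in parts.values()
        for row in rows
    ]
    return columns, merged


# Hypothetical rows: column names and values are assumptions, not real data.
parts = {
    "eqtl_catalogue": [{"studyId": "S1", "biosampleFromSourceId": "B1"}],
    "finngen": [{"studyId": "S2"}],
    "gwas_catalog": [{"studyId": "S3"}],
}

columns, rows = merge_schemas(parts)
```

In this sketch the `finngen` and `gwas_catalog` rows end up with `biosampleFromSourceId` set to `None`, which is why the backend should treat any column missing from a sub-dataset as nullable.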