Streamline Genetics data models

opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal

https://platform.opentargets.org https://genetics.opentargets.org

Apache License 2.0

Streamline Genetics data models #2723

Closed: ireneisdoomed closed this issue 1 year ago

ireneisdoomed commented 2 years ago

We want to review the format of all the Genetics datasets so that they follow consistent naming and definitions, both when the data is processed in the pipelines and when it is exposed through the API and ClickHouse.

Background

We are currently in the process of a major Genetics revamp and we want to take the opportunity to review and harmonise the model for each dataset to follow the same conventions.

Schemas review

We have collected the current state of the Genetics datasets in the spreadsheets below, and we will iterate over them to streamline their structure:

Review of the ETL outputs schemas: https://docs.google.com/spreadsheets/d/17Rjyv6YB5iXk0831gGvUNLgTaM2_jjTmp517Xg7iWSU/edit?usp=sharing
Review of the ClickHouse schemas: https://docs.google.com/spreadsheets/d/1btfAb42NsKBumDr7YCubo7xA3TA7dvta2UBG1aVIgLU/edit?usp=sharing

Data processing

Alongside the review process described above, we also want to generate the data for each proposed iteration, to see what it looks like and to spot incompatibilities and possible improvements. The code that produces these iterations can be found in the genetics_data_revision repository.

ireneisdoomed commented 2 years ago

The first iteration of changes is a general renaming of the columns from snake case to camel case, to follow the same pattern as the Platform datasets.

A sample of all ETL outputs with this change applied can be found here: gs://ot-team/irene/genetics_data/1
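
To illustrate the renaming, here is a minimal sketch in plain Python. It is not the code from genetics_data_revision; the helper names and the example column names (`study_id`, `lead_chrom`, `pub_date`) are hypothetical and only show the kind of mapping involved.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case column name to camelCase."""
    head, *tail = name.split("_")
    # Capitalise only the first character of each later part; leave the rest as-is.
    return head + "".join(part[:1].upper() + part[1:] for part in tail)


def rename_columns(record: dict) -> dict:
    """Return a copy of the record with every key converted to camelCase."""
    return {snake_to_camel(key): value for key, value in record.items()}


# Hypothetical record; the real datasets have different columns.
row = {"study_id": "STUDY_1", "lead_chrom": "1", "pub_date": "2020-01-01"}
renamed = rename_columns(row)
print(renamed)
```

In a Spark-based pipeline the same mapping would presumably be applied to the DataFrame column names rather than to individual records.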

ireneisdoomed commented 2 years ago

The second iteration of changes consists of specific transformations that are applied to each dataset following the comments from @d0choa in the to_change_v1 sheet of the spreadsheets mentioned above.

These basically consist of:

A revision of redundant information/metadata in some datasets that can instead be recovered by joining independent datasets on primary keys such as the study or variant ID. For example, the d2v2g dataset contains the publication date of the referenced study. This information is not needed for the d2v2g analysis itself.
The creation of a variant ID that facilitates the analysis, instead of repeatedly (and inconsistently) using four columns (chrom, pos, ref, alt) to describe a variant.

A sample of all ETL outputs with these changes applied can be found here: gs://ot-team/irene/genetics_data/2. Each dataset is a subset of 500 records.
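
The two transformations can be sketched as follows. This is only an illustration under stated assumptions, not the actual pipeline code: the underscore-joined `chrom_pos_ref_alt` ID format, the column names (`variantId`, `pubDate`), and the toy records are all assumed for the example.

```python
def make_variant_id(chrom, pos, ref, alt) -> str:
    """Build a single variant ID from the four describing columns.

    The underscore-joined format is an assumption made for this sketch.
    """
    return f"{chrom}_{pos}_{ref}_{alt}"


def streamline(row: dict) -> dict:
    """Drop redundant study metadata and collapse the four variant columns."""
    dropped = {"chrom", "pos", "ref", "alt", "pubDate"}
    out = {key: value for key, value in row.items() if key not in dropped}
    out["variantId"] = make_variant_id(row["chrom"], row["pos"], row["ref"], row["alt"])
    return out


# Hypothetical study index keyed by study ID (the primary key).
studies = {"STUDY_1": {"studyId": "STUDY_1", "pubDate": "2020-01-01"}}

# Hypothetical d2v2g-like record carrying a redundant publication date.
d2v2g = [
    {"studyId": "STUDY_1", "chrom": "1", "pos": 1000, "ref": "A", "alt": "G",
     "pubDate": "2020-01-01"},
]

slim = [streamline(row) for row in d2v2g]

# The publication date is still available on demand through a join on studyId.
joined = [{**row, "pubDate": studies[row["studyId"]]["pubDate"]} for row in slim]
print(slim, joined)
```

Keeping the study metadata only in the study dataset means it is stored once and cannot drift out of sync between datasets.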

d0choa commented 1 year ago

No more actions needed. This has been super helpful so far.