opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Streamline Genetics data models #2723

Closed ireneisdoomed closed 1 year ago

ireneisdoomed commented 2 years ago

We want to review the format of all the Genetics datasets so that they follow consistent nomenclature and definitions, both when processing the data in the pipelines and when exposing it through the API and ClickHouse.

Background

We are currently in the process of a major Genetics revamp and we want to take the opportunity to review and harmonise the model for each dataset to follow the same conventions.

Schemas review

We have collected the current state of the Genetics datasets in these spreadsheets, and we will subsequently iterate over them to streamline their structure:

Data processing

Alongside the review process described above, we also want to produce each proposed iteration to see what the data looks like and to spot incompatibilities and possible improvements. The code that produces these iterations can be found in the genetics_data_revision repository.

ireneisdoomed commented 2 years ago

The first iteration of changes consists of a general renaming of the columns from snake case to camel case, to follow the same pattern as the Platform datasets.

A sample of all ETL outputs with this change applied can be found here: gs://ot-team/irene/genetics_data/1
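For illustration only, the snake case to camel case renaming could be sketched as below. This is not the code from genetics_data_revision: the helper names `snake_to_camel` and `build_rename_map` and the example column names are hypothetical, and the sketch assumes that names consist of lowercase words separated by single underscores.

```python
from typing import Dict, List


def snake_to_camel(name: str) -> str:
    """Convert a snake_case name to camelCase (hypothetical helper)."""
    head, *tail = name.split("_")
    # Keep the first word as-is and capitalise the first letter of the rest.
    return head + "".join(part[:1].upper() + part[1:] for part in tail)


def build_rename_map(columns: List[str]) -> Dict[str, str]:
    """Map each old column name to its camelCase equivalent.

    Raises ValueError if two different columns would collapse to the
    same new name, since that would silently drop data on rename.
    """
    mapping: Dict[str, str] = {}
    seen: Dict[str, str] = {}
    for old in columns:
        new = snake_to_camel(old)
        if new in seen and seen[new] != old:
            raise ValueError(f"{old!r} and {seen[new]!r} both map to {new!r}")
        seen[new] = old
        mapping[old] = new
    return mapping


# Example column names are assumptions, not taken from the actual schemas.
example_columns = ["study_id", "lead_chrom", "pval_mantissa"]
rename_map = build_rename_map(example_columns)
for old_name, new_name in rename_map.items():
    print(f"{old_name} -> {new_name}")
```

If the datasets are handled as Spark DataFrames, a mapping like this could be applied with `df.toDF(*[rename_map[c] for c in df.columns])`. Nested struct fields would need separate handling, which this sketch does not cover.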

ireneisdoomed commented 2 years ago

The second iteration of changes consists of specific transformations that are applied to each dataset following the comments from @d0choa in the to_change_v1 sheet of the spreadsheets mentioned above.

These basically consist of:

A sample of all ETL outputs with these changes applied can be found here: gs://ot-team/irene/genetics_data/2. Each dataset is a subset of 500 records.

d0choa commented 1 year ago

No more actions needed. This has been super helpful so far.