Closed by ireneisdoomed 1 year ago
The first iteration of changes consists of a general renaming of the column names from snake case to camel case to follow the same patterns as in the Platform datasets.
A sample of all ETL outputs with such change can be found here: gs://ot-team/irene/genetics_data/1
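As a rough illustration of the renaming described above, the snake case to camel case conversion could be sketched as follows. This is a minimal sketch, not the code from the repository: the helper names `snake_to_camel` and `rename_columns` and the example column names are hypothetical, and names with leading underscores are not handled.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case column name to camelCase (e.g. study_id -> studyId)."""
    head, *tail = name.split("_")
    # Capitalise only the first character of each later part; keep the rest unchanged.
    return head + "".join(part[:1].upper() + part[1:] for part in tail)


def rename_columns(columns):
    """Build an old-name -> new-name mapping for a list of column names."""
    return {col: snake_to_camel(col) for col in columns}


# Hypothetical column names, used only for illustration.
mapping = rename_columns(["study_id", "pval_mantissa", "beta"])
print(mapping)
```

Names that are already a single word, such as `beta`, pass through unchanged, so the mapping can be applied to every column of a dataset without special-casing.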
The second iteration of changes consists of specific transformations that are applied to each dataset following the comments from @d0choa in the to_change_v1 sheet of the spreadsheets mentioned above.
These basically consist of:
d2v2g: the dataset contains the publication date of the referred study ID. This piece of information is not necessary for the purpose of the d2v2g analysis.
A sample of all ETL outputs with this change can be found here: gs://ot-team/irene/genetics_data/2. All datasets are a subset of 500 records.
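For illustration only, removing an unneeded field and keeping a 500-record subset could look like the sketch below. It rests on several assumptions: that the intended change for d2v2g is to drop the publication date, that the field is called `pubDate` (a hypothetical name), and that the subset is simply the first 500 records. The thread does not say how the samples were actually drawn.

```python
from itertools import islice


def drop_fields(records, fields):
    """Yield copies of each record without the unwanted fields."""
    unwanted = set(fields)
    for record in records:
        yield {key: value for key, value in record.items() if key not in unwanted}


def sample_head(records, n=500):
    """Keep the first n records (assumption: the subset is not random)."""
    return list(islice(records, n))


# Hypothetical records; "studyId" and "pubDate" are illustrative field names.
rows = [{"studyId": f"S{i}", "pubDate": "2020-01-01"} for i in range(1000)]
subset = sample_head(drop_fields(rows, ["pubDate"]))
print(len(subset), subset[0])
```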
No more actions needed. This has been super helpful so far.
We want to review the format of all the Genetics datasets so that they follow a consistent nomenclature and definition, both when processing the data in the pipelines and when exposing the data through the API and ClickHouse.
Background
We are currently in the process of a major Genetics revamp and we want to take the opportunity to review and harmonise the model for each dataset to follow the same conventions.
Schemas review
We have collected in these spreadsheets the current state of the Genetics datasets and we will consequently iterate over them to streamline their structure:
Data processing
Alongside the review process described above, we also want to produce the proposed iterations to see what the data looks like and to spot incompatibilities and possible improvements. The code to produce these iterations can be found in the genetics_data_revision repository.