maelstrom-research / Rmonize

need to start processing data from empty or missing material #56

Closed GuiFabre closed 2 months ago

GuiFabre commented 3 months ago

General Paradigm A: A blank harmonization can be performed from any list of datasets.

Statement 1: Since any tibble is a dataset, a single tibble must be processable. Minimal processing means that one tibble, whatever its column names, length, and types, must be able to be put into a common format. To avoid losing potential information, an index of each row can always be created. The harmonization process must then be equivalent to bind_rows when the datasets are kept in a list: any columns in common can be stacked, and the remaining columns can be placed one after the other.
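To make the stacking behaviour concrete, here is a minimal sketch in plain Python (not R, and not Rmonize's API). It assumes a dataset is a list of row dictionaries; the helper names `add_row_index` and `bind_rows`, the `index` and `dataset` column names, and the example values are all illustrative assumptions.

```python
def add_row_index(rows, index_name="index"):
    """Return copies of the rows with a 1-based row index as the first column."""
    return [{index_name: i, **row} for i, row in enumerate(rows, start=1)]


def bind_rows(datasets, id_name="dataset"):
    """Stack named datasets; columns missing from a dataset are filled with None."""
    # Collect the union of column names, in order of first appearance.
    columns = [id_name]
    for rows in datasets.values():
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)

    stacked = []
    for name, rows in datasets.items():
        for row in rows:
            full = {col: None for col in columns}  # common format for every row
            full.update(row)                       # values carried over unchanged
            full[id_name] = name                   # remember the source dataset
            stacked.append(full)
    return columns, stacked


# Made-up example values, for illustration only.
datasets = {
    "dataset.1": add_row_index([{"age": 34, "sex": "F"}, {"age": 51, "sex": "M"}]),
    "dataset.2": add_row_index([{"age": 47, "height_cm": 172}]),
}
columns, stacked = bind_rows(datasets)
```

Here `age` is stacked because both datasets have it, while `sex` and `height_cm` simply sit side by side and are empty wherever the source dataset did not provide them.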

Statement 2: Within a single tibble, modifications made to a dataset are not traceable. If an output dataset is generated from the same input dataset, we can always (at least) say that each column name is the same and that the values are carried over from the input to the output. This is equivalent to a direct mapping. Since two tibbles can be put into a common format, if a column exists in tibble 1 but not in tibble 2, this is equivalent to "direct_mapping" in tibble 1 and "impossible" in tibble 2.

Statement 3: The name of each output dataset can be the same as its input or, at least, "dataset.1", "dataset.2", ... The columns used in tibbles 1 and 2 can be reported as direct_mapping() and impossible(). The index created (if id_col = NULL) can be used as an ID column for each dataset. The minimum information necessary to create the data processing elements is then present.
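The derivation in Statements 2 and 3 can be sketched as follows, again in plain Python rather than through Rmonize itself. The function name `blank_processing_elements` and the field names of each element are assumptions made for illustration; only the rule labels "direct_mapping" and "impossible" come from the discussion above.

```python
def blank_processing_elements(dataset_columns):
    """Build one element per (dataset, column) pair.

    dataset_columns maps a dataset name to the list of its column names.
    """
    # Union of all column names, in order of first appearance.
    all_columns = []
    for cols in dataset_columns.values():
        for col in cols:
            if col not in all_columns:
                all_columns.append(col)

    elements = []
    for name, cols in dataset_columns.items():
        for col in all_columns:
            present = col in cols
            elements.append({
                "dataschema_variable": col,                 # illustrative field name
                "input_dataset": name,                      # illustrative field name
                "input_variables": col if present else None,
                "rule": "direct_mapping" if present else "impossible",
            })
    return elements


elements = blank_processing_elements({
    "dataset.1": ["index", "age", "sex"],
    "dataset.2": ["index", "age"],
})
rules = {(e["input_dataset"], e["dataschema_variable"]): e["rule"] for e in elements}
```

In this example `sex` exists only in "dataset.1", so it is a direct mapping there and impossible in "dataset.2".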

Statement 4: A data dictionary can be extracted from any dataset. Therefore, from any dataset put into the common format, a data dictionary in that common format can be extracted. These are called "harmonized data dictionaries".
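As a rough illustration of "a data dictionary can be extracted from any dataset", the sketch below lists each column with the value types observed in it. It is plain Python with a hypothetical helper `extract_data_dictionary` and made-up rows, and the entry layout is an assumption, not Rmonize's data dictionary format.

```python
def extract_data_dictionary(rows):
    """Return one entry per column: its name and the value types observed in it."""
    observed = {}
    for row in rows:
        for col, value in row.items():
            types = observed.setdefault(col, set())
            if value is not None:          # missing values carry no type information
                types.add(type(value).__name__)
    return [{"name": col, "value_types": sorted(types)} for col, types in observed.items()]


# Made-up rows already in a common format, for illustration only.
rows = [
    {"index": 1, "age": 34, "sex": "F"},
    {"index": 2, "age": 51, "sex": None},
]
dictionary = extract_data_dictionary(rows)
```

Applying the same extraction to every dataset after it has been put into the common format yields dictionaries with the same set of variables.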

So from any harmonized dataset, a common data schema can be created, which is a DataSchema.

GuiFabre commented 3 months ago

General Paradigm B: A blank harmonization can be performed from any DataSchema.

GuiFabre commented 3 months ago

General Paradigm C: A blank harmonization can be performed from any set of data processing elements.

a-trottier commented 2 months ago

It works as intended. We should discuss further improvements to the debug option (in theory it should test the inputs entered). Good to go for this update; it already gives some interesting flexibility.

GuiFabre commented 2 months ago

Great! Love this new feature.