semio / ddf_utils

Utilities for working with DDF datasets
https://open-numbers.github.io/
MIT License

handling merge/split of a country #58

Open semio opened 7 years ago

semio commented 7 years ago

Problem:

see https://github.com/open-numbers/ddf--gapminder--co2_emission/issues/1

semio commented 7 years ago

current design:

merge

- procedure: merge_entity
  ingredients:
      - input_ingredient
  options:
      dictionary: merge.json
      merged: keep    # what to do with the entities to be merged
      target_column: entity_name
  result: output_ingredient

in merge.json:

{
    "new_entity_1": ["old_entity_1", "old_entity_2"],
    "new_entity_2": ["old_entity_3", "old_entity_4"]
}
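
As a rough illustration of what the merge step could do with this dictionary, here is a minimal sketch in plain Python. It is not the ddf_utils implementation. It assumes datapoints are dicts with hypothetical `year` and `value` columns, that merged values are summed (the issue does not say how they are aggregated), and that `merged` accepts a `"drop"` value besides `keep` (only `keep` appears above).

```python
from collections import defaultdict


def merge_entity(rows, dictionary, target_column="entity_name", merged="keep"):
    """Sketch only: sum the values of old entities into their new entity.

    `rows` is a list of dicts with `target_column`, "year" and "value" keys
    (hypothetical layout). `dictionary` has the shape of merge.json.
    """
    # Invert {"new": ["old_a", "old_b"]} into {"old_a": "new", "old_b": "new"}.
    old_to_new = {old: new for new, olds in dictionary.items() for old in olds}

    totals = defaultdict(float)  # (new_entity, year) -> summed value
    result = []
    for row in rows:
        entity = row[target_column]
        if entity in old_to_new:
            totals[(old_to_new[entity], row["year"])] += row["value"]
            if merged == "keep":  # assumption: any other value drops the old rows
                result.append(dict(row))
        else:
            result.append(dict(row))

    # Append one row per (new entity, year), in a stable order.
    for (entity, year), value in sorted(totals.items()):
        result.append({target_column: entity, "year": year, "value": value})
    return result
```

With `merged="keep"` the rows for `old_entity_1` and `old_entity_2` stay in the output next to the new `new_entity_1` rows; with the assumed `"drop"` value only the merged rows remain.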

split

- procedure: split_entity
  ingredients:
      - input_ingredient
  options:
      dictionary: split.json
      splitted: keep   # what to do with the entities to be split
      target_column: entity_name
  result: input_ingredient

in split.json:

{
    "entity_to_split_1":  ["sub_entity_1", "sub_entity_2"],
    "entity_to_split_2":  ["sub_entity_3", "sub_entity_4"]
}

This assumes sub_entity_1 to sub_entity_4 exist in the dataset. The split ratio will be calculated from the first valid values of the sub entities.
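
To make the ratio rule concrete, here is a minimal sketch in plain Python, not the actual ddf_utils code. It assumes datapoints are dicts with hypothetical `year` and `value` columns, that "first valid value" means the earliest non-null value of each sub entity, and that existing sub entity datapoints are never overwritten. All three are assumptions of this sketch.

```python
def split_entity(rows, dictionary, target_column="entity_name", splitted="keep"):
    """Sketch only: distribute a parent entity's values over its sub entities.

    `rows` is a list of dicts with `target_column`, "year" and "value" keys
    (hypothetical layout). `dictionary` has the shape of split.json.
    """
    # (entity, year) pairs that already have a valid value.
    existing = {
        (r[target_column], r["year"]) for r in rows if r["value"] is not None
    }

    # Keep every row, except the split parents when `splitted` is not "keep"
    # (assumption: any other value drops them).
    result = [
        dict(r) for r in rows
        if splitted == "keep" or r[target_column] not in dictionary
    ]

    for parent, subs in dictionary.items():
        # First valid (earliest non-null) value of each sub entity.
        first = {}
        for sub in subs:
            valid = sorted(
                (r["year"], r["value"]) for r in rows
                if r[target_column] == sub and r["value"] is not None
            )
            if not valid:
                raise ValueError(f"no valid value for sub entity {sub!r}")
            first[sub] = valid[0][1]
        total = sum(first.values())

        # Split each parent datapoint by ratio first[sub] / total.
        for r in rows:
            if r[target_column] != parent or r["value"] is None:
                continue
            for sub in subs:
                if (sub, r["year"]) not in existing:
                    result.append({
                        target_column: sub,
                        "year": r["year"],
                        "value": r["value"] * first[sub] / total,
                    })
    return result
```

For example, if the first valid values of `sub_entity_1` and `sub_entity_2` are 30 and 10, a value of 100 for `entity_to_split_1` is split into 75 and 25.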