open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License
5 stars 23 forks source link

Issue with multiple data inputs #315

Open LucyGwilliamAdmin opened 2 years ago

LucyGwilliamAdmin commented 2 years ago

When I try to input from SDMX and CSV together, this works fine as long as drop_singleton_dimensions is true. Similarly if I only have SDMX input and drop_singleton_dimensions is false, this also works.

So the specific case when an the issue occurs is two data inputs (SDMX and CSV) and drop_singleton_dimensions is false.

Error: https://github.com/uk-sdmx-api-platform/data/runs/7039043267?check_suite_focus=true#step:5:1252 - the data is being printed and there's 2 dfs for 1.1.1 so not sure if this could be something to do with it, since the error refers to a duplicate axis.

Data config file: https://github.com/uk-sdmx-api-platform/data/blob/develop/config_data.yml#L42-L50

brockfanning commented 2 years ago

@LucyGwilliamAdmin This is a strange issue - some of the SDMX-imported columns (in particular OBS_STATUS and UNIT_MULT) are being automatically translated into "Observation status" and "Unit multiplier", because of the DSD translation input. And since 1.1.1 already has those exact columns in the CSV, it is ending up with duplicate columns, which is causing the Python error later in the code.

One option would be to automatically rename any duplicate columns by appending numbers. So for example the duplicate "Observation status" would become "Observation status1". Admittedly that's strange, but at least that would allow the build to complete. Should I try that? Or do you have any other ideas?

LucyGwilliamAdmin commented 2 years ago

I think actually I would like them to be the same since in some cases I'm thinking the only thing that would be different is the reporting type (I.e. if trying to compare national and global data)

Could the data be concatenated to make a single df?

brockfanning commented 2 years ago

@LucyGwilliamAdmin It actually is a single dataframe, but it has duplicate columns. All the options I can think of are:

  1. Renaming the duplicate columns (like described above)
  2. Dropping the duplicate columns (arbitrarily picking one to keep)
  3. Picking one column (arbitrarily) and copying the values from the duplicates onto it. (Eg, copying everything from the second "Unit multiplier" onto the first "Unit multiplier") and then dropping the duplicates.
LucyGwilliamAdmin commented 2 years ago

@brockfanning oh right OK, why does it duplicate the columns though? Can't the rows be appended and where the column name is the same, the values just go in them?

LucyGwilliamAdmin commented 2 years ago

@brockfanning but either way, I think 3 makes the most sense as I imagine that gets the intended result

brockfanning commented 2 years ago

@LucyGwilliamAdmin It's duplicating the columns because of translation. For example the dataframe is full and complete with both a "UNIT_MULT" column and a "Unit multiplier" column, and then some code translates the column names. The "Unit multiplier" column is not affected but the "UNIT_MULT" column gets translated into "Unit multiplier". This is happening because the SDMX DSD is being used as a translation input.

The third option might get the intended result, but is inherently arbitrary. The second option is also arbitrary in same way: there is not a transparently predictable way for the code to decide which column is the "correct" one and which others are duplicates. The way it would probably work in practice is: "the first one is the correct one and any others after that are duplicates").

Just to play devil's advocate - it is definitely needed to have "Unit multiplier" and "Observation status" in the CSV file?

LucyGwilliamAdmin commented 2 years ago

@brockfanning Oh, right I see, so they get translated afterwards - I see

So my concern isn't really with Unit multiplier and Observation status but more with columns like Series and Unit measure

LucyGwilliamAdmin commented 2 years ago

@brockfanning So thinking about what programming languages do generally, 1 is probably the way to go, looks like r and python both use .N e.g. if there's already a column called a, the next one would be could a.1, then a.2, etc.

Just not sure how this solves my problem but maybe that's a separate issue

LucyGwilliamAdmin commented 2 years ago

@brockfanning or is possible for the translation to be done first so when the data is appended, it works out as just the one column, or would this just lead to same issue as 3?

brockfanning commented 2 years ago

@LucyGwilliamAdmin My gut reaction is that that would be a big architectural change, potentially possible but also potentially impossible. It would definitely take some more consideration.

I see what you're saying about Series and Units. Personally I need to understand the use-cases better, I think, to say which approach makes sense. If we need a quick fix right now, my opinion is that option 1 is the only way to (at least) allow the build to complete without having any unexpected consequences.

LucyGwilliamAdmin commented 2 years ago

@brockfanning I agree 1 seems sensible for now!

brockfanning commented 2 years ago

@LucyGwilliamAdmin One idea that struck me for a possible future fix would be a custom translation that overrides the DSD translation, like:

UNIT_MULT: Unit multiplier (global)