Open LucyGwilliamAdmin opened 2 years ago
@LucyGwilliamAdmin This is a strange issue - some of the SDMX-imported columns (in particular OBS_STATUS and UNIT_MULT) are being automatically translated into "Observation status" and "Unit multiplier", because of the DSD translation input. And since 1.1.1 already has those exact columns in the CSV, it is ending up with duplicate columns, which is causing the Python error later in the code.
One option would be to automatically rename any duplicate columns by appending numbers. So for example the duplicate "Observation status" would become "Observation status1". Admittedly that's strange, but at least that would allow the build to complete. Should I try that? Or do you have any other ideas?
I think actually I would like them to be the same since in some cases I'm thinking the only thing that would be different is the reporting type (I.e. if trying to compare national and global data)
Could the data be concatenated to make a single df?
@LucyGwilliamAdmin It actually is a single dataframe, but it has duplicate columns. All the options I can think of are:
@brockfanning oh right OK, why does it duplicate the columns though? Can't the rows be appended and where the column name is the same, the values just go in them?
@brockfanning but either way, I think 3 makes the most sense as I imagine that gets the intended result
@LucyGwilliamAdmin It's duplicating the columns because of translation. For example the dataframe is full and complete with both a "UNIT_MULT" column and a "Unit multiplier" column, and then some code translates the column names. The "Unit multiplier" column is not affected but the "UNIT_MULT" column gets translated into "Unit multiplier". This is happening because the SDMX DSD is being used as a translation input.
The third option might get the intended result, but is inherently arbitrary. The second option is also arbitrary in same way: there is not a transparently predictable way for the code to decide which column is the "correct" one and which others are duplicates. The way it would probably work in practice is: "the first one is the correct one and any others after that are duplicates").
Just to play devil's advocate - it is definitely needed to have "Unit multiplier" and "Observation status" in the CSV file?
@brockfanning Oh, right I see, so they get translated afterwards - I see
So my concern isn't really with Unit multiplier and Observation status but more with columns like Series and Unit measure
@brockfanning So thinking about what programming languages do generally, 1 is probably the way to go, looks like r and python both use .N e.g. if there's already a column called a, the next one would be could a.1, then a.2, etc.
Just not sure how this solves my problem but maybe that's a separate issue
@brockfanning or is possible for the translation to be done first so when the data is appended, it works out as just the one column, or would this just lead to same issue as 3?
@LucyGwilliamAdmin My gut reaction is that that would be a big architectural change, potentially possible but also potentially impossible. It would definitely take some more consideration.
I see what you're saying about Series and Units. Personally I need to understand the use-cases better, I think, to say which approach makes sense. If we need a quick fix right now, my opinion is that option 1 is the only way to (at least) allow the build to complete without having any unexpected consequences.
@brockfanning I agree 1 seems sensible for now!
@LucyGwilliamAdmin One idea that struck me for a possible future fix would be a custom translation that overrides the DSD translation, like:
UNIT_MULT: Unit multiplier (global)
When I try to input from SDMX and CSV together, this works fine as long as
drop_singleton_dimensions
is true. Similarly if I only have SDMX input anddrop_singleton_dimensions
is false, this also works.So the specific case when an the issue occurs is two data inputs (SDMX and CSV) and
drop_singleton_dimensions
is false.Error: https://github.com/uk-sdmx-api-platform/data/runs/7039043267?check_suite_focus=true#step:5:1252 - the data is being printed and there's 2 dfs for 1.1.1 so not sure if this could be something to do with it, since the error refers to a duplicate axis.
Data config file: https://github.com/uk-sdmx-api-platform/data/blob/develop/config_data.yml#L42-L50