owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

`tb.format()` sometimes can't find the column when it exists #2874

Closed spoonerf closed 4 days ago

spoonerf commented 6 days ago

Very minor issue, but I've encountered it a few times now so thought it was worth reporting.

Sometimes when using `tb.format(["column_x"])`, it can't find `column_x`, so I have to use `tb.set_index("column_x").sort_index()` instead, which works fine.

An example of this issue can be found here

The error shown is:

KeyError                                  Traceback (most recent call last)
<ipython-input-18-b5b028e838ce> in ?()
      6 #
      7 # Process data.
      8 #
      9 # Ensure all columns are snake-case, set an appropriate index, and sort conveniently.
---> 10 tb = tb.format(["Country or Area"])

~/Documents/OWID/repos/etl/lib/catalog/owid/catalog/tables.py in ?(self, keys, verify_integrity, underscore, sort_rows, sort_columns, short_name, **kwargs)
    755             t = t.underscore(**kwargs)
    756         # Set index
    757         if keys is None:
    758             keys = ["country", "year"]
--> 759         t = t.set_index(keys, verify_integrity=verify_integrity)
    760         if sort_columns:
    761             t = t.sort_index(axis=1)
    762         # Sort rows

~/Documents/OWID/repos/etl/lib/catalog/owid/catalog/tables.py in ?(self, keys, **kwargs)
    564             super().set_index(keys, **kwargs)
    565             self.metadata.primary_key = keys
    566             return None
    567         else:
--> 568             t = super().set_index(keys, **kwargs)
...
   6110 
   6111         if inplace:
   6112             frame = self

KeyError: "None of ['Country or Area'] are in the columns"

lucasrodes commented 6 days ago

This happens because the first thing `format` does is underscore the column names (this also replaces some symbols; e.g. I believe ' ' gets converted into an underscore). The index is only set afterwards, so the keys must match the underscored names.

So in your example, you need to use `tb.format(["country_or_area"])`. Let me know if it works!
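To make the order of operations concrete, here is a minimal pure-Python sketch of the behaviour described above. It does not use `owid.catalog`: `underscore_name` and `format_sketch` are hypothetical stand-ins, and the real underscoring rules in the library are likely more elaborate than this regex.

```python
import re


def underscore_name(name: str) -> str:
    """Hypothetical, simplified stand-in for the library's underscoring."""
    # Lowercase, then collapse every run of non-alphanumeric characters into "_".
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def format_sketch(columns, keys):
    """Mimic the order of operations: rename columns first, then look up keys."""
    renamed = [underscore_name(c) for c in columns]
    missing = [k for k in keys if k not in renamed]
    if missing:
        # Same kind of failure as in the traceback: the original name is gone.
        raise KeyError(f"None of {missing} are in the columns")
    return renamed


columns = ["Country or Area", "Year", "Value"]

# Works: the key uses the underscored name.
renamed_columns = format_sketch(columns, ["country_or_area"])

# Fails: the key uses the original name, which no longer exists after renaming.
try:
    format_sketch(columns, ["Country or Area"])
    failed = False
except KeyError:
    failed = True
```

Under these assumptions, the `set_index(...).sort_index()` workaround succeeds simply because it skips the renaming step, so the original column name is still present when the index is set.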

@Marigold I wonder if we should change this, so that underscoring is done last? I fear this could affect other steps though.

Marigold commented 6 days ago

I don't have an opinion. Both have pros and cons.

(If we wanted to change it, we'd create a PR, increment ETL_EPOCH, use data-diff to find problematic datasets, revert ETL_EPOCH and merge. It's a bit tedious.)

lucasrodes commented 4 days ago

I'd say let's leave `format` with its current behaviour. The error is then expected: after underscoring, the original column names are no longer valid.

@spoonerf I'm adding a better error message to help users with this error.
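For illustration only, a friendlier error could look something like the sketch below. This is not the change that was actually made in the repository: `check_keys` and `underscore_name` are hypothetical helpers, and the message wording is an assumption.

```python
import re


def underscore_name(name: str) -> str:
    """Hypothetical, simplified stand-in for the library's underscoring."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def check_keys(columns, keys):
    """Raise a KeyError that hints at the underscored name when one matches."""
    missing = [k for k in keys if k not in columns]
    if not missing:
        return
    # Only suggest a name if its underscored form really is a column.
    hints = {k: underscore_name(k) for k in missing if underscore_name(k) in columns}
    message = f"Keys {missing} are not in the columns."
    if hints:
        suggestions = ", ".join(f"{old!r} -> {new!r}" for old, new in hints.items())
        message += (
            " Column names are underscored before the index is set;"
            f" did you mean: {suggestions}?"
        )
    raise KeyError(message)


# Columns as they would look after underscoring.
underscored_columns = ["country_or_area", "year", "value"]

try:
    check_keys(underscored_columns, ["Country or Area"])
    hint_message = ""
except KeyError as error:
    hint_message = error.args[0]
```

The hint is only added when the underscored form of a missing key matches an existing column, so unrelated typos still produce the plain message.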