Column names with '\%' are renamed

joaquinvanschoren commented 3 years ago

Description

Column names that contain '%' are escaped (as '\%') by pandas in the dataframe, but not in the list of feature names. That causes errors whenever you first get the list of features and then try to look some of them up in the dataframe. Not sure what the best action is. One possible solution is to always escape them, also in the feature list created by get_dataset.

I only found this to happen on one dataset so far, but there may be more. It's not hard to work around this, but it breaks automated tests.
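
A minimal sketch of the kind of workaround meant here, assuming the only difference between the two name lists is the backslash in front of each '%'. The helper name `unescape_percent` is made up for illustration and is not part of openml-python; the two column names are the ones reported for dataset 70 below.

```python
def unescape_percent(name: str) -> str:
    """Turn the escaped sequence '\\%' back into a plain '%'."""
    return name.replace("\\%", "%")


# Column names as they appear in the dataframe for dataset 70 (from this report).
escaped_columns = ["bw\\%2Fme", "blue\\%2Fbright\\%2Fvarn\\%2Fclean"]

# Restore the un-escaped names so they match the feature list.
restored = [unescape_percent(c) for c in escaped_columns]
print(restored)
```

On a real dataframe the same function can be passed to `df.rename(columns=unescape_percent)`, which returns a copy with the renamed columns.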

Steps/Code to Reproduce

Example:

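import openml

# Dataset 70 has column names that contain '%'.
d = openml.datasets.get_dataset(70)
df, *_ = d.get_data(dataset_format="dataframe", include_row_id=True, include_ignore_attribute=True)
# Column names as they appear in the dataframe.
print(df.columns)
# Feature names as listed in the dataset metadata.
print([f.name for f in d.features.values()])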

Expected Results

Identical lists of feature names

Actual Results

Some names differ between the two lists: the dataframe columns contain e.g. bw\%2Fme and blue\%2Fbright\%2Fvarn\%2Fclean, whereas the feature list contains bw%2Fme and blue%2Fbright%2Fvarn%2Fclean.

Versions

Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python 3.7.11 (default, Jul 3 2021, 18:01:19) [GCC 7.5.0]
NumPy 1.19.5
SciPy 1.4.1
Scikit-Learn 0.22.2.post1
OpenML 0.12.2

PGijsbers commented 3 years ago

I edited your post to use code-snippets, as otherwise the backslashes are not visible which makes the report very confusing :)

I think using the features in their un-escaped form makes more sense as that is their original name. What would the advantage of using their escaped versions be? It looks like the escapes are not present in the features.xml, so we will have to investigate where they are introduced (I suspect the xml reader).

joaquinvanschoren commented 3 years ago

OK, I assumed that pandas did this when creating the dataframe, and that would be hard to work around. If the escaping happens elsewhere and we can avoid it, that's even better!

PGijsbers commented 3 years ago

My bad, I misread your explanation. It looks like pandas has the escaped feature names because they are already escaped in the ARFF header. This means it's not an openml-python issue, but rather a data/server issue: there's a mismatch between the ARFF header and features.xml on the server. It's possible this is caused by the dataset processing. I'd recommend you move the issue to either repository (based on your judgement).
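
To check for this kind of mismatch without going through pandas, one could compare the attribute names in the ARFF header directly against the feature names. The following is only a rough sketch: the header text is a made-up example rather than the actual file for dataset 70, the function names are hypothetical, and the regular expression handles plain and simply quoted attribute names only (no escaped quotes inside a name).

```python
import re

# Matches '@attribute <name>' where the name is single-quoted, double-quoted, or bare.
ATTRIBUTE = re.compile(r"""^@attribute\s+(?:'([^']*)'|"([^"]*)"|(\S+))""", re.IGNORECASE)


def arff_attribute_names(header: str) -> list:
    """Collect attribute names from the @attribute lines of an ARFF header."""
    names = []
    for line in header.splitlines():
        match = ATTRIBUTE.match(line.strip())
        if match:
            # Exactly one of the three groups is set, depending on the quoting style.
            names.append(next(g for g in match.groups() if g is not None))
    return names


def mismatched_names(arff_names, feature_names):
    """Return (arff_name, feature_name) pairs that differ, compared position by position."""
    return [(a, f) for a, f in zip(arff_names, feature_names) if a != f]


# Hypothetical header fragment, written to mimic the escaping described in this issue.
example_header = r"""
@relation example
@attribute 'bw\%2Fme' numeric
@attribute plain numeric
@data
"""

# Hypothetical feature names as they would come from the feature list.
example_features = ["bw%2Fme", "plain"]

print(mismatched_names(arff_attribute_names(example_header), example_features))
```

The comparison is positional, so it assumes both lists describe the same columns in the same order.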

openml / OpenML