Open joaquinvanschoren opened 3 years ago
I edited your post to use code-snippets, as otherwise the backslashes are not visible which makes the report very confusing :)
I think using the features in their un-escaped form makes more sense as that is their original name. What would the advantage of using their escaped versions be? It looks like the escapes are not present in the features.xml, so we will have to investigate where they are introduced (I suspect the xml reader).
OK, I assumed that pandas did this when creating the dataframe, and that would be hard to work around. If the escaping happens elsewhere and we can avoid it, that's even better!
My bad, I misread your explanation. It looks like pandas has the escaped feature names because they are already escaped in the ARFF header. This means it's not an openml-python issue, but rather a data/server issue. There's a mismatch between the ARFF header and features.xml on the server. It's possible this is caused by the dataset processing. I'd recommend you move the issue to either repository (based on your judgement).
Description
Column names that contain '%' are escaped (as '\%') in the pandas dataframe, but not in the list of feature names. This causes errors whenever you first get the list of features and then try to look some of them up in the dataframe. I'm not sure what the best action is. One possible solution is to always escape them, including in the feature list created by get_dataset.
I only found this to happen on one dataset so far, but there may be more. It's not hard to work around this, but it breaks automated tests.
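A minimal sketch of one such workaround, assuming the only difference is a backslash inserted before each '%'. The helper names `unescape_feature_name` and `unescape_columns` are hypothetical and not part of openml-python or pandas:

```python
def unescape_feature_name(name: str) -> str:
    """Remove the backslash that precedes '%' in an escaped column name."""
    return name.replace("\\%", "%")


def unescape_columns(columns):
    """Return the column names with every escaped '%' restored."""
    return [unescape_feature_name(c) for c in columns]


# Escaped names as they appear in this report.
escaped = ["bw\\%2Fme", "blue\\%2Fbright\\%2Fvarn\\%2Fclean"]
unescaped = unescape_columns(escaped)
print(unescaped)
```

With a dataframe `df`, the same helper could be applied as `df.columns = unescape_columns(df.columns)`, so that lookups by the un-escaped feature names succeed.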
Steps/Code to Reproduce
Example:
Expected Results
Identical lists of feature names
Actual Results
Some feature names are different, e.g.
bw\%2Fme, blue\%2Fbright\%2Fvarn\%2Fclean
versus
bw%2Fme, blue%2Fbright%2Fvarn%2Fclean
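To find which names are affected, the two lists can be compared position by position. This is only a sketch: `diff_feature_names` is a hypothetical helper, and it assumes both lists are in the same order.

```python
def diff_feature_names(feature_names, column_names):
    """Return (index, feature_name, column_name, escape_only) for each mismatch.

    escape_only is True when removing the backslash before '%' in the
    column name makes it equal to the feature name.
    Note: zip stops at the shorter list, so extra trailing names are ignored.
    """
    mismatches = []
    for i, (feat, col) in enumerate(zip(feature_names, column_names)):
        if feat != col:
            escape_only = col.replace("\\%", "%") == feat
            mismatches.append((i, feat, col, escape_only))
    return mismatches


# The two pairs of names from this report.
feature_names = ["bw%2Fme", "blue%2Fbright%2Fvarn%2Fclean"]
column_names = ["bw\\%2Fme", "blue\\%2Fbright\\%2Fvarn\\%2Fclean"]

for index, feat, col, escape_only in diff_feature_names(feature_names, column_names):
    print(index, feat, col, escape_only)
```

If every mismatch has `escape_only` set, the difference is limited to the '%' escaping described above.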
Versions
Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python 3.7.11 (default, Jul 3 2021, 18:01:19) [GCC 7.5.0]
NumPy 1.19.5
SciPy 1.4.1
Scikit-Learn 0.22.2.post1
OpenML 0.12.2