Open PGijsbers opened 2 years ago
After looking more into this issue, I think this is not a problem anymore:
import openml
dataset = openml.datasets.get_dataset(4535)
assert dataset.features[41].nominal_values == list(dataset.get_data()[0]["V42"].unique())
We figured it out!
This bug has just been fixed by the latest release of xmltodict (https://github.com/martinblech/xmltodict/releases/tag/v0.14.1). On a side note, this change was merged two years ago but not released until now (https://github.com/martinblech/xmltodict/pull/267).
However, this might not stay fixed. I will monitor the new related issue (https://github.com/martinblech/xmltodict/issues/361) and adapt if needed.
Still, I suggest closing all related issues (#1125, https://github.com/openml/automlbenchmark/issues/350#issuecomment-1004811045) and PRs (https://github.com/openml/openml-python/pull/1363, https://github.com/openml/openml-python/pull/1136).
@LennartPurucker you were right to assume that this wouldn't stay "fixed" for too long. I suggest that you explicitly set strip_whitespace=False in your xmltodict.parse calls when whitespace needs to be preserved.
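For readers following along, a minimal sketch of that suggestion is below. The XML snippet is a simplified, hypothetical feature description (not the real OpenML schema); the only library call used is xmltodict.parse with its strip_whitespace keyword. Passing the flag explicitly means the result does not depend on how a given xmltodict release treats whitespace by default.

```python
import xmltodict

# Hypothetical, simplified feature description; the real OpenML XML differs.
xml = (
    "<feature>"
    "<name>V42</name>"
    "<nominal_value> 50000.+</nominal_value>"
    "<nominal_value> - 50000.</nominal_value>"
    "</feature>"
)

# Explicitly disable whitespace stripping so leading spaces in the
# category names survive parsing.
parsed = xmltodict.parse(xml, strip_whitespace=False)
values = parsed["feature"]["nominal_value"]
print(values)
```

Note that with strip_whitespace=False, whitespace between elements in pretty-printed XML is also kept, so callers may need to handle that separately.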
Will do @martinblech, thank you!
We use xmltodict.parse with the default strip_whitespace=True, which can lead to a scenario where the features have categories that don't match the ARFF file categories (e.g., '50000.+' in features and ' 50000.+' in the data). In principle it's an easy fix, but we should take care to test that we don't break anything, and I'd propose checking whether this "bug" can also lead to issues in reading other XML files. For more info, see https://github.com/openml/automlbenchmark/issues/350#issuecomment-1004811045
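As a rough illustration of the mismatch described above, the sketch below flags categories that agree only after stripping whitespace. The helper name whitespace_mismatches and the sample values are hypothetical and not part of openml-python; a check like this could serve as a starting point for a regression test.

```python
def whitespace_mismatches(declared, observed):
    """Return (declared, observed) pairs that are equal only after stripping."""
    declared_set = set(declared)
    # Map each stripped declared value back to its original form.
    by_stripped = {}
    for value in declared:
        by_stripped.setdefault(value.strip(), value)

    pairs = []
    for value in observed:
        if value in declared_set:
            continue  # exact match, nothing to report
        match = by_stripped.get(value.strip())
        if match is not None:
            pairs.append((match, value))
    return pairs


# Categories as parsed from the feature XML (whitespace stripped) ...
declared = ["50000.+", "- 50000."]
# ... versus categories as they appear in the ARFF data.
observed = [" 50000.+", " - 50000."]

mismatches = whitespace_mismatches(declared, observed)
print(mismatches)
```

An empty result would mean every observed category either matches a declared one exactly or does not correspond to any declared category at all.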