Whitespace stripped from XML files can lead to congruence

openml / openml-python

OpenML's Python API for a World of Data and More 💫

http://openml.github.io/openml-python/

Other

279 stars 143 forks source link

Whitespace stripped from XML files can lead to congruence #1125

Open PGijsbers opened 2 years ago

PGijsbers commented 2 years ago

We use xmltodict.parse with the default strip_whitespace=True which can lead to a scenario where the features have categories that don't match the ARFF file categories (e.g., '50000.+' in features and ' 50000.+' in the data). In principle it's an easy fix, but we should take care to test we don't break anything, and I'd propose to check if this "bug" can also lead to issues in reading other XML files.

For more info, see https://github.com/openml/automlbenchmark/issues/350#issuecomment-1004811045

LennartPurucker commented 1 day ago

After looking more into this issue, I think this is not a problem anymore:

import openml

dataset = openml.datasets.get_dataset(4535)
assert dataset.features[41].nominal_values == list(dataset.get_data()[0]["V42"].unique())

LennartPurucker commented 1 day ago

We figured it out!

This bug has just been fixed by the latest release of xmltodict (https://github.com/martinblech/xmltodict/releases/tag/v0.14.1). On a side note, this change was merged two years ago but not released until now (https://github.com/martinblech/xmltodict/pull/267).

However, this might not stay fixed. I will monitor the new related issue (https://github.com/martinblech/xmltodict/issues/361) and then might adapt if needed.

However, I suggest to close all related issues (#1125, https://github.com/openml/automlbenchmark/issues/350#issuecomment-1004811045) and PRs (https://github.com/openml/openml-python/pull/1363, https://github.com/openml/openml-python/pull/1136).

martinblech commented 22 hours ago

@LennartPurucker you were right to assumed that this wouldn't stay "fixed" for too long. I suggest that you explicitly set strip_whitespace=False in your xmltodict.parse calls when whitespace needs to be preserved.

LennartPurucker commented 21 hours ago

Will do @martinblech, thank you!