Open mfeurer opened 3 years ago
Since the workshop paper we have indeed done a major overhaul of data loading to rely more on OpenML (#293). Number 2 is a known issue (it's reported in CI for dresses-sales) and is indeed what led me to open #1114. I'll look into issue 1.
Thanks a lot for the quick answer. Do you by any chance know if any other dataset from the 2019 benchmark is affected by the 2nd problem?
No, but it should be easy to write a script (download each dataset, compare OpenML feature values to the ARFF header/dataframe content). If it's important to you let me know, otherwise I'd just wait for the fix from OpenML.
Do you happen to know a dataset which has problem 1 (all-null columns) but not problem 2?
> Do you happen to know a dataset which has problem 1 (all-null columns) but not problem 2?
No, sorry.
> No, but it should be easy to write a script (download each dataset, compare OpenML feature values to the ARFF header/dataframe content). If it's important to you let me know, otherwise I'd just wait for the fix from OpenML.
Great idea, I just wrote a brief snippet to do so:
import openml
import arff

openml_automl_benchmark = [
    189871, 189872, 189873, 168794, 168792, 168793, 75105, 189906, 189909, 189908, 167185, 189874, 189861, 189866,
    168797, 168796, 189860, 189862, 168798, 189865, 126026, 167104, 167083, 189905, 75127, 167200, 167184, 167201,
    168795, 126025, 75097, 167190, 126029, 167149, 167152, 167168, 167181, 75193, 167161
]
for task_id in openml_automl_benchmark:
    task = openml.tasks.get_task(task_id)
    dataset = task.get_dataset()
    try:
        with open('/home/feurerm/.openml/cache/org/openml/www/datasets/%d/dataset_train_0.arff' % dataset.id) as fh:
            arff_array = arff.load(fh)
    except Exception as e:
        print(dataset.id, e)
        continue
    attributes = {attr[0]: attr[1] for attr in arff_array['attributes']}
    for feat in dataset.features:
        feat_by_openml = dataset.features[feat].nominal_values
        try:
            feat_byarff = attributes[dataset.features[feat].name]
        except Exception as e:
            print(e, feat, dataset.features[feat].name, task_id)
            continue
        if feat_by_openml is None:
            assert feat_byarff in ('REAL', 'INTEGER'), (task_id, dataset.features[feat].name, feat_byarff)
        elif task_id == 167181 and dataset.features[feat].name == 'defects':
            continue
        else:
            assert set(feat_by_openml) == set(feat_byarff), (task_id, dataset.features[feat].name, set(feat_by_openml) ^ set(feat_byarff))
It appears that only KDD is affected by this. And I guess this only affects frameworks working on the arff files as I managed to run the TunedRandomForest without any issues?
> And I guess this only affects frameworks working on the arff files
Yes, and more specifically only those that use some internal mechanism that relies on the ARFF header in specific ways. The R-packages for instance also don't have an issue even though they use the ARFF file.
> It appears that only KDD is affected by this.
Good to know, thanks!
I made issues/350 which should fix the issue about incorrectly assigning a STRING type to empty numerical columns. However, I have been unable to verify it works (in part because of the other problem, but I'm working on circumventing that).
Edit: Managed to confirm it works, opened the PR.
Updated the title to reflect the only outstanding issue. Though we hope the fix can come from OpenML instead.
Looks like we probably found the source which has since been updated (link). The last affected datasets should be reset soon and work again. Please remember to clear your cache. I will close this issue once I can verify that AutoWEKA works after the last datasets have been fixed.
We found other issues regarding all factor variables in the census-income dataset, e.g. the header of the .arff files declared the target variable as
@ATTRIBUTE V42 {'- 50000.', 50000+.}
(the first value is quoted, the second is not).
Moreover, within @DATA, the values carry additional whitespace, e.g. ' - 50000.', so that the factor level is not declared correctly.
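The mismatch between declared levels and data values can be made concrete with a small check. This is only a sketch: the helper name `undeclared_levels` and the sample values are illustrative and do not come from the benchmark code.

```python
def undeclared_levels(declared, observed):
    """Return observed values that do not exactly match any declared level."""
    declared_set = set(declared)
    return sorted(v for v in set(observed) if v not in declared_set)


# Levels as declared in the header (leading whitespace lost) ...
header_levels = ['- 50000.', '50000+.']
# ... versus values as they appear under @DATA (leading space kept).
data_values = [' - 50000.', ' 50000+.', ' - 50000.']

# Every data value is reported, because none matches a declared level exactly.
missing = undeclared_levels(header_levels, data_values)
print(missing)
```

A strict ARFF reader that compares values exactly would reject such rows, while a reader that strips whitespace before comparing would not notice the problem.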
@PGijsbers found out that the JSON response of the OpenML features API is wrong but the actual XML file is correct so it requires a bit more debugging to see exactly what parts of the API we use and how it leads to the mismatch.
TL;DR: It's because of openml-python leaving an xmltodict option at its default value. I have been able to test the fix with mlr3automl.
I observed that the target feature is not the only feature that exhibits this behavior. It seems that values consisting of leading whitespace followed by a single token (matching r'^\s+[\w-]+$') are unquoted.
The ARFF definition specifies that values that contain spaces must be quoted. It does not specify whether all values must be quoted if only one contains a space. I think we may conclude that the header would be valid on its own, but not together with the data, which has the values with whitespace.
The original header of the OpenML arff file is valid and contains leading whitespace.
The provided features.xml is valid and contains leading whitespace (though the JSON does not).
From the AMLB perspective, the stripping of whitespace is caused at the openml-python level:
>>> census.features[41].nominal_values
['- 50000.', '50000+.']
We rely on these features (originating from the XML) to write the ARFF header.
From the openml-python perspective, stripping the whitespace is actually caused by our use of xmltodict, which reads:
import xmltodict

with open(r"~\.openml\org\openml\www\datasets\4535\features.xml", encoding='utf8') as fh:
    xml = fh.read()
census_features = xmltodict.parse(
    xml,
    force_list=("oml:feature", "oml:nominal_value")
)
print(census_features["oml:data_features"]["oml:feature"][1]["oml:nominal_value"])
# prints: ['Federal government', 'Local government', ...]
despite the values still containing their whitespace in the XML (e.g., <oml:nominal_value> Federal government</oml:nominal_value>).
This can be avoided by setting strip_whitespace=False. See also this open issue for a discussion in xmltodict.
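To check what the XML itself contains, independently of xmltodict, the nominal values can be read with the standard library, which does not strip text. The following is a sketch under assumptions: the namespace URI, the helper name `nominal_values_raw`, and the inline sample document are illustrative and stand in for a real features.xml.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI bound to the "oml" prefix; adjust it if the
# features.xml you inspect declares a different one.
OML_NS = {"oml": "http://openml.org/openml"}


def nominal_values_raw(xml_text):
    """Map feature name -> nominal values exactly as written in the XML."""
    root = ET.fromstring(xml_text)
    result = {}
    for feature in root.findall("oml:feature", OML_NS):
        name = feature.findtext("oml:name", default="", namespaces=OML_NS)
        values = [v.text or "" for v in feature.findall("oml:nominal_value", OML_NS)]
        result[name] = values
    return result


# Hypothetical, minimal stand-in for a features.xml document.
sample = (
    '<oml:data_features xmlns:oml="http://openml.org/openml">'
    "<oml:feature><oml:name>V42</oml:name>"
    "<oml:nominal_value> - 50000.</oml:nominal_value>"
    "<oml:nominal_value> 50000+.</oml:nominal_value>"
    "</oml:feature></oml:data_features>"
)

raw = nominal_values_raw(sample)
# Stripping reproduces the values that no longer match the ARFF data section.
stripped = {name: [v.strip() for v in values] for name, values in raw.items()}
print(raw["V42"])
print(stripped["V42"])
```

Comparing `raw` against the values a parser returns is a quick way to tell whether whitespace was lost in the file or during parsing.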
@joaquinvanschoren @mfeurer I think we want to update openml-python here and not strip whitespace, since it introduces incongruence with the ARFF file.
The (un)quoted behavior is caused by liac-arff:
['- 50000.', '50000+.'] produces @ATTRIBUTE V42 {'- 50000.', 50000+.}
[' - 50000.', ' 50000+.'] produces @ATTRIBUTE V42 {' - 50000.', ' 50000+.'}
though as noted before, the header is technically valid on its own (liac-arff will not check it in combination with @DATA).
With leading whitespaces restored, the header will again contain quotes and match the data exactly.
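The quoting pattern above is consistent with a per-value rule: quote a value only when it contains whitespace or another ARFF-significant character. The sketch below imitates that observed behavior. It is not liac-arff's actual implementation, and the names `encode_value` and `nominal_header` as well as the exact character set are assumptions made for illustration.

```python
import re

# Assumption: characters that force quoting. This approximates the observed
# output and is not copied from liac-arff.
_NEEDS_QUOTES = re.compile(r"[\s,{}%\\'\"]")


def encode_value(value):
    """Quote a nominal value only if it contains a special character."""
    if _NEEDS_QUOTES.search(value):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'%s'" % escaped
    return value


def nominal_header(name, values):
    """Build an @ATTRIBUTE line for a nominal feature."""
    return "@ATTRIBUTE %s {%s}" % (name, ", ".join(encode_value(v) for v in values))


# Stripped values: only the one with an inner space gets quoted.
print(nominal_header("V42", ['- 50000.', '50000+.']))
# Values with leading whitespace: both contain a space, so both are quoted.
print(nominal_header("V42", [' - 50000.', ' 50000+.']))
```

Because the decision is made per value, the header only becomes uniformly quoted once every level keeps its leading space.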
I verified that this fix allows mlr3automl to create predictions on census-income.
@sebhrusen Even if openml-python updates this, I don't really want to bump the dependency of openml-python mid-experiments.
We can monkey patch this in stable-v2 either at framework level, or directly in amlb/datasets/openml.py:
import functools
import xmltodict
xmltodict.parse = functools.partial(xmltodict.parse, strip_whitespace=False)
I'd prefer the global patch probably, even though no other framework seemed to have an issue (yet).
Then revert the change when we bump to a version of openml-python with the fix (should they make the change).
@PGijsbers temp monkey patch looks like a good approach
I'd be happy to have an update to openml-python. Would you like to create one or open an issue there?
I opened an issue (https://github.com/openml/openml-python/issues/1125). We currently have the monkey patch in the benchmark, so it's not urgent for this application at least. While the proposed fix is quick and easy, I do think we should carefully evaluate that we don't break anything and at the same time also check if other xmltodict.parse instances should keep whitespace (and again evaluate if they also break). Especially since the fix requires a cache refresh (at least for features, since they get stored in a pickled file, not just raw xml).
This is not something I have time for currently.
Hey, I just tried AutoWEKA using the code from #349 and think I found two issues related to arff files and categories. I have only had a look at the KDD Appetency dataset (1111) because AW failed here, but should have produced some results according to your 2019 paper.
@attribute Var141 numeric vs @ATTRIBUTE Var141 STRING
If there is an attribute for which a category with the same letters in the same order but different casing exists, the benchmark appears to drop one. This can be seen in Var217, where the original arff file has uUsP and UUSp, but the file dataset_test_0.arff only has the category UUSP. In case you retrieve the categories from the server, this is a server issue, as the server swallows the extra category as can be seen here, most likely it is https://github.com/openml/OpenML/issues/1114. This results in