openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Train/test file header may not contain all categories of a categorical variable #350

Open mfeurer opened 3 years ago

mfeurer commented 3 years ago

Hey, I just tried AutoWEKA using the code from #349 and I think I found two issues related to ARFF files and categories. I have only looked at the KDD Appetency dataset (1111), because AutoWEKA failed here even though it should have produced some results according to your 2019 paper.

  1. Empty numerical columns (i.e., columns in which all values are missing) are emitted with a string type: `@attribute Var141 numeric` in the original file becomes `@ATTRIBUTE Var141 STRING`.
  2. If an attribute has two categories that contain the same letters in the same order but differ in casing, the benchmark appears to drop one of them. This can be seen in Var217: the original ARFF file has both uUsP and UUSp, but the file dataset_test_0.arff only has the category UUSP. If you retrieve the categories from the server, this is a server issue: the server swallows the extra category, as can be seen here; most likely it is https://github.com/openml/OpenML/issues/1114. (A minimal detector for such casing collisions is sketched after the log below.) This results in

    **** AutoWEKA [vlatest]****
    
    Using 4096MB memory per run on 8 parallel runs.
    Running cmd `java -cp /bench/frameworks/AutoWEKA/lib/autoweka/autoweka.jar:/bench/frameworks/AutoWEKA/lib/weka/weka.jar weka.classifiers.meta.AutoWEKAClassifier -t "/input/org/openml/www/datasets/1111/dataset_train_0.arff" -T "/input/org/openml/www/datasets/1111/dataset_test_0.arff" -memLimit 4096 -classifications "weka.classifiers.evaluation.output.prediction.CSV -distribution -file \"/output/predictions/KDDCup09_appetency/0/predictions.weka_pred.csv\"" -timeLimit 60 -parallelRuns 8 -metric areaUnderROC -seed 17193`
    java.io.IOException: nominal value not declared in header, read Token[uUsP], line 261
        at weka.core.converters.ArffLoader$ArffReader.errorMessage(ArffLoader.java:354)
        at weka.core.converters.ArffLoader$ArffReader.getInstanceFull(ArffLoader.java:719)
        at weka.core.converters.ArffLoader$ArffReader.getInstance(ArffLoader.java:545)
        at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:514)
        at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:500)
        at weka.core.converters.ArffLoader.getDataSet(ArffLoader.java:1286)
        at weka.core.converters.ConverterUtils$DataSource.getDataSet(ConverterUtils.java:266)
        at weka.core.converters.ConverterUtils$DataSource.getDataSet(ConverterUtils.java:289)
        at weka.classifiers.evaluation.Evaluation.evaluateModel(Evaluation.java:1618)
        at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:668)
        at weka.classifiers.AbstractClassifier.runClassifier(AbstractClassifier.java:141)
        at weka.classifiers.meta.AutoWEKAClassifier.main(AutoWEKAClassifier.java:266)
    java.lang.NullPointerException
        at weka.core.Capabilities.test(Capabilities.java:1138)
        at weka.core.Capabilities.testWithFail(Capabilities.java:1468)
        at weka.classifiers.meta.AutoWEKAClassifier.buildClassifier(AutoWEKAClassifier.java:298)
        at weka.classifiers.evaluation.Evaluation.evaluateModel(Evaluation.java:1632)
        at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:668)
        at weka.classifiers.AbstractClassifier.runClassifier(AbstractClassifier.java:141)
        at weka.classifiers.meta.AutoWEKAClassifier.main(AutoWEKAClassifier.java:266)
    java.io.IOException: nominal value not declared in header, read Token[uUsP], line 261
        at weka.core.converters.ArffLoader$ArffReader.errorMessage(ArffLoader.java:354)
        at weka.core.converters.ArffLoader$ArffReader.getInstanceFull(ArffLoader.java:719)
        at weka.core.converters.ArffLoader$ArffReader.getInstance(ArffLoader.java:545)
        at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:514)
        at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:500)
        at weka.core.converters.ArffLoader.getDataSet(ArffLoader.java:1286)
        at weka.core.converters.ConverterUtils$DataSource.getDataSet(ConverterUtils.java:266)
        at weka.core.converters.ConverterUtils$DataSource.getDataSet(ConverterUtils.java:289)
        at weka.classifiers.evaluation.Evaluation.evaluateModel(Evaluation.java:1618)
        at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:668)
        at weka.classifiers.AbstractClassifier.runClassifier(AbstractClassifier.java:141)
        at weka.classifiers.meta.AutoWEKAClassifier.main(AutoWEKAClassifier.java:266)
    java.lang.NullPointerException
        at weka.core.Capabilities.test(Capabilities.java:1138)
        at weka.core.Capabilities.testWithFail(Capabilities.java:1468)
        at weka.classifiers.meta.AutoWEKAClassifier.buildClassifier(AutoWEKAClassifier.java:298)
        at weka.classifiers.evaluation.Evaluation.evaluateModel(Evaluation.java:1632)
        at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:668)
        at weka.classifiers.AbstractClassifier.runClassifier(AbstractClassifier.java:141)
        at weka.classifiers.meta.AutoWEKAClassifier.main(AutoWEKAClassifier.java:266)
    
    AutoWEKA failed producing any prediction.
    Traceback (most recent call last):
      File "/bench/amlb/benchmark.py", line 511, in run
        meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
      File "/bench/frameworks/AutoWEKA/__init__.py", line 10, in run
        return run(*args, **kwargs)
      File "/bench/frameworks/AutoWEKA/exec.py", line 80, in run
        raise NoResultError("AutoWEKA failed producing any prediction.")
    amlb.results.NoResultError: AutoWEKA failed producing any prediction.
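
As mentioned in point 2, a minimal sketch of such a casing-collision check (the file path is illustrative; liac-arff represents a nominal attribute's type as the list of its categories):

import arff  # liac-arff
from collections import defaultdict

# Flag nominal categories that collide case-insensitively, i.e. candidates for
# being swallowed by the server as in openml/OpenML#1114.
with open("dataset_test_0.arff") as fh:
    decoded = arff.load(fh)

for name, values in decoded["attributes"]:
    if isinstance(values, list):  # nominal attributes carry their category list
        groups = defaultdict(list)
        for value in values:
            groups[value.lower()].append(value)
        collisions = {k: v for k, v in groups.items() if len(v) > 1}
        if collisions:
            print(name, collisions)  # e.g. Var217 {'uusp': ['uUsP', 'UUSp']}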
PGijsbers commented 3 years ago

Since the workshop paper we have indeed done a major overhaul of data loading to rely more on OpenML (#293). Number 2 is a known issue (it's reported in CI for dresses-sales) and indeed what led me to open #1114. I'll look into issue 1.

mfeurer commented 3 years ago

Thanks a lot for the quick answer. Do you by any chance know whether any other dataset from the 2019 benchmark is affected by the second problem?

PGijsbers commented 3 years ago

No, but it should be easy to write a script (download each dataset, then compare the OpenML feature values to the ARFF header/dataframe content). If it's important to you, let me know; otherwise I'd just wait for the fix from OpenML.

PGijsbers commented 3 years ago

Do you happen to know a dataset which has problem 1 (all-null columns) but not problem 2?

mfeurer commented 3 years ago

> Do you happen to know a dataset which has problem 1 (all-null columns) but not problem 2?

No, sorry.

> No, but it should be easy to write a script (download each dataset, then compare the OpenML feature values to the ARFF header/dataframe content). If it's important to you, let me know; otherwise I'd just wait for the fix from OpenML.

Great idea, I just wrote a brief snippet to do so:

import openml
import arff

# Task IDs of the 2019 AutoML benchmark suite.
openml_automl_benchmark = [
    189871, 189872, 189873, 168794, 168792, 168793, 75105, 189906, 189909, 189908, 167185, 189874, 189861, 189866,
    168797, 168796, 189860, 189862, 168798, 189865, 126026, 167104, 167083, 189905, 75127, 167200, 167184, 167201,
    168795, 126025, 75097, 167190, 126029, 167149, 167152, 167168, 167181, 75193, 167161
]
for task_id in openml_automl_benchmark:
    task = openml.tasks.get_task(task_id)
    dataset = task.get_dataset()
    try:
        with open('/home/feurerm/.openml/cache/org/openml/www/datasets/%d/dataset_train_0.arff' % dataset.id) as fh:
            arff_array = arff.load(fh)
    except Exception as e:
        print(dataset.id, e)
        continue
    attributes = {attr[0]: attr[1] for attr in arff_array['attributes']}
    for feat in dataset.features:
        # Categories as reported by the OpenML server (None for numeric features).
        feat_by_openml = dataset.features[feat].nominal_values
        try:
            # Type/categories as declared in the ARFF header written by the benchmark.
            feat_by_arff = attributes[dataset.features[feat].name]
        except Exception as e:
            print(e, feat, dataset.features[feat].name, task_id)
            continue
        if feat_by_openml is None:
            assert feat_by_arff in ('REAL', 'INTEGER'), (task_id, dataset.features[feat].name, feat_by_arff)
        elif task_id == 167181 and dataset.features[feat].name == 'defects':
            continue  # deliberately excluded from the check
        else:
            assert set(feat_by_openml) == set(feat_by_arff), (task_id, dataset.features[feat].name, set(feat_by_openml) ^ set(feat_by_arff))
It appears that only KDD is affected by this. And I guess this only affects frameworks that work on the ARFF files directly, as I managed to run TunedRandomForest without any issues?

PGijsbers commented 3 years ago

> And I guess this only affects frameworks that work on the ARFF files directly

Yes, and more specifically only those that use some internal mechanism that relies on the ARFF header in specific ways. The R packages, for instance, don't have an issue even though they use the ARFF file.

> It appears that only KDD is affected by this.

Good to know, thanks!

PGijsbers commented 3 years ago

I made `issues/350`, which should fix the issue of incorrectly assigning a STRING type to empty numerical columns. However, I had been unable to verify that it works (in part because of the other problem, but I'm working on circumventing that). Edit: managed to confirm it works, opened the PR.
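
For illustration, a minimal sketch of the idea behind such a fix (not the actual patch; `feature` stands for an openml-python OpenMLDataFeature and `arff_type` is a hypothetical helper): derive the ARFF attribute type from OpenML's feature metadata instead of inferring it from the possibly all-missing column values.

# Hypothetical helper: pick the ARFF attribute type from OpenML metadata, so
# that an all-missing numeric column is still declared NUMERIC, not STRING.
def arff_type(feature):
    if feature.data_type == "numeric":
        return "NUMERIC"
    if feature.data_type == "nominal":
        return feature.nominal_values  # liac-arff expects the category list here
    return "STRING"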

PGijsbers commented 3 years ago

Updated the title to reflect the only outstanding issue, though we hope the fix can come from OpenML instead.

PGijsbers commented 3 years ago

Looks like we found the source of the problem, which has since been updated (link). The last affected datasets should be reset soon and work again. Please remember to clear your cache. I will close this issue once I can verify that AutoWEKA works after the last datasets have been fixed.
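
For reference, a minimal way to force a fresh download, assuming the cache layout seen earlier in this thread (the root may be ~/.openml or ~/.openml/cache depending on the openml-python version):

# Hypothetical cache cleanup: drop the cached files of dataset 1111 so that
# openml-python re-downloads the fixed version on the next access.
import shutil
from pathlib import Path

shutil.rmtree(Path.home() / ".openml" / "org" / "openml" / "www" / "datasets" / "1111",
              ignore_errors=True)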

Coorsaa commented 2 years ago

We found other issues regarding factor variables in the census-income dataset. For example, the header of the .arff files declares the target variable as `@ATTRIBUTE V42 {'- 50000.', 50000+.}` (the first value is quoted, the second is not). Moreover, within @DATA there are added whitespaces, e.g. ' - 50000.', so that the factor level does not match its declaration.

@PGijsbers found out that the JSON response of the OpenML features API is wrong while the actual XML file is correct, so it requires a bit more debugging to see exactly which parts of the API we use and how that leads to the mismatch.
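
A quick way to compare the two representations (the endpoint URLs below follow the v1 API pattern and are an assumption; 4535 is census-income):

# Hypothetical check: fetch the feature list in both formats and look at the
# raw target levels to see where the leading whitespace gets lost.
import requests

xml_text = requests.get("https://www.openml.org/api/v1/data/features/4535").text
json_text = requests.get("https://www.openml.org/api/v1/json/data/features/4535").text

print([line.strip() for line in xml_text.splitlines() if "50000" in line])
print(" - 50000." in json_text, "- 50000." in json_text)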

PGijsbers commented 2 years ago

TL;DR: It's caused by openml-python leaving an xmltodict parameter at its default value. I have~~n't yet~~ been able to test the fix with mlr3automl.

I observed that the target feature is not the only feature that exhibits this behavior. It seems that:

  - The ARFF definition specifies that values containing spaces must be quoted. It does not specify whether all values must be quoted when only one of them contains a space. I think we may conclude that the header would be valid on its own, but not together with the data, which contains the values with whitespace.

  - The original header of the OpenML ARFF file is valid and contains the leading whitespace. The provided features.xml is valid and contains the leading whitespace too (though the JSON does not).

From the AMLB perspective, the stripping of whitespace happens at the openml-python level:

>>> census.features[41].nominal_values
['- 50000.', '50000+.']

we rely on these features (originating from the XML) to write the ARFF header.

From the openml-python perspective, stripping the whitespace is actually caused by our use of xmltodict, which reads:

import xmltodict

with open(r"~\.openml\org\openml\www\datasets\4535\features.xml", encoding='utf8') as fh:
    xml = fh.read()

census_features = xmltodict.parse(
    xml,
    force_list=("oml:feature", "oml:nominal_value"),
)
print(census_features["oml:data_features"]["oml:feature"][1]["oml:nominal_value"])
# ['Federal government', 'Local government', ...]

despite the values still containing their whitespace in the XML (e.g., <oml:nominal_value> Federal government</oml:nominal_value>). This can be avoided by setting strip_whitespace=False; see also this open issue for a discussion in xmltodict.
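
A self-contained illustration of that default (made-up XML fragment):

# xmltodict strips surrounding whitespace from text nodes by default.
import xmltodict

fragment = "<oml:nominal_value> Federal government</oml:nominal_value>"
print(xmltodict.parse(fragment)["oml:nominal_value"])                          # 'Federal government'
print(xmltodict.parse(fragment, strip_whitespace=False)["oml:nominal_value"])  # ' Federal government'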

@joaquinvanschoren @mfeurer I think we want to update openml-python here and not strip whitespace, since it introduces an inconsistency with the ARFF file.

The (un)quoted behavior is caused by liac-arff:

['- 50000.', '50000+.']     produces   @ATTRIBUTE V42 {'- 50000.', 50000+.}
[' - 50000.', ' 50000+.']   produces   @ATTRIBUTE V42 {' - 50000.', ' 50000+.'}

though, as noted before, the header is technically valid on its own (liac-arff will not check it against @DATA). With the leading whitespace restored, the header will again quote both values and match the data exactly. I verified that this fix allows mlr3automl to create predictions on census-income.
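
For completeness, the quoting behavior can be reproduced with liac-arff alone (a minimal sketch; the relation name and the empty data section are placeholders):

# Minimal reproduction of the mixed quoting in a generated header.
import arff  # liac-arff

header_only = {
    "relation": "census-income",
    "attributes": [("V42", ["- 50000.", "50000+."])],
    "data": [],
}
print(arff.dumps(header_only))  # header contains: @ATTRIBUTE V42 {'- 50000.', 50000+.}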

PGijsbers commented 2 years ago

@sebhrusen Even if openml-python updates this, I don't really want to bump the openml-python dependency mid-experiments. We can monkey-patch this in stable-v2, either at the framework level or directly in amlb/datasets/openml.py:

import functools
import xmltodict

# Make every xmltodict.parse call keep surrounding whitespace in text values.
xmltodict.parse = functools.partial(xmltodict.parse, strip_whitespace=False)

I'd probably prefer the global patch, even though no other framework seems to have had an issue (yet). We can then revert the change once we bump to a version of openml-python that contains the fix (should they make the change).

sebhrusen commented 2 years ago

@PGijsbers The temporary monkey patch looks like a good approach.

mfeurer commented 2 years ago

I'd be happy to have an update to openml-python. Would you like to create one, or open an issue there?

PGijsbers commented 2 years ago

I opened an issue (https://github.com/openml/openml-python/issues/1125). We currently have the monkey patch in the benchmark, so it's not urgent for this application at least. While the proposed fix is quick and easy, I do think we should carefully evaluate that we don't break anything, and at the same time check whether other xmltodict.parse call sites should also keep whitespace (and again evaluate whether they break). Especially since the fix requires a cache refresh (at least for the features, since they get stored in a pickled file, not just as raw XML). This is not something I have time for currently.
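
For anyone applying the fix later: a sketch of that targeted cache refresh, assuming the pickled features sit in the dataset's cache directory next to features.xml (exact filenames may differ between openml-python versions):

# Hypothetical targeted cleanup: remove only the pickled artifacts so the raw
# features.xml is re-parsed with the corrected whitespace handling.
from pathlib import Path

ds_cache = Path.home() / ".openml" / "org" / "openml" / "www" / "datasets" / "4535"
for pickled in ds_cache.glob("*.pkl"):
    pickled.unlink()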