openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Failures on several new datasets #223

Open mfeurer opened 3 years ago

mfeurer commented 3 years ago

Hi everyone,

we ran Auto-sklearn on all new datasets from studies 271 and 269 and noticed several failures, caused by datasets containing string features and by categories appearing at test time that were unseen during training:

Dataset 359948

[ERROR] [amlb.benchmark:22:48:31.978] could not convert string to float: './SAT09/CRAFTED/rbsat/crafted/forced/rbsat-v2640c305320gyes10.cnf-1-MPhaseSAT_2011.02.15'
Traceback (most recent call last):
  File "/bench/amlb/benchmark.py", line 414, in run
    meta_result = framework.run(self._dataset, task_config)
  File "/bench/frameworks/autosklearn_1032/__init__.py", line 16, in run
    X_enc=dataset.train.X_enc,
  File "/bench/amlb/utils/cache.py", line 73, in decorator
    return cache(self, prop_name, prop_fn)
  File "/bench/amlb/utils/cache.py", line 30, in cache
    value = fn(self)
  File "/bench/amlb/utils/process.py", line 520, in profiler
    return fn(*args, **kwargs)
  File "/bench/amlb/data.py", line 143, in X_enc
    return self.data_enc[:, predictors_ind]
  File "/bench/amlb/utils/cache.py", line 73, in decorator
    return cache(self, prop_name, prop_fn)
  File "/bench/amlb/utils/cache.py", line 30, in cache
    value = fn(self)
  File "/bench/amlb/utils/process.py", line 520, in profiler
    return fn(*args, **kwargs)
  File "/bench/amlb/data.py", line 130, in data_enc
    encoded_cols = [f.label_encoder.transform(self.data[:, f.index]) for f in self.dataset.features]
  File "/bench/amlb/data.py", line 130, in <listcomp>
    encoded_cols = [f.label_encoder.transform(self.data[:, f.index]) for f in self.dataset.features]
  File "/bench/amlb/datautils.py", line 246, in transform
    return return_value(vec.astype(self.encoded_type, copy=False))
ValueError: could not convert string to float: './SAT09/CRAFTED/rbsat/crafted/forced/rbsat-v2640c305320gyes10.cnf-1-MPhaseSAT_2011.02.15'

Similar issues in datasets 359947, 359945 and 359942.

Dataset 359991

[ERROR] [amlb.benchmark:20:26:56.796] Found unknown categories ['S'] in column 0 during transform
Traceback (most recent call last):
  File "/bench/amlb/benchmark.py", line 414, in run
    meta_result = framework.run(self._dataset, task_config)
  File "/bench/frameworks/autosklearn_1032/__init__.py", line 16, in run
    X_enc=dataset.train.X_enc,
  File "/bench/amlb/utils/cache.py", line 73, in decorator
    return cache(self, prop_name, prop_fn)
  File "/bench/amlb/utils/cache.py", line 30, in cache
    value = fn(self)
  File "/bench/amlb/utils/process.py", line 520, in profiler
    return fn(*args, **kwargs)
  File "/bench/amlb/data.py", line 143, in X_enc
    return self.data_enc[:, predictors_ind]
  File "/bench/amlb/utils/cache.py", line 73, in decorator
    return cache(self, prop_name, prop_fn)
  File "/bench/amlb/utils/cache.py", line 30, in cache
    value = fn(self)
  File "/bench/amlb/utils/process.py", line 520, in profiler
    return fn(*args, **kwargs)
  File "/bench/amlb/data.py", line 130, in data_enc
    encoded_cols = [f.label_encoder.transform(self.data[:, f.index]) for f in self.dataset.features]
  File "/bench/amlb/data.py", line 130, in <listcomp>
    encoded_cols = [f.label_encoder.transform(self.data[:, f.index]) for f in self.dataset.features]
  File "/bench/amlb/datautils.py", line 257, in transform
    res = self.delegate.transform(self._reshape(vec), **params).astype(self.encoded_type, copy=False)
  File "/bench/venv/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 700, in transform
    X_int, _ = self._transform(X)
  File "/bench/venv/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['S'] in column 0 during transform

Similar issue for dataset 211986
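Both failures share one mechanism: the label encoder is fitted on the category list that comes with the dataset metadata, so any value absent from that list at transform time (a raw string value, or a category with different casing) fails. A hypothetical minimal encoder illustrating this, not amlb's actual implementation:

```python
class MinimalLabelEncoder:
    """Hypothetical minimal encoder, for illustration only (not amlb's code)."""
    def fit(self, categories):
        # categories come from the dataset metadata, not from the data itself
        self.mapping = {c: i for i, c in enumerate(categories)}
        return self

    def transform(self, values):
        try:
            return [self.mapping[v] for v in values]
        except KeyError as e:
            raise ValueError(f"Found unknown category {e} during transform")

enc = MinimalLabelEncoder().fit(['s'])  # metadata lists only lowercase 's'
enc.transform(['s'])    # works
# enc.transform(['S'])  # raises ValueError: casing differs from the metadata
```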

sebhrusen commented 3 years ago

Hi @mfeurer, may I ask how you ran autosklearn against those datasets? Did you use the latest code on master?

2e41ca1 (HEAD -> master, origin/master, origin/HEAD) increase default activity timeout to allow installation of large libraries when building docker image (#222)

I just tried with a docker image I built recently for autosklearn and it seems to work:

                    id  task    framework constraint fold    result metric    mode version params       app_version                  utc  duration  training_duration  predict_duration models_count        seed       acc       auc    balacc   logloss
0  openml.org/t/359991  kick  autosklearn       test    0  0.780521    auc  docker  0.12.0         dev [NA, NA, NA]  2020-12-14T14:49:28     696.5              613.3               2.4            5  2144378946  0.902726  0.780521  0.623346  0.371129
1  openml.org/t/359991  kick  autosklearn       test    1  0.784133    auc  docker  0.12.0         dev [NA, NA, NA]  2020-12-14T14:59:57     629.2              606.9               2.5            4  2144378947  0.902726  0.784133  0.625739  0.355173
sebhrusen commented 3 years ago
ValueError: Found unknown categories ['S'] in column 0 during transform

This dataset has a categorical column represented as:

@attribute 'Trim' {'1','150','2','250','3','3 R','Adv','Bas','C','Car','CE','Cin','Cla','Cus','CX','CXL','CXS','DE','Den','DS','Dur','DX','eC','Edd','Edg','eL','Ent','ES','EX','EX-','Exe','FX4','GL','GLE','GLS','GS','GT','GTC','GTP','GTS','GX','GXE','GXP','Har','Her','Hig','Hyb','i','JLS','JLX','Kin','L','L 3','L10','L20','L30','Lar','LE','Lim','LL','LS','LT','LTZ','Lux','LW2','LW3','LX','LXi','Max','Maz','Nor','Out','Ove','OZ','Plu','Pre','Pro','R/T','Ral','Ren','RS','RT','s','S','SC1','SC2','SE','SE-','SEL','SES','Si','Sig','SL','SL1','SL2','SLE','SLT','Spe','Spo','Spy','SR5','SS','ST','Sta','STX','SV6','SVT','SX','SXT','T5','Tou','Ult','Val','VP','W/T','X','XE','XL','XLS','XLT','XR','XRS','XS','Xsp','Z24','Z71','ZR2','ZTS','ZTW','ZX2','ZX3','ZX4','ZX5','ZXW'}

meaning that the column contains both 's' and 'S' as categorical values. The ARFF file distinguishes them, but the OpenML metadata doesn't, leading to your error. This has been fixed by https://github.com/openml/automlbenchmark/pull/208. Of course, the assumption there is that categorical values are case-insensitive; I hope we don't have any dataset where that doesn't hold.
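That case-insensitivity assumption could be sketched as follows (illustrative only, not the actual code from #208): normalize casing before fitting and transforming, so 's' and 'S' collapse to one code.

```python
def normalize(values):
    # lowercase string categories; leave non-strings (e.g. NaN) untouched
    return [v.lower() if isinstance(v, str) else v for v in values]

# a few categories from the 'Trim' attribute above, after normalization
categories = normalize(['1', '150', 's', 'S', 'SE'])
# dict.fromkeys deduplicates while preserving order: 's' and 'S' merge
mapping = {c: i for i, c in enumerate(dict.fromkeys(categories))}

def encode(values):
    return [mapping[c] for c in normalize(values)]

encode(['S']) == encode(['s'])  # both map to the same code
```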

Another issue that was occurring with https://www.openml.org/t/211986 is addressed by https://github.com/openml/automlbenchmark/pull/209.

All those fixes are now in master.

sebhrusen commented 3 years ago

The data associated with openml/t/359948 has been deactivated (https://www.openml.org/d/23701); I can't run it anymore. @PGijsbers, is that on purpose? Also, the data associated with openml/t/359942 has a string column; the app doesn't currently support such columns (and won't before https://github.com/openml/automlbenchmark/issues/116), so those datasets should be removed.

What currently happens with string columns is that the ARFF file and/or OpenML don't provide a list of categories (by definition...), so the app fails when trying to label-encode them. Suggestions (sorted by my preference):

  1. remove those datasets.
  2. translate the entire column to NaNs for frameworks requiring numerical data.
  3. extract unique values from the data (can be huge) to allow encoding.
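Option 3 could be sketched like this (helper name and data are illustrative, not existing amlb code):

```python
def categories_from_data(column_values):
    """Illustrative helper: derive the category list from the column itself."""
    # sorted unique values; for free-text columns this set can be huge
    return sorted({v for v in column_values if v is not None})

col = ['./a.cnf', './b.cnf', './a.cnf', None]
cats = categories_from_data(col)             # ['./a.cnf', './b.cnf']
mapping = {c: i for i, c in enumerate(cats)}
encoded = [mapping.get(v, -1) for v in col]  # -1 marks missing values
```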
PGijsbers commented 3 years ago

I am not sure why dataset 23701 is deactivated, I'll ask Joaquin.

The college dataset only has string values on attributes that should be ignored; they are marked as such on OpenML. All other attributes are either nominal or numeric. Our framework should ignore features listed under OpenMLDataset.ignore_attribute:

>>> import openml
>>> d = openml.datasets.get_dataset(42727)
>>> d.ignore_attribute
['school_name', 'school_webpage']
sebhrusen commented 3 years ago

Our framework should ignore features labeled as "ignore"

@PGijsbers, agreed, and the fix was surprisingly non-trivial: https://github.com/openml/automlbenchmark/pull/224

mfeurer commented 3 years ago

Hi @mfeurer, may i ask you how did you run autosklearn against those datasets? Did you use the latest code on master?

No, but I'm doing that now.

  1. python runbenchmark.py autosklearn openml/t/359948 -m local -f 0

gives the first error message, and the three other tasks I mentioned fail as well. But it appears that you have found this issue too and are about to fix it.

  2. Issues with datasets 359991 and 211986 are now fixed by using the master branch.
PGijsbers commented 3 years ago

@sebhrusen The deactivated dataset (https://www.openml.org/d/23701) is not associated with openml/t/359948. According to Joaquin that dataset has been deactivated for years.

sebhrusen commented 3 years ago

@PGijsbers typo on my side... I was running openml/t/259948!

PGijsbers commented 3 years ago

To which benchmark should that task belong? I don't see it in any of /s/218, /s/269, /s/270 and /s/271.

sebhrusen commented 3 years ago

To which benchmark should that task belong? I don't see it in any of /s/218, /s/269, /s/270 and /s/271.

None, it was a typo! Based on the failing tasks described in this ticket I ran openml/t/259948 instead of openml/t/359948, and the fact that it was failing too led me to believe it was the cause.

mfeurer commented 3 years ago

I just tried the new tasks and it turns out that task 360115 has several string features which cannot be handled by the benchmark code itself:

[ERROR] [amlb.benchmark:13:35:02.848] could not convert string to float: 'crvh'
Traceback (most recent call last):
  File "/bench/amlb/benchmark.py", line 444, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "/bench/frameworks/autosklearn/__init__.py", line 16, in run
    X_enc=dataset.train.X_enc,
  File "/bench/amlb/utils/cache.py", line 73, in decorator
    return cache(self, prop_name, prop_fn)
  File "/bench/amlb/utils/cache.py", line 30, in cache
    value = fn(self)
  File "/bench/amlb/utils/process.py", line 521, in profiler
    return fn(*args, **kwargs)
  File "/bench/amlb/data.py", line 149, in X_enc
    return self.data_enc[:, predictors_ind]
  File "/bench/amlb/utils/cache.py", line 73, in decorator
    return cache(self, prop_name, prop_fn)
  File "/bench/amlb/utils/cache.py", line 30, in cache
    value = fn(self)
  File "/bench/amlb/utils/process.py", line 521, in profiler
    return fn(*args, **kwargs)
  File "/bench/amlb/data.py", line 136, in data_enc
    encoded_cols = [f.label_encoder.transform(self.data[:, f.index]) for f in self.dataset.features]
  File "/bench/amlb/data.py", line 136, in <listcomp>
    encoded_cols = [f.label_encoder.transform(self.data[:, f.index]) for f in self.dataset.features]
  File "/bench/amlb/datautils.py", line 247, in transform
    return return_value(vec.astype(self.encoded_type, copy=False))
ValueError: could not convert string to float: 'crvh'

Excerpt from the features.xml

<oml:feature>
  <oml:index>14740</oml:index>
  <oml:name>Var14741</oml:name>
  <oml:data_type>string</oml:data_type>
  <oml:is_target>false</oml:is_target>
  <oml:is_ignore>false</oml:is_ignore>
  <oml:is_row_identifier>false</oml:is_row_identifier>
  <oml:number_of_missing_values>36726</oml:number_of_missing_values>
</oml:feature>
<oml:feature>
  <oml:index>14741</oml:index>
  <oml:name>Var14742</oml:name>
  <oml:data_type>string</oml:data_type>
  <oml:is_target>false</oml:is_target>
  <oml:is_ignore>false</oml:is_ignore>
  <oml:is_row_identifier>false</oml:is_row_identifier>
  <oml:number_of_missing_values>49141</oml:number_of_missing_values>
</oml:feature>
<oml:feature>
  <oml:index>14742</oml:index>
  <oml:name>Var14743</oml:name>
  <oml:data_type>string</oml:data_type>
  <oml:is_target>false</oml:is_target>
  <oml:is_ignore>false</oml:is_ignore>
  <oml:is_row_identifier>false</oml:is_row_identifier>
  <oml:number_of_missing_values>48917</oml:number_of_missing_values>
</oml:feature>
PGijsbers commented 3 years ago

Thanks for the report! That's an error made when uploading the dataset; those features should be nominal. We'll update it.

mfeurer commented 3 years ago

And one more using task 360112 and fold 4:

[ERROR] [amlb.benchmark:14:20:52.275] 23 columns passed, passed data had 22 columns
Traceback (most recent call last):
  File "/bench/venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 564, in _list_to_arrays
    columns = _validate_or_indexify_columns(content, columns)
  File "/bench/venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 689, in _validate_or_indexify_columns
    f"{len(columns)} columns passed, passed data had "
AssertionError: 23 columns passed, passed data had 22 columns

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/bench/amlb/benchmark.py", line 444, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "/bench/frameworks/autosklearn/__init__.py", line 27, in run
    input_data=data, dataset=dataset, config=config)
  File "/bench/frameworks/shared/caller.py", line 93, in run_in_venv
    target_is_encoded=res.target_is_encoded)
  File "/bench/amlb/results.py", line 238, in save_predictions
    df = to_data_frame(probabilities, columns=prob_cols)
  File "/bench/amlb/datautils.py", line 150, in to_data_frame
    return pd.DataFrame.from_records(obj, columns=columns)
  File "/bench/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 1786, in from_records
    arrays, columns = to_arrays(data, columns)
  File "/bench/venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 548, in to_arrays
    return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
  File "/bench/venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 567, in _list_to_arrays
    raise ValueError(e) from e
ValueError: 23 columns passed, passed data had 22 columns
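The thread doesn't pin down the root cause, but the pandas error means the probability rows returned by the framework were one column narrower than the expected per-class columns. A hedged, hypothetical pre-check that would fail with a clearer message (names are illustrative, not amlb's API):

```python
def check_probabilities(probabilities, prob_cols):
    """Hypothetical guard: validate row width against the class columns."""
    for i, row in enumerate(probabilities):
        if len(row) != len(prob_cols):
            raise ValueError(
                f"row {i} has {len(row)} probabilities, "
                f"expected {len(prob_cols)} for columns {prob_cols}")
    return probabilities

check_probabilities([[0.2, 0.8]], ['neg', 'pos'])  # passes
# check_probabilities([[1.0]], ['neg', 'pos'])     # raises with the row index
```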

I tried to reproduce it locally using the latest version of the connector (the run above used commit 2e41ca137c5fe46308252c0527c84716badfc3cb) and got:

Removing ignored columns None.
Job `local.openml_t_360112.1h1c.KDDCup99.5.autosklearn` failed with error: argument of type 'NoneType' is not iterable
argument of type 'NoneType' is not iterable
Traceback (most recent call last):
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/job.py", line 69, in start
    result = self._run()
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/benchmark.py", line 397, in _run
    return self.run()
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/utils/process.py", line 521, in profiler
    return fn(*args, **kwargs)
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/benchmark.py", line 449, in run
    self._dataset.release()
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/utils/process.py", line 521, in profiler
    return fn(*args, **kwargs)
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/data.py", line 230, in release
    self.train.release()
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/utils/process.py", line 521, in profiler
    return fn(*args, **kwargs)
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/datasets/openml.py", line 100, in train
    self._ensure_loaded()
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/utils/process.py", line 521, in profiler
    return fn(*args, **kwargs)
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/datasets/openml.py", line 155, in _ensure_loaded
    self._load_split()
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/datasets/openml.py", line 163, in _load_split
    self._prepare_split_data(train_path, test_path)
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/datasets/openml.py", line 182, in _prepare_split_data
    col_selector, attributes = zip(*[(i, a) for i, a in enumerate(ds['attributes'])
  File "/home/feurerm/sync_dir/projects/openml/automlbenchmark/amlb/datasets/openml.py", line 183, in <listcomp>
    if a[0] not in self._oml_dataset.ignore_attribute])
TypeError: argument of type 'NoneType' is not iterable
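The final frame fails because `OpenMLDataset.ignore_attribute` can be `None` while the membership test iterates over it. A minimal sketch of the obvious guard (illustrative, not the actual fix):

```python
def select_predictors(attributes, ignore_attribute):
    """Illustrative guard: treat ignore_attribute=None as 'ignore nothing'."""
    ignored = ignore_attribute or []  # None becomes an empty list
    return [i for i, a in enumerate(attributes) if a[0] not in ignored]

attrs = [('Var1', 'NUMERIC'), ('Var2', 'STRING')]
select_predictors(attrs, None)      # [0, 1] instead of a TypeError
select_predictors(attrs, ['Var2'])  # [0]
```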
PGijsbers commented 3 years ago

Thanks for the report! The latter looks easy to fix, but I don't know if it reveals a different problem. I'll have a go at it tomorrow.

Innixma commented 3 years ago

I also found an issue with task 360115: with 32 GB of memory it runs out of memory before even reaching the framework call for AutoGluon. The training data alone takes 5 GB of space and is duplicated enough times in the benchmark to exhaust memory before task.fit is called. I'm not sure whether any other frameworks succeed on this dataset, but it may be too large to reasonably work on 32 GB of memory.

With some optimization it might be possible, by ensuring the data is held in memory only the minimum number of times, then cleaned from memory except for the single instance fed into the framework's fit call.
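That hand-off pattern could be sketched as follows; `Dataset`, `run_framework`, and the sizes are illustrative stand-ins, not amlb code:

```python
import gc
import weakref

class Dataset:
    """Stand-in for multi-GB training data (illustrative only)."""
    def __init__(self):
        self.blob = bytearray(1024)

def run_framework(framework_fit):
    data = Dataset()
    probe = weakref.ref(data)  # lets us verify the data was released
    framework_fit(data)        # single hand-off to the framework fit call
    del data                   # caller drops its reference right after
    gc.collect()
    return probe() is None     # True if no extra copy kept it alive

run_framework(lambda d: len(d.blob))  # returns True
```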

[INFO] [amlb.print:20:49:15.642]   'problem_type': 'binary',
[INFO] [amlb.print:20:49:15.642]   'target': {'classes': ['-1', '1'], 'name': 'upselling'},
[INFO] [amlb.print:20:49:15.642]   'test': {'data': '/tmp/tmppz35pulo/test.data.npy'},
[INFO] [amlb.print:20:49:15.642]   'train': {'data': '/tmp/tmppz35pulo/train.data.npy'}}
[INFO] [amlb.print:20:49:15.926] Traceback (most recent call last):
[INFO] [amlb.print:20:49:15.926]   File "/repo/frameworks/shared/callee.py", line 85, in call_run
[INFO] [amlb.print:20:49:15.926]     result = run_fn(ds, config)
[INFO] [amlb.print:20:49:15.926]   File "/repo/frameworks/AutoGluon/exec.py", line 47, in run
[INFO] [amlb.print:20:49:15.926]     train = pd.DataFrame(dataset.train.data, columns=column_names).astype(column_types, copy=False)
[INFO] [amlb.print:20:49:15.926]   File "/repo/frameworks/AutoGluon/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 497, in __init__
[INFO] [amlb.print:20:49:15.927]     mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
[INFO] [amlb.print:20:49:15.927]   File "/repo/frameworks/AutoGluon/venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 234, in init_ndarray
[INFO] [amlb.print:20:49:15.927]     return create_block_manager_from_blocks(block_values, [columns, index])
[INFO] [amlb.print:20:49:15.927]   File "/repo/frameworks/AutoGluon/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1675, in create_block_manager_from_blocks
[INFO] [amlb.print:20:49:15.927]     mgr._consolidate_inplace()
[INFO] [amlb.print:20:49:15.927]   File "/repo/frameworks/AutoGluon/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 988, in _consolidate_inplace
[INFO] [amlb.print:20:49:15.927]     self.blocks = tuple(_consolidate(self.blocks))
[INFO] [amlb.print:20:49:15.927]   File "/repo/frameworks/AutoGluon/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1909, in _consolidate
[INFO] [amlb.print:20:49:15.927]     list(group_blocks), dtype=dtype, can_consolidate=_can_consolidate
[INFO] [amlb.print:20:49:15.928]   File "/repo/frameworks/AutoGluon/venv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1934, in _merge_blocks
[INFO] [amlb.print:20:49:15.928]     new_values = new_values[argsort]
[INFO] [amlb.print:20:49:15.928] MemoryError: Unable to allocate 5.03 GiB for an array with shape (15001, 45000) and data type object
[ERROR] [amlb.benchmark:20:49:16.114] Unable to allocate 5.03 GiB for an array with shape (15001, 45000) and data type object
Traceback (most recent call last):
  File "/repo/amlb/benchmark.py", line 444, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "/repo/frameworks/AutoGluon/__init__.py", line 28, in run
    input_data=data, dataset=dataset, config=config)
  File "/repo/frameworks/shared/caller.py", line 78, in run_in_venv
    raise NoResultError(res.error_message)
amlb.results.NoResultError: Unable to allocate 5.03 GiB for an array with shape (15001, 45000) and data type object
[INFO] [amlb.results:20:49:18.986] Loading metadata from `/s3bucket/output/predictions/KDDCup09-Upselling/0/metadata.json`.
[INFO] [amlb.results:20:49:20.409] Metric scores: { 'acc': nan,
  'app_version': 'dev [https://github.com/Innixma/automlbenchmark, '
                 'autogluon-workspace, 5a1ac12]',
  'auc': nan,
  'balacc': nan,
  'constraint': '1h8c',
  'duration': nan,
  'fold': 0,
  'framework': 'AutoGluon',
  'id': 'openml.org/t/360115',
  'info': 'NoResultError: Unable to allocate 5.03 GiB for an array with shape '
          '(15001, 45000) and data type object',
  'logloss': nan,
  'metric': 'auc',
  'mode': 'aws',
  'models_count': nan,
  'params': '',
  'predict_duration': nan,
  'result': nan,
  'seed': 462990052,
  'task': 'KDDCup09-Upselling',
  'training_duration': nan,
  'utc': '2020-12-24T20:49:20',
  'version': '0.0.15'}
[INFO] [amlb.job:20:49:20.410] Job local.openml_s_271.1h8c.KDDCup09-Upselling.0.AutoGluon executed in 1980.425 seconds.
[INFO] [amlb.job:20:49:20.411] All jobs executed in 1980.426 seconds.
[INFO] [amlb.utils.process:20:49:20.412] [local.openml_s_271.1h8c.KDDCup09-Upselling.0.AutoGluon] CPU Utilization: 12.5%
[INFO] [amlb.utils.process:20:49:20.412] [local.openml_s_271.1h8c.KDDCup09-Upselling.0.AutoGluon] Memory Usage: 37.2%
[INFO] [amlb.utils.process:20:49:20.412] [local.openml_s_271.1h8c.KDDCup09-Upselling.0.AutoGluon] Disk Usage: 1.3%
[INFO] [amlb.benchmark:20:49:20.412] Processing results for 
[INFO] [amlb.results:20:49:20.428] Scores saved to `/s3bucket/output/scores/AutoGluon.task_KDDCup09-Upselling.csv`.
[INFO] [amlb.results:20:49:20.440] Scores saved to `/s3bucket/output/scores/results.csv`.
[INFO] [amlb.results:20:49:20.451] Scores saved to `/s3bucket/output/results.csv`.
[INFO] [amlb.benchmark:20:49:20.458] Summing up scores for current run:
                    id                task  framework constraint fold metric mode version params                                                                     app_version                  utc  duration models_count       seed                                                                                                    info
0  openml.org/t/360115  KDDCup09-Upselling  AutoGluon       1h8c    0    auc  aws  0.0.15         dev [https://github.com/Innixma/automlbenchmark, autogluon-workspace, 5a1ac12]  2020-12-24T20:49:20    1980.4               462990052  NoResultError: Unable to allocate 5.03 GiB for an array with shape (15001, 45000) and data type object
PGijsbers commented 3 years ago

python runbenchmark.py constantpredictor openml/t/360115 works for me. While it is slow (~8 minutes), it does not run into memory issues. The ARFF file is only 1.7GB. I did notice the feature types were not marked correctly on openml, so we made openml/t/360116, but this does not fix the issue. We'll look into improving the data loading after our break.