openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

How to benchmark datasets with a "10 times 10-fold Cross-validation" evaluation procedure #515

Open Innixma opened 1 year ago

Innixma commented 1 year ago

I tried benchmarking on task 168824, which uses a 10-times repeated 10-fold cross-validation procedure.

This would require 100 runs: 10 folds for each of the 10 repeats.

To do this, I made the following code edits: https://github.com/Innixma/automlbenchmark/commit/994b92c718076e70a82553867dc3fa1dddea4db5

I tried setting a constraint to do this:

test100f:
  folds: 100
  max_runtime_seconds: 600
  cores: 4
  min_vol_size_mb: 100000

and created a benchmark yaml file:

---
# 10 times 10-fold Crossvalidation tasks

- name: Australian
  openml_task_id: 168824

However, once I got to folds beyond the first 10, I hit the following error:

[INFO] [amlb:00:23:55.654] Running benchmark `constantpredictor` on `/s3bucket/user/benchmarks/test100f.yaml` framework in `local` mode.
[INFO] [amlb.frameworks.definitions:00:23:55.693] Loading frameworks definitions from ['/repo/resources/frameworks.yaml', '/s3bucket/user/frameworks.yaml'].
[INFO] [amlb.resources:00:23:56.971] Loading benchmark constraint definitions from ['/repo/resources/constraints.yaml', '/s3bucket/user/constraints.yaml'].
[INFO] [amlb.benchmarks.file:00:23:56.985] Loading benchmark definitions from /s3bucket/user/benchmarks/test100f.yaml.
[INFO] [amlb.job:00:23:56.988] 
---------------------------------------------------------------------
Starting job local.test100f.test100f.Australian.13.constantpredictor.
[INFO] [amlb.benchmark:00:23:56.991] Assigning 4 cores (total=4) for new task Australian.
[INFO] [amlb.utils.process:00:23:56.991] [MONITORING] [local.test100f.test100f.Australian.13.constantpredictor] CPU Utilization: 29.7%
[INFO] [amlb.utils.process:00:23:56.993] [MONITORING] [local.test100f.test100f.Australian.13.constantpredictor] Memory Usage: 4.1%
[INFO] [amlb.benchmark:00:23:56.994] Assigning 15086 MB (total=15734 MB) for new Australian task.
[WARNING] [amlb.benchmark:00:23:56.994] WARNING: Available storage (96248.359375 MB / total=99053.671875 MB) does not meet requirements (102048 MB)!
[INFO] [amlb.utils.process:00:23:56.994] [MONITORING] [local.test100f.test100f.Australian.13.constantpredictor] Disk Usage: 2.8%
[INFO] [root:00:23:56.994] Starting [get] request for the URL https://www.openml.org/api/v1/xml/task/168824
[INFO] [root:00:23:57.773] 0.7782946s taken for [get] request for the URL https://www.openml.org/api/v1/xml/task/168824
[INFO] [root:00:23:57.773] Starting [get] request for the URL https://www.openml.org/api/v1/xml/data/40981
[INFO] [root:00:23:58.392] 0.6187537s taken for [get] request for the URL https://www.openml.org/api/v1/xml/data/40981
[INFO] [root:00:23:58.393] Starting [get] request for the URL https://www.openml.org/api/v1/xml/data/features/40981
[INFO] [root:00:23:59.004] 0.6109660s taken for [get] request for the URL https://www.openml.org/api/v1/xml/data/features/40981
[INFO] [root:00:23:59.004] Starting [get] request for the URL https://api.openml.org/data/v1/download/18151910/Australian.arff
[INFO] [root:00:23:59.319] 0.3150778s taken for [get] request for the URL https://api.openml.org/data/v1/download/18151910/Australian.arff
[INFO] [urllib3.poolmanager:00:23:59.608] Redirecting http://openml1.win.tue.nl/dataset40981/dataset_40981.pq -> https://openml1.win.tue.nl:443/dataset40981/dataset_40981.pq
[INFO] [urllib3.poolmanager:00:24:00.179] Redirecting http://openml1.win.tue.nl/dataset40981/dataset_40981.pq -> https://openml1.win.tue.nl:443/dataset40981/dataset_40981.pq
[INFO] [root:00:24:00.391] Starting [get] request for the URL https://api.openml.org/api_splits/get/168824/Task_168824_splits.arff
[INFO] [root:00:24:01.088] 0.6965258s taken for [get] request for the URL https://api.openml.org/api_splits/get/168824/Task_168824_splits.arff
[ERROR] [amlb.job:00:24:01.292] Job `local.test100f.test100f.Australian.13.constantpredictor` failed with error: OpenML task 168824 only accepts `fold` < 10.
Traceback (most recent call last):
  File "/repo/amlb/job.py", line 92, in start
    self._setup()
  File "/repo/amlb/benchmark.py", line 518, in setup
    self.load_data()
  File "/repo/amlb/benchmark.py", line 486, in load_data
    self._dataset = Benchmark.data_loader.load(DataSourceType.openml_task, task_id=self._task_def.openml_task_id, fold=self.fold)
  File "/repo/amlb/datasets/__init__.py", line 21, in load
    return self.openml_loader.load(*args, **kwargs)
  File "/repo/amlb/utils/process.py", line 710, in profiler
    return fn(*args, **kwargs)
  File "/repo/amlb/datasets/openml.py", line 52, in load
    raise ValueError("OpenML task {} only accepts `fold` < {}.".format(task_id, nfolds))
ValueError: OpenML task 168824 only accepts `fold` < 10.

How do I make automlbenchmark work for tasks with repeated cross-validation?

PGijsbers commented 1 year ago

It's currently not supported. I'm not 100% sure of all the steps involved, but the starting point would be to also extract the number of repeats from the split dimensions here, as in repeats, folds, _ = ... (openml-python docs), and use that in the data splitter. You can probably hack it so that a 10-repeated 10-fold task runs as if it were a 100-fold CV (though with overlap in splits) by adjusting only that file (e.g., encoding repeat and fold as dataset.fold // 10 and dataset.fold % 10, respectively).
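A minimal sketch of the flat-index encoding described above: treat an R-repeated K-fold task as a single (R * K)-fold benchmark and recover (repeat, fold) from the flat index. The function name is illustrative, not an actual automlbenchmark identifier.

```python
def decode_flat_fold(flat_fold: int, n_folds: int) -> tuple[int, int]:
    """Map a flat fold index back onto (repeat, fold).

    For a 10-repeated 10-fold task, flat indices 0..99 cover
    repeat 0 fold 0 through repeat 9 fold 9.
    """
    # divmod(x, k) returns (x // k, x % k), matching the
    # dataset.fold // 10 and dataset.fold % 10 encoding above.
    repeat, fold = divmod(flat_fold, n_folds)
    return repeat, fold

# Flat fold 13 of a 10-fold task is the 4th fold of the 2nd repeat:
print(decode_flat_fold(13, 10))  # -> (1, 3)
```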

For proper support I think you would need to be able to specify the repeat on invocation, and also take it into account when saving and processing the results.

Innixma commented 1 year ago

@PGijsbers Thanks! The hack seems pretty straightforward. Perhaps the hack is sufficient as the official implementation? I don't see a reason why we would need to treat a repeat any differently from a new fold. And if we want to figure out which repeat/fold a given flat fold corresponds to, we can reverse-engineer it by looking at the evaluation procedure of the task.

For example, if "5 times 2 fold Crossvalidation" is the task's evaluation procedure, then we know "fold 5" is the third repeat's second fold (with 0-indexing, this translates to "repeat 2, fold 1").
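The reverse-engineering step above could be sketched by parsing the procedure name; the "N times K fold ..." string format is an assumption based on the example in this thread, not a documented OpenML contract.

```python
import re

def decode_from_procedure(procedure: str, flat_fold: int) -> tuple[int, int]:
    """Recover (repeat, fold) from an estimation-procedure name
    like '5 times 2 fold Crossvalidation' (assumed format)."""
    m = re.match(r"(\d+) times (\d+) fold", procedure)
    if m is None:
        raise ValueError(f"unrecognized procedure: {procedure!r}")
    repeats, folds = int(m.group(1)), int(m.group(2))
    if not 0 <= flat_fold < repeats * folds:
        raise ValueError(f"flat fold {flat_fold} out of range for "
                         f"{repeats}x{folds} CV")
    return divmod(flat_fold, folds)

# The thread's example: flat fold 5 of 5-times 2-fold CV
# is repeat 2, fold 1 (0-indexed):
print(decode_from_procedure("5 times 2 fold Crossvalidation", 5))  # -> (2, 1)
```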

PGijsbers commented 1 year ago

Without that information, 100-fold CV would be indistinguishable from 10-repeated 10-fold CV (or 20-repeated 5-fold CV, or ...) unless you pull the additional metadata on the estimation procedure from OpenML. I don't really like that. It might be sufficient to just add the information to the result file(s) and use the hack otherwise.

If that complicates things too much, I would consider using the hack as a temporary solution on the way to full support.