openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Benchmark Update: Regression and more Classification! #187

Open PGijsbers opened 3 years ago

PGijsbers commented 3 years ago

We plan to do a new evaluation soon and want to share information about what we have been working on. The new evaluation will be included in a journal paper submission and available through OpenML.

The first big change is the addition of regression tasks to the benchmark. We curated a list of regression tasks that we think both represent real-world problems and are (in principle) compatible with all frameworks. The list is available as OpenML suite 269 (https://www.openml.org/s/269). We will likely use the root mean squared error to evaluate regression performance (but we encourage authors to check other common regression metrics too).
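
For reference, a minimal sketch of how one could list the regression suite and score predictions with RMSE; this assumes the openml Python package, which is not prescribed by the benchmark itself, and the rmse helper is purely illustrative:

import openml

# Fetch the regression suite (https://www.openml.org/s/269) and count its tasks.
suite = openml.study.get_suite(269)
print(len(suite.tasks), "regression tasks in suite 269")

def rmse(y_true, y_pred):
    # Root mean squared error, the metric the post says will likely be used.
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

print(rmse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # example: ~0.645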

The second change is the addition of new datasets to the classification benchmark. We are currently working on the final few datasets and will keep you updated through this GitHub issue. The list of only the new datasets is available as suite 270 (https://www.openml.org/s/270), while the full list of all classification datasets for the new evaluation is available as suite 271 (https://www.openml.org/s/271).

In case you find problems with the datasets or tasks themselves, please also report them in this issue.

Thank you all for your contributions to the benchmark!

pplonski commented 3 years ago

Great! Thank you for the information.

Maybe a little off-topic: what about AutoML systems that need a parameter set in order to run as intended? My AutoML runs in Explain mode by default, but for the benchmark it should be run in Compete mode. Should I update the default before the new benchmark evaluation?

PGijsbers commented 3 years ago

The default framework definition for MLJar in the benchmark already sets this parameter:

mljarsupervised:
  version: '0.6.0'
  project: https://github.com/mljar/mljar-supervised
  params:
    mode: "Compete"   # set mode for Compete, default mode is Explain

This is the desired way to set non-default hyperparameters, so it should be good to go 👍
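
For context, that mode parameter maps onto mljar-supervised's own Python API roughly as in the sketch below (constructor details may differ between versions; the fit/predict calls are only illustrative):

from supervised.automl import AutoML

# "Compete" trades explainability for predictive performance;
# the package default is mode="Explain".
automl = AutoML(mode="Compete")
# automl.fit(X_train, y_train)
# predictions = automl.predict(X_test)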

pplonski commented 3 years ago

Thank you @PGijsbers. Is it allowed to set parameters for the benchmark, or should frameworks run without any additional parameters?

sebhrusen commented 3 years ago

@pplonski no, it is not allowed to set parameters in this file, as we'll be using it for the AutoML Benchmark paper. We may accept a single high-level parameter like this one, but we're also working on including measures of model complexity to evaluate the various tools.

Also, for frameworks like yours that use a non-default parameter, we'll create two entries:

PGijsbers commented 3 years ago

Yes, sorry about the confusion. I believe there was a consensus on allowing one "mode" hyperparameter, and the author can indicate what it should be set to for the evaluation. More hyperparameter configuration will not be allowed at this moment, as we want to encourage AutoML tools to stay "automated". But should you want to set additional hyperparameters for your own experiments, that would be the place to do it.

sebhrusen commented 3 years ago

@pplonski here is the follow-up PR to my previous comment: https://github.com/openml/automlbenchmark/pull/193. I expect you want us to use mljarsupervised_compete in the next full benchmark run.

pplonski commented 3 years ago

@sebhrusen thank you for the update!

Yes, mljarsupervised_compete should be used for the benchmark run.

What's more, I'm working on an updated mljar version, so I will open a PR before the benchmark run (at the beginning of December).

mfeurer commented 3 years ago

Hi everyone, I am currently setting up scripts to generate meta-data and think that there is a discrepancy in the benchmark suite IDs. I'm running the following code:

datasets_suite_218 = [
    1590, 1468, 1486, 1489, 23512, 23517, 4135, 40996, 41027, 40981, 40984, 40685, 
    1111, 1169, 41161, 41163, 41164, 41165, 41166, 41167, 41168, 41169, 41142, 
    41143, 41146, 41147, 41150, 41159, 41138, 1596, 54, 1461, 1464, 5, 12, 2, 3, 
    40668, 1067, 40975, 31 
]
datasets_suite_270 = [
    1515, 1457, 1475, 4541, 4534, 4538, 4134, 40978, 40982, 40983, 40701, 40670, 
    40900, 42732, 42733, 42734, 40498, 41162, 41144, 41145, 41156, 41157, 41158, 
    181, 188, 1494, 23, 1487, 1049
]
datasets_suite_271 = [
    1590, 1515, 1457, 1475, 1468, 1486, 1489, 23512, 23517, 4541, 4534, 4538, 4134, 
    4135, 40978, 40996, 41027, 40981, 40982, 40983, 40984, 40701, 40670, 40685, 40900, 
    1111, 42732, 42733, 42734, 40498, 41161, 41162, 41163, 41164, 41165, 41166, 41167, 
    41168, 41169, 41142, 41143, 41144, 41145, 41146, 41147, 41150, 41156, 41157, 41158, 
    41159, 41138, 54, 181, 188, 1461, 1494, 1464, 12, 23, 3, 1487, 40668, 1067, 1049, 
    40975, 31 
]
print(len(datasets_suite_218), len(datasets_suite_270), len(datasets_suite_271))

which should (to the best of my knowledge) print 38, X, 38 + Y; however, it prints 41, 29, 66. I know that suite 218 contains three wrong datasets, but even if I subtract those the numbers don't add up. Would it be possible for you to paste the actual IDs in this issue?

EDIT/UPDATE:

I copied the above dataset IDs from the OpenML website. To make this issue easily reproducible, I also created this snippet:

import openml
print(
    len(openml.study.get_suite(218).tasks), 
    len(openml.study.get_suite(270).tasks), 
    len(openml.study.get_suite(271).tasks)
)

which gives 42 29 66, yet another different set of numbers.
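
To get the actual dataset IDs behind each suite, rather than copying them from the website, one could resolve every task to its dataset, for example with a sketch like the following (assuming the openml Python package and network access):

import openml

def suite_dataset_ids(suite_id):
    # Map each task in the suite to the ID of its underlying dataset.
    suite = openml.study.get_suite(suite_id)
    return sorted(
        openml.tasks.get_task(tid, download_data=False).dataset_id
        for tid in suite.tasks
    )

for sid in (218, 270, 271):
    ids = suite_dataset_ids(sid)
    print(f"suite {sid}: {len(ids)} tasks -> dataset IDs {ids}")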

Innixma commented 3 years ago

Happy to see that regression is being added to the benchmark!

For AutoGluon, please use the presets='best_quality' parameter as the mode option.

Also, I'm not sure if this is still the case, but I remember that the EC2 instances were configured with a small hard disk by default. AutoGluon requires more disk space than most AutoML frameworks, so I think it would be good to give the machines at least 1 TB (disk is cheap compared to compute). Setting the EC2 volume_type: gp2 in config.yaml will use an SSD instead of a hard disk, which is more practical for AutoGluon, as it does not keep all models in memory but rather loads them from disk for its predictions.

Regarding the versions of the frameworks used: the latest stable release of AutoGluon is v0.0.15, although our mainline is two months ahead of v0.0.15 in development and includes various improvements. Is it expected that frameworks will be evaluated at their stable release, or is it up to the framework authors to update the framework code in AutoMLBenchmark to represent their preferred version for the benchmark? If a pre-release/cutting-edge version is encouraged, I would like to submit a PR to update the AutoGluon framework.

PGijsbers commented 3 years ago

@mfeurer Thanks to your report we found that airlines and covertype were incorrectly missing from the 271 study. We also updated openml/s/218 to remove the extra tasks that were not used in the benchmark paper (1, 2 and 5).

import openml

# Compare the suites by (lowercased) dataset name rather than by task ID.
def task_to_dataset_name(tid):
    task = openml.tasks.get_task(tid, download_data=False)
    dataset = openml.datasets.get_dataset(task.dataset_id, download_data=False)
    return dataset.name.lower()

original_datasets = [task_to_dataset_name(tid) for tid in openml.study.get_suite(218).tasks]
new_datasets = [task_to_dataset_name(tid) for tid in openml.study.get_suite(270).tasks]
all_datasets = [task_to_dataset_name(tid) for tid in openml.study.get_suite(271).tasks]

print(len(original_datasets), 'old datasets +', len(new_datasets), 'new datasets totals', len(all_datasets), 'datasets.')
print(set(all_datasets) == (set(original_datasets) | set(new_datasets)))
>>> 39 old datasets + 33 new datasets totals 71 datasets.
>>> True

We replaced the old higgs dataset (in the 39) with a new one (in the 33), which explains why the total is one short of their sum: the comparison is by dataset name, so higgs is only counted once in the union. The new higgs is based on a larger subset of the data and will replace the old one. Note that between your comment and my response we added the following new classification tasks (to both /s/270 and /s/271):

@Innixma We are currently evaluating whether we can switch to instances with a 256 GB SSD. The version used in the benchmark should be publicly released (e.g. on PyPI), but it can be a development/pre-release version. We will not allow pinning versions to specific git commits. We encourage you to use a recent release (development, if not stable) and update the integration code accordingly, if you feel confident about the state of your system.

PGijsbers commented 3 years ago

For regression we have also replaced the Airline 1M rows task (359926) with the Airline 10M rows task (359929).

pplonski commented 3 years ago

What will be the RAM limit?

PGijsbers commented 3 years ago

We're planning on using the same instance type again, m5.2xlarge, which features 32 GiB of memory and 8 vCPUs.

mfeurer commented 3 years ago

FYI I posted an error related to the new task 360115 in issue #233 as it appears to be incompatible with the benchmarking framework.

mfeurer commented 3 years ago

> The version used in the benchmark should be publicly released (e.g. on PyPI), but it can be a development/pre-release version. We will not allow pinning versions to specific git commits. We encourage you to use a recent release (development, if not stable) and update the integration code accordingly, if you feel confident about the state of your system.

Could you please use the recent pre-release of Auto-sklearn for the benchmark? I wasn't able to find any guidelines on how to make that the default, but was only able to create a PR to support the version name, see #228.

PGijsbers commented 3 years ago

Thanks! We'll make sure to use that release. We will likely create a new frameworks file (e.g. frameworks_journal.yaml) to make it easier to come back to later.

pplonski commented 3 years ago

Hi @PGijsbers, what is the state of the benchmark? Have you started already? The latest mljar-supervised version, 0.7.15, should be ready for the run.

If there is anything in the automlbenchmark code that I can help with, I'm happy to do so.

PGijsbers commented 3 years ago

Thanks for the offer @pplonski, very kind 👍 We have not started yet, but would like to start soon. We're just back from the holidays, so we're going through our GitHub feed and making a list of what needs work.

huibinshen commented 3 years ago

Thanks for providing more datasets, this is very helpful.

I tried to get the tasks for study 216. When calling task = openml.tasks.get_task(13854, download_data=True), I received the following error:

openml.exceptions.OpenMLServerError: Unexpected server error when calling https://www.openml.org//api_splits/get/13854/Task_13854_splits.arff. Please contact the developers! Status code: 412

Is there some other way to get the dataset splits? I tried to get the dataset from its dataset ID, but that does not contain the split information.
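
For reference, the standard route to the splits is through the task object rather than the dataset (a sketch with the openml Python API, shown with the task ID from the error above; it will keep failing until the server-side split file is restored or the study points at a replacement task):

import openml

# Splits belong to the task, not the dataset.
task = openml.tasks.get_task(13854)
train_idx, test_idx = task.get_train_test_split_indices(fold=0, repeat=0)
print(len(train_idx), "train rows,", len(test_idx), "test rows")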

PGijsbers commented 3 years ago

That task split has recently been removed from the server. I'll update the study today with new tasks for the same datasets (the other QSAR task is also affected).

huibinshen commented 3 years ago

Thanks, the (only) other task that has the same problem is 14097.

PGijsbers commented 3 years ago

I posted a reply here; the issues you experienced should be resolved (if you use the new tasks).

huibinshen commented 3 years ago

It works now, thanks!