PGijsbers opened this issue 3 years ago
Great! Thank you for the information.
Maybe a little off-topic: what about running AutoML systems that need a parameter to be set? My AutoML is in Explain mode by default, but for the benchmark it should be run in Compete mode. Should I update the default before the new benchmark evaluation?
The default framework definition for MLJar in the benchmark currently sets the parameter. Definition:
```yaml
mljarsupervised:
  version: '0.6.0'
  project: https://github.com/mljar/mljar-supervised
  params:
    mode: "Compete"  # set mode for Compete, default mode is Explain
```
This is the desired way to set non-default hyperparameters, so it should be good to go 👍
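For context, a minimal sketch (not the benchmark's actual integration code) of what that mode value corresponds to in mljar-supervised itself:

```python
# Hedged sketch: constructing mljar-supervised's AutoML with the non-default mode.
# The benchmark's real integration code lives in the automlbenchmark repository.
from supervised.automl import AutoML

automl = AutoML(mode="Compete")  # the default mode is "Explain"
# automl.fit(X_train, y_train)         # X_train, y_train: your training data
# predictions = automl.predict(X_test)
```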
Thank you @PGijsbers. Is it allowed to set parameters for benchmarks, or should the framework run without any additional parameters?
@pplonski no, it is not allowed to set parameters in this file, as we'll be using it for the AutoML Benchmark paper. We will potentially accept only one high-level parameter like this one, but we're also working on including measures of model complexity to evaluate the various tools.
Also, for frameworks like yours that use a non-default param, we'll create 2 entries:
- mljarsupervised will use only default parameters.
- mljarsupervised_compete can also be added using the definition above.

Only one of those will be used for the official runs, so please tell us which one you want us to use, thanks.

Yes, sorry about the confusion. I believe there was a consensus on allowing one "mode" hyperparameter, and the author can indicate what it should be set to for the evaluation. More hyperparameter configuration will not be allowed at this moment, as we want to encourage AutoML tools to stay "automated". But should you want to set additional hyperparameters for your own experiments, that would be the place to do it.
@pplonski the follow-up PR after my previous comment: https://github.com/openml/automlbenchmark/pull/193
I expect you want us to use mljarsupervised_compete in the next full benchmark run.
@sebhrusen thank you for the update!
Yes, mljarsupervised_compete should be used for the benchmark run.
What's more, I'm working on an updated mljar version, so I will open a PR before the benchmark runs (at the beginning of December).
Hi all, I am currently setting up scripts to generate meta-data and think there is a discrepancy in the benchmark suite IDs. I'm running the following code:
```python
datasets_suite_218 = [
    1590, 1468, 1486, 1489, 23512, 23517, 4135, 40996, 41027, 40981, 40984, 40685,
    1111, 1169, 41161, 41163, 41164, 41165, 41166, 41167, 41168, 41169, 41142,
    41143, 41146, 41147, 41150, 41159, 41138, 1596, 54, 1461, 1464, 5, 12, 2, 3,
    40668, 1067, 40975, 31
]
datasets_suite_270 = [
    1515, 1457, 1475, 4541, 4534, 4538, 4134, 40978, 40982, 40983, 40701, 40670,
    40900, 42732, 42733, 42734, 40498, 41162, 41144, 41145, 41156, 41157, 41158,
    181, 188, 1494, 23, 1487, 1049
]
datasets_suite_271 = [
    1590, 1515, 1457, 1475, 1468, 1486, 1489, 23512, 23517, 4541, 4534, 4538, 4134,
    4135, 40978, 40996, 41027, 40981, 40982, 40983, 40984, 40701, 40670, 40685, 40900,
    1111, 42732, 42733, 42734, 40498, 41161, 41162, 41163, 41164, 41165, 41166, 41167,
    41168, 41169, 41142, 41143, 41144, 41145, 41146, 41147, 41150, 41156, 41157, 41158,
    41159, 41138, 54, 181, 188, 1461, 1494, 1464, 12, 23, 3, 1487, 40668, 1067, 1049,
    40975, 31
]
print(len(datasets_suite_218), len(datasets_suite_270), len(datasets_suite_271))
```
which should (to the best of my knowledge) print 38, X, 38 + Y; however, it prints 41, 29, 66. I know that suite 218 contains three wrong datasets, but even if I subtract those the numbers don't add up. Would it be possible for you to paste the actual IDs in this issue?
EDIT/UPDATE:
I copied the above dataset IDs from the OpenML website. To make this issue easily reproducible, I also created this snippet:
```python
import openml

print(
    len(openml.study.get_suite(218).tasks),
    len(openml.study.get_suite(270).tasks),
    len(openml.study.get_suite(271).tasks)
)
```
which gives 42 29 66, yet another different set of numbers.
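A possible way to pinpoint the mismatch (a sketch; it assumes the hand-copied lists above are dataset IDs, while suites contain task IDs, so each task is mapped back to its dataset ID first):

```python
# Sketch: compare the hand-copied dataset IDs above with suite 218 on the server.
# Each suite task is mapped to its dataset ID before comparing the two sets.
import openml

suite_218_dataset_ids = {
    openml.tasks.get_task(tid, download_data=False).dataset_id
    for tid in openml.study.get_suite(218).tasks
}
print("in the suite but not in my list:", suite_218_dataset_ids - set(datasets_suite_218))
print("in my list but not in the suite:", set(datasets_suite_218) - suite_218_dataset_ids)
```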
Happy to see that regression is being added to the benchmark!
For AutoGluon, please use the presets='best_quality' parameter as the mode option. Also, not sure if this is the case now, but I remember that the EC2 instances were configured to use a hard disk with a small amount of space by default. AutoGluon requires more disk space than most AutoML frameworks, so I think it would be good to provide at least 1 TB to the machines (disk is cheap compared to compute). Also, setting the EC2 volume_type: gp2 in config.yaml will use an SSD instead of a hard disk, which is more practical for AutoGluon, as it does not keep all models in memory but rather loads models from disk for its predictions.
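For reference, a minimal sketch of what the presets='best_quality' option corresponds to in AutoGluon's own API (shown with the newer TabularPredictor interface, which differs from v0.0.15; the file name and label column are placeholders, and the benchmark integration would presumably pass the preset through its framework params):

```python
# Hedged sketch using AutoGluon's newer TabularPredictor interface (the exact API
# in v0.0.15 differs); 'train.csv' and the 'class' label column are placeholders.
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('train.csv')
predictor = TabularPredictor(label='class').fit(
    train_data,
    presets='best_quality',  # the mode option requested above
    time_limit=3600,         # seconds; the benchmark sets its own budget
)
```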
Regarding the versions of the frameworks used, the latest stable release of AutoGluon is v0.0.15, although our mainline is 2 months ahead of v0.0.15 in development and includes various improvements. Is it expected that frameworks will be evaluated by their stable release or is it up to the framework authors to update the framework code in AutoMLBenchmark to represent their preferred version for the benchmark? If pre-release/cutting-edge is encouraged, I would like to submit a PR to update the AutoGluon framework.
@mfeurer Thanks to your report, we found that airlines and covertype had incorrectly been left out of the 271 study. We also updated openml/s/218 to remove the extra tasks that were not used in the benchmark paper (1, 2 and 5).
```python
import openml


def task_to_dataset_name(tid):
    task = openml.tasks.get_task(tid, download_data=False)
    dataset = openml.datasets.get_dataset(task.dataset_id, download_data=False)
    return dataset.name.lower()


original_datasets = [task_to_dataset_name(tid) for tid in openml.study.get_suite(218).tasks]
new_datasets = [task_to_dataset_name(tid) for tid in openml.study.get_suite(270).tasks]
all_datasets = [task_to_dataset_name(tid) for tid in openml.study.get_suite(271).tasks]

print(len(original_datasets), 'old datasets +', len(new_datasets), 'new datasets totals', len(all_datasets), 'datasets.')
print(set(all_datasets) == (set(original_datasets) | set(new_datasets)))
```
>>> 39 old datasets + 33 new datasets totals 71 datasets.
>>> True
We replaced the old higgs dataset (in the 39) with a new one (in the 33), which explains why the total is one short of their sum: both tasks map to the same dataset name, so the union counts higgs only once. The new higgs is based on a larger subset of the data and will replace the old one. Note that between your comment and my response we added the following new classification tasks (to both /s/270 and /s/271):
@Innixma We are currently evaluating whether we can switch to 256 GB SSD instances. The version used in the benchmark should be publicly released (e.g. on PyPI), but it can be a development/pre-release version. We will not allow fixing versions to specific git commits. We encourage you to use a recent release (development, if not stable) and update the integration code accordingly, if you feel confident about the state of your system.
For regression we have also replaced the Airline 1M rows task (359926) with the Airline 10M rows task (359929).
What will be the RAM limit?
We're planning on using the same instance type again, m5.2xlarge, which features 32 GiB of memory and 8 vCPUs.
FYI I posted an error related to the new task 360115 in issue #233 as it appears to be incompatible with the benchmarking framework.
> The version used in the benchmark should be publicly released (e.g. on PyPI), but it can be a development/pre-release version. We will not allow fixing versions to specific git commits. We encourage you to use a recent release (development, if not stable) and update the integration code accordingly, if you feel confident about the state of your system.
Could you please use the recent pre-release of Auto-sklearn for the benchmark? I wasn't able to find any guidelines on how to make that the default, but was only able to create a PR to support the version name, see #228.
Thanks! We'll make sure to use that release; we will likely create a new frameworks file (e.g. frameworks_journal.yaml) to make it easier to come back to later.
Hi @PGijsbers, what is the state of the benchmark? Have you started already? mljar-supervised in its latest version, 0.7.15, should be ready for the run.
If there is anything I can help with in the automlbenchmark code, I offer my help.
Thanks for the offer @pplonski, very kind 👍 We have not started yet, but would like to start soon. We're just back from celebrating the holidays, so we're going through our GitHub feed and making a list for ourselves of what needs work.
Thanks for providing more datasets, this is very helpful.
I am trying to get the tasks for study 216. When calling
```python
task = openml.tasks.get_task(13854, download_data=True)
```
I received the following error:
```
openml.exceptions.OpenMLServerError: Unexpected server error when calling https://www.openml.org//api_splits/get/13854/Task_13854_splits.arff. Please contact the developers! Status code: 412
```
Is there some other way to get the dataset splits? I tried to get the dataset by its dataset ID, but there is no split information.
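For context, a sketch of the usual way to retrieve a task's splits through openml-python; it works only while the splits are hosted on the server, which is currently not the case for task 13854 (the ID below is a placeholder):

```python
# Sketch: the standard route to a task's train/test splits in openml-python.
# Replace the placeholder ID with a task whose splits are still hosted.
import openml

task_id = 13854  # placeholder
task = openml.tasks.get_task(task_id)
train_idx, test_idx = task.get_train_test_split_indices(repeat=0, fold=0)
X, y = task.get_X_and_y()
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```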
That task's splits have recently been removed from the server. I'll update the study today with a new task for the same datasets (the other QSAR task is also affected).
Thanks, the (only) other task that has the same problem is 14097.
I posted a reply here; the issues you experienced should be resolved (if you use the new tasks).
It works now, thanks!
We plan to do a new evaluation soon and want to share information about what we have been working on. The new evaluation will be included in a journal paper submission and available through OpenML.
The first big change is the addition of regression tasks to the benchmark. We curated a list of regression tasks that we think both represent real-world problems and are (in principle) compatible with all frameworks. The list is available as OpenML suite 269 (https://www.openml.org/s/269). We will likely use the root mean squared error to evaluate regression performance (but we encourage authors to verify other common regression metrics too).
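As an illustration (a sketch; the benchmark's own scoring code may differ), the suite can be inspected and RMSE computed with scikit-learn as follows:

```python
# Hedged sketch: list the tasks in the regression suite (269) and show how RMSE
# could be computed with scikit-learn; dummy prediction values are used here.
import numpy as np
import openml
from sklearn.metrics import mean_squared_error

suite = openml.study.get_suite(269)
for tid in suite.tasks:
    task = openml.tasks.get_task(tid, download_data=False)
    dataset = openml.datasets.get_dataset(task.dataset_id, download_data=False)
    print(tid, dataset.name)

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.8, 4.6, 3.1])
print("RMSE:", mean_squared_error(y_true, y_pred, squared=False))
```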
The second change is the addition of new datasets to the classification benchmark. We are currently working on the final few datasets and will keep you updated through this GitHub issue. The new datasets alone are listed in suite 270 (https://www.openml.org/s/270), while the full list of all classification datasets for the new evaluation is available in suite 271 (https://www.openml.org/s/271).
In case you find problems with the datasets or tasks themselves, please also report them there.
Thank you all for your contributions to the benchmark!