openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

TunedRF fails on some large datasets under time constraints #351

Open PGijsbers opened 3 years ago

PGijsbers commented 3 years ago

Per @mfeurer in #337:

Alright, I can now pretty much reproduce the results, with 4 exceptions:

  • helena: the inner process is killed without an error message (the only output in the logs is KILLED)
  • dionis: same
  • airlines: runs over the time limit. As I'm only running 1h, it probably works in the 4h setting
  • covertype: same

We should confirm this is an issue of not being able to complete the evaluations in time. There's only so much we can do to fix it while keeping the baseline understandable, but we might e.g. use hold-out for evaluation on large datasets instead of 5-fold CV, and/or use the models trained during 5-fold CV directly instead of retraining at the end.
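To make the first option concrete, here is a minimal sketch of what a hold-out fallback on large datasets could look like. The row-count threshold, helper name, and estimator settings are all illustrative assumptions, not the benchmark's actual code:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

LARGE_DATASET_ROWS = 100_000  # hypothetical cutoff for switching to hold-out


def evaluate_candidate(X, y, max_features):
    model = RandomForestClassifier(n_estimators=100, max_features=max_features)
    if len(X) > LARGE_DATASET_ROWS:
        # Large data: one fit on a single hold-out split instead of five CV fits.
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, stratify=y)
        return model.fit(X_tr, y_tr).score(X_val, y_val)
    # Small data: keep the existing 5-fold CV estimate.
    return cross_val_score(model, X, y, cv=5).mean()
```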

sebhrusen commented 3 years ago

we might use e.g. hold-out for evaluation on large datasets instead of 5-fold CV

Sounds reasonable. We would probably also need to change the budget allocation in this case to ensure that the final model trained on the full dataset still has enough time to complete: from 85/15 to maybe 50/50? Or this could be estimated from the first RF model trained (or the 3rd, which uses all features and is therefore the slowest).
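As an illustration of the proposed split, a hedged sketch of a 50/50 budget allocation; the function, stopping rule, and parameters are hypothetical (one could instead project the refit time from the slowest candidate fit, e.g. the all-features one, as suggested above):

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def tune_then_refit(X, y, candidates, total_budget_s, tuning_share=0.5):
    # Hypothetical 50/50 split: spend at most `tuning_share` of the budget
    # searching over max_features, keeping the rest for the final fit on the
    # full training data (the discussion above implies roughly 85/15 today).
    start = time.monotonic()
    best, best_score = None, float("-inf")
    for max_features in candidates:
        if time.monotonic() - start > total_budget_s * tuning_share:
            break  # reserve the remaining budget for the final refit
        model = RandomForestClassifier(n_estimators=100, max_features=max_features)
        score = cross_val_score(model, X, y, cv=5).mean()
        if score > best_score:
            best, best_score = max_features, score
    # Assumes at least one candidate was evaluated before the budget ran out.
    return RandomForestClassifier(n_estimators=100, max_features=best).fit(X, y)
```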

and/or use models trained during 5-fold CV directly instead of retraining at the end

What do you mean here exactly? Use the 5 CV models of the best max_features, compute predictions on the test dataset for each of them, and apply some voting mechanism to obtain the final predictions?

PGijsbers commented 3 years ago

use the 5 CV models of the best max_features, compute predictions on the test dataset for each of them and apply some voting mechanism to obtain the final predictions?

Yes, using the average as the voting scheme. Table 2 in Caruana et al. (2006) suggests that this leads to better performance than a single model retrained on all data (MODSEL-BOTH-CV v. MODSEL-BOTH). It's nice because it doesn't require a refit, but it wouldn't help us if we do start using hold-out for large datasets.
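A minimal sketch of that soft-voting scheme, assuming NumPy inputs and that every fold sees all classes; the function name and settings are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def cv_ensemble_predict(X, y, X_test, max_features, n_splits=5):
    # Keep the fold models trained while evaluating the best max_features
    # and average their predicted probabilities on the test set, instead
    # of refitting a single model on all of the training data.
    probas = []
    for train_idx, _ in StratifiedKFold(n_splits=n_splits).split(X, y):
        model = RandomForestClassifier(n_estimators=100, max_features=max_features)
        model.fit(X[train_idx], y[train_idx])
        probas.append(model.predict_proba(X_test))
    return np.mean(probas, axis=0)  # soft-voting average over the folds
```

In the actual baseline these fold models would already exist from the CV evaluation, so the five fits shown here would come for free.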

PGijsbers commented 2 years ago

Just to note, this has been largely (though not entirely) fixed by https://github.com/openml/automlbenchmark/pull/441