hyperopt / hyperopt-sklearn

Hyper-parameter optimization for sklearn
hyperopt.github.io/hyperopt-sklearn

Rewrite for sklearn 1.0.0 #183

Closed · mandjevant closed this 2 years ago

mandjevant commented 2 years ago

Here is my rewrite/update of hyperopt-sklearn.

Why: Since the sklearn 1.0.0 update, hyperopt-sklearn no longer works. Fixing this was the original intent behind my edits, hence the name of my branch. While diving into the code, I found more things that should be added and could be optimized.

Additions and changes:

Suggested future edits:


I'd love some reviews or feedback.

B0Gec commented 2 years ago

This seems like a sound idea and something I was missing!

By the way, I was wondering: what is your opinion on switching the versioning style to a timestamp format, e.g. 2022.02.17 instead of 1.0.0?

Other items on my wish-list are sklearn.dummy.DummyRegressor and the following feature.

A nice feature to have would be an optional parameter like scikit_defaults: bool = False, which would set unspecified optional parameters to their scikit-learn defaults, thus narrowing the search space instead of assigning these parameters default hp spaces (a larger search space). So

random_forest_regression(scikit_defaults=True, min_samples_split=hp.space(...))

would return:

    ...
    17  n_min_samples_split =
    18    hp.space(...)
    19  n_estimators =
    20    Literal{100}
    ...

instead of (currently, i.e. with the default scikit_defaults=False):

    ...
    17  n_min_samples_split =
    18    hp.space(...)
    19  n_estimators =
    20    hp.space(...)
    ...
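For illustration only, a hypothetical sketch of how such a flag could behave; random_forest_regression, scikit_defaults, and the parameter spaces below are made up for this example and are not the hyperopt-sklearn API:

```python
from hyperopt import hp


def random_forest_regression(name, scikit_defaults=False, **overrides):
    # Default hp search spaces, used when scikit_defaults is False.
    searched = {
        "n_estimators": hp.choice(name + ".n_estimators", [50, 100, 200, 500]),
        "min_samples_split": hp.choice(name + ".min_samples_split", [2, 5, 10]),
    }
    # Plain scikit-learn defaults, used when scikit_defaults is True.
    fixed = {"n_estimators": 100, "min_samples_split": 2}

    params = dict(fixed if scikit_defaults else searched)
    params.update(overrides)  # explicitly passed spaces always win
    return params


# With scikit_defaults=True only min_samples_split is searched;
# n_estimators stays at the sklearn default of 100.
space = random_forest_regression(
    "rfr",
    scikit_defaults=True,
    min_samples_split=hp.choice("rfr.min_samples_split", [2, 5, 10]),
)
```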

mandjevant commented 2 years ago

Thank you for your input @bostjangec

By the way, I was wondering: what is your opinion on switching the versioning style to a timestamp format, e.g. 2022.02.17 instead of 1.0.0?

I'm not a fan; release dates are always available on GitHub. It's better to give the version a simpler, more meaningful name.

Other items on my wish-list are sklearn.dummy.DummyRegressor

I have added DummyRegressor and DummyClassifier. https://github.com/mandjevant/hyperopt-sklearn/blob/40aa1d3286e0f42c1ee803119a249a624a4d2f14/hpsklearn/components/dummy.py#L17
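A rough usage sketch, assuming the new components are exposed from hpsklearn as dummy_classifier / dummy_regressor and take a name argument like the other components (adjust if the final API differs):

```python
from hpsklearn import HyperoptEstimator, dummy_classifier
from hyperopt import tpe
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# DummyClassifier makes a handy sanity-check baseline before searching
# over real models.
estim = HyperoptEstimator(
    classifier=dummy_classifier("my_dummy"),
    algo=tpe.suggest,
    max_evals=5,
    trial_timeout=60,
)
estim.fit(X_train, y_train)
print(estim.score(X_test, y_test))
```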

A nice feature to have would be an optional parameter like scikit_defaults: bool = False, which would set unspecified optional parameters to their scikit-learn defaults, thus narrowing the search space instead of assigning these parameters default hp spaces (a larger search space).

Interesting idea, though the parameters that scikit-learn provides defaults for are often the most relevant and have the most impact, and optimizing a very restricted search space often does not work well. If your use case requires this, you could create your own pyll.Apply objects (that only consider the sklearn defaults) and pass them into the search space; see the sketch below. Do keep in mind that the sklearn defaults are merely an indication; the right values for some parameters depend heavily on the input data.
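A minimal sketch of that workaround, assuming the component accepts keyword overrides the way the other hpsklearn components do (the name random_forest_regression is taken from the discussion above and may differ in the rewrite):

```python
from hpsklearn import HyperoptEstimator, random_forest_regression
from hyperopt import hp, tpe

# Fix n_estimators at the sklearn default and search only min_samples_split;
# every parameter left out keeps its default hp space.
reg = random_forest_regression(
    "my_rfr",
    n_estimators=100,  # fixed constant, i.e. the scikit-learn default
    min_samples_split=hp.choice("my_rfr.min_samples_split", [2, 5, 10]),
)

estim = HyperoptEstimator(regressor=reg, algo=tpe.suggest, max_evals=20)
```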

bjkomer commented 2 years ago

Thanks for your amazing work on this PR! There's a lot here, and I'm hoping to be able to give it a more thorough review soon.

When testing out the changes, I've run into errors when using it on Linux. This appears to stem from a difference in how multiprocessing is handled on Linux and Windows. On Linux a Pipe is a class called multiprocessing.connection.Connection, but on Windows it is multiprocessing.connection.PipeConnection. All of the type checks assume the second version, which leads to an AttributeError. I'm not sure what the best solution is, but looking into it a bit, it looks like they both share _ConnectionBase as a base class, and checking against that allows the tests to pass on Linux (from the question here).
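A minimal reproduction of the difference, showing an isinstance check against the shared private base class that works on both platforms (the Windows-only attribute is left as a comment):

```python
from multiprocessing import Pipe
from multiprocessing.connection import _ConnectionBase

parent_conn, child_conn = Pipe()

# Windows-only: multiprocessing.connection.PipeConnection does not exist on
# Linux, so referencing it there raises an AttributeError.
# isinstance(parent_conn, multiprocessing.connection.PipeConnection)

# Portable: both Connection (Linux) and PipeConnection (Windows) derive from
# the private base class _ConnectionBase.
assert isinstance(parent_conn, _ConnectionBase)
```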

mandjevant commented 2 years ago

Hey there, a quick update. I've added ubuntu-latest to the GitHub workflow file.

Unfortunately, I've run into a weird error that I cannot reproduce locally: the test run on Ubuntu pauses after the first test and times out 6 hours later.

I've been trying to reproduce it but am giving up. In the coming week, I will rewrite the workflow file to run the tests in virtual environments or Docker containers instead of using tox.

bjoerm commented 2 years ago

Thanks all for your work.

After this is merged, could you please also cut a release to update the version on PyPI? The latest version there is from 2017. That would be awesome for those of us with (company security) restrictions on where packages can be installed from.

Sorry for being a bit off-topic with this wish. If you disagree, just ignore my comment so I don't derail the discussion here. Thanks!

mandjevant commented 2 years ago

Hey @bjoerm, thanks for your suggestion. I'll use this comment to respond to you and to update everyone on the progress. Once this PR is merged, I'll discuss with the other maintainers how to get it onto PyPI as well.

Regarding this PR, the issues with the automated tests not working on ubuntu-latest originated from the coverage module. That module has been downgraded, and now everything works like a charm. The final thing I want to do before re-requesting reviews is to ensure that estimator tests automatically rerun on the AllTrialsFailed exception; a rough sketch of the idea is below. It's an easy implementation, but I've been swamped. I hope to get to it this weekend. Once it is committed, I will re-request reviews.
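A rough sketch of that rerun idea, assuming the tests call fit on a HyperoptEstimator and that hyperopt raises hyperopt.exceptions.AllTrialsFailed when every trial errors out (the helper name is made up):

```python
from hyperopt.exceptions import AllTrialsFailed


def fit_with_retries(estimator, X, y, retries=3):
    """Re-run a flaky estimator test when all trials fail."""
    for attempt in range(retries):
        try:
            estimator.fit(X, y)
            return estimator
        except AllTrialsFailed:
            if attempt == retries - 1:
                raise  # give up after the final attempt
```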

mandjevant commented 2 years ago

Tests are passing on both Windows and Ubuntu.

@bjkomer Could I get a re-review?

bjkomer commented 2 years ago

This is great!! All tests pass locally for me as well, and basic examples are working fine. I'll try to find some time in the next little while to review all of the changes.

mandjevant commented 2 years ago

Thank you for the review @bjkomer. The suggested changes are implemented and tests are running as we speak. Is there anyone else you'd like a review from before this is merged?

bjkomer commented 2 years ago

Ideally we'd have more reviews for changes this large, but I don't see that happening. Feel free to merge whenever you are ready.

mandjevant commented 2 years ago

Ideally we'd have more reviews for changes this large, but I don't see that happening. Feel free to merge whenever you are ready.

Alright, I'll leave some time for additional reviews and merge on 23/04/2022. If new issues arise after merging, I will have time from 23/04 onward to fix them quickly.

Thanks a lot for your help in reviewing this PR.