hyperopt / hyperopt-sklearn

Hyper-parameter optimization for sklearn
hyperopt.github.io/hyperopt-sklearn

Rewrite for sklearn 1.0.0 #183

Closed · mandjevant closed this 2 years ago

mandjevant commented 2 years ago

Here is my rewrite/update of hyperopt-sklearn.

Why: Since the sklearn 1.0.0 update, hyperopt-sklearn no longer works. Fixing this was the original intent behind my edits, hence the name of my branch. While diving into the code, I found more things that should be added and could be optimized.

Additions and changes:

Suggested future edits:


I'd love some reviews or feedback.

B0Gec commented 2 years ago

This seems like a sound idea and something I was missing!

By the way, I was wondering: what is your opinion on switching the versioning style to a timestamp format, e.g. 2022.02.17 instead of 1.0.0?

Other items on my wish-list are sklearn.dummy.DummyRegressor and the following feature.

A nice feature to have would be an optional parameter like scikit_defaults: bool = False, which would set unspecified optional parameters to their scikit-learn defaults, thus narrowing the search space instead of assigning these parameters default hp spaces (a larger search space). So

random_forest_regression(scikit_defaults=True, min_samples_split=hp.space(...))

would return:

    ...
    17  n_min_samples_split =
    18    hp.space(...)
    19  n_estimators =
    20    Literal{100}
    ...

instead of (currently, i.e. with the default scikit_defaults=False):

    ...
    17  n_min_samples_split =
    18    hp.space(...)
    19  n_estimators =
    20    hp.space(...)
    ...
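For illustration only, a hypothetical sketch of how such a flag could behave; random_forest_regression, scikit_defaults, and the parameter spaces below are made up for this example and are not the hyperopt-sklearn API:

```python
from hyperopt import hp


def random_forest_regression(name, scikit_defaults=False, **overrides):
    # Default hp search spaces, used when scikit_defaults is False.
    searched = {
        "n_estimators": hp.choice(name + ".n_estimators", [50, 100, 200, 500]),
        "min_samples_split": hp.choice(name + ".min_samples_split", [2, 5, 10]),
    }
    # Plain scikit-learn defaults, used when scikit_defaults is True.
    fixed = {"n_estimators": 100, "min_samples_split": 2}

    params = dict(fixed if scikit_defaults else searched)
    params.update(overrides)  # explicitly passed spaces always win
    return params


# With scikit_defaults=True only min_samples_split is searched;
# n_estimators stays at the sklearn default of 100.
space = random_forest_regression(
    "rfr",
    scikit_defaults=True,
    min_samples_split=hp.choice("rfr.min_samples_split", [2, 5, 10]),
)
```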

mandjevant commented 2 years ago

Thank you for your input @bostjangec

By the way, I was wondering: what is your opinion on switching the versioning style to a timestamp format, e.g. 2022.02.17 instead of 1.0.0?

I'm not a fan; release dates are always available on GitHub. It's better to give the version a simpler, more meaningful name.

Other items on my wish-list are sklearn.dummy.DummyRegressor

I have added DummyRegressor and DummyClassifier. https://github.com/mandjevant/hyperopt-sklearn/blob/40aa1d3286e0f42c1ee803119a249a624a4d2f14/hpsklearn/components/dummy.py#L17
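A rough usage sketch, assuming the new components are exposed from hpsklearn as dummy_classifier / dummy_regressor and take a name argument like the other components (adjust if the final API differs):

```python
from hpsklearn import HyperoptEstimator, dummy_classifier
from hyperopt import tpe
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# DummyClassifier makes a handy sanity-check baseline before searching
# over real models.
estim = HyperoptEstimator(
    classifier=dummy_classifier("my_dummy"),
    algo=tpe.suggest,
    max_evals=5,
    trial_timeout=60,
)
estim.fit(X_train, y_train)
print(estim.score(X_test, y_test))
```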

A nice feature to have would be an optional parameter like scikit_defaults: bool = False, which would set unspecified optional parameters to their scikit-learn defaults, thus narrowing the search space instead of assigning these parameters default hp spaces (a larger search space).

Interesting idea, though the parameters that scikit-learn provides defaults for are often the most relevant and have the most impact, and optimizing a very restricted search space often does not work well. If your use case requires this, you could create your own pyll.Apply objects (that only consider the sklearn defaults) and pass them into the search space; see the sketch below. Do keep in mind that the sklearn defaults are merely an indication; the right values for some parameters depend heavily on the input data.
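A minimal sketch of that workaround, assuming the component accepts keyword overrides the way the other hpsklearn components do (the name random_forest_regression is taken from the discussion above and may differ in the rewrite):

```python
from hpsklearn import HyperoptEstimator, random_forest_regression
from hyperopt import hp, tpe

# Fix n_estimators at the sklearn default and search only min_samples_split;
# every parameter left out keeps its default hp space.
reg = random_forest_regression(
    "my_rfr",
    n_estimators=100,  # fixed constant, i.e. the scikit-learn default
    min_samples_split=hp.choice("my_rfr.min_samples_split", [2, 5, 10]),
)

estim = HyperoptEstimator(regressor=reg, algo=tpe.suggest, max_evals=20)
```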

bjkomer commented 2 years ago

Thanks for your amazing work on this PR! There's a lot here, and I'm hoping to be able to give it a more thorough review soon.

When testing out the changes, I've run into errors when using it on Linux. This appears to stem from a difference in how multiprocessing is handled on Linux and Windows. On Linux a Pipe is a class called multiprocessing.connection.Connection, but on Windows it is multiprocessing.connection.PipeConnection. All of the type checks assume the second version, which leads to an AttributeError. I'm not sure what the best solution is, but looking into it a bit, it looks like they both share _ConnectionBase as a base class, and checking against that allows the tests to pass on Linux (from the question here).
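A minimal reproduction of the difference, showing an isinstance check against the shared private base class that works on both platforms (the Windows-only attribute is left as a comment):

```python
from multiprocessing import Pipe
from multiprocessing.connection import _ConnectionBase

parent_conn, child_conn = Pipe()

# Windows-only: multiprocessing.connection.PipeConnection does not exist on
# Linux, so referencing it there raises an AttributeError.
# isinstance(parent_conn, multiprocessing.connection.PipeConnection)

# Portable: both Connection (Linux) and PipeConnection (Windows) derive from
# the private base class _ConnectionBase.
assert isinstance(parent_conn, _ConnectionBase)
```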

mandjevant commented 2 years ago

Hey there, a quick update. I've added ubuntu-latest to the GitHub workflow file.

Unfortunately, I've run into a weird error that I cannot reproduce locally: the test run on Ubuntu pauses after the first test and times out 6 hours later.

I've been trying to reproduce it but am giving up. In the coming week, I will rewrite the workflow file to run the tests in virtual environments or Docker containers instead of using tox.

bjoerm commented 2 years ago

Thanks all for your work.

After this is merged, could you please also cut a release to update the version on PyPI? The latest version there is from 2017. That would be awesome for those of us with (company security) restrictions on where packages can be installed from.

Sorry for being a bit off-topic with this wish. If you disagree, just ignore my comment so I don't derail the discussion here. Thanks!

mandjevant commented 2 years ago

Hey @bjoerm, thanks for your suggestion. I'll use this comment to respond to you and to update everyone on the progress. Once this PR is merged, I'll discuss with the other maintainers how to get it onto PyPI as well.

Regarding this PR, the issues with the automated tests not working on ubuntu-latest originated from the coverage module. That module has been downgraded, and now everything works like a charm. The final thing I want to do before re-requesting reviews is to ensure that estimator tests automatically rerun on the AllTrialsFailed exception; a rough sketch of the idea is below. It's an easy implementation, but I've been swamped. I hope to get to it this weekend. Once it is committed, I will re-request reviews.
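A rough sketch of that rerun idea, assuming the tests call fit on a HyperoptEstimator and that hyperopt raises hyperopt.exceptions.AllTrialsFailed when every trial errors out (the helper name is made up):

```python
from hyperopt.exceptions import AllTrialsFailed


def fit_with_retries(estimator, X, y, retries=3):
    """Re-run a flaky estimator test when all trials fail."""
    for attempt in range(retries):
        try:
            estimator.fit(X, y)
            return estimator
        except AllTrialsFailed:
            if attempt == retries - 1:
                raise  # give up after the final attempt
```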

mandjevant commented 2 years ago

Tests are passing on both Windows and Ubuntu.

@bjkomer Could I get a re-review?

bjkomer commented 2 years ago

This is great!! All tests pass locally for me as well, and basic examples are working fine. I'll try to find some time in the next little while to review all of the changes.

mandjevant commented 2 years ago

Thank you for the review @bjkomer. The suggested changes are implemented and tests are running as we speak. Is there anyone else you'd like a review from before this is merged?

bjkomer commented 2 years ago

Ideally we'd have more reviews for changes this large, but I don't see that happening. Feel free to merge whenever you are ready.

mandjevant commented 2 years ago

Ideally we'd have more reviews for changes this large, but I don't see that happening. Feel free to merge whenever you are ready.

Alright, I'll leave some time for additional reviews and merge on 23/04/2022. If new issues arise after merging, I will have time from 23/04 onward to fix them quickly.

Thanks a lot for your help in reviewing this PR.