rodrigo-arenas / Sklearn-genetic-opt

ML hyperparameters tuning and features selection, using evolutionary algorithms.
https://sklearn-genetic-opt.readthedocs.io
MIT License
289 stars, 73 forks

Continuous not working for hyperparameters with hard limits #68

Closed poroc300 closed 2 years ago

poroc300 commented 2 years ago

System information
OS Platform and Distribution: Windows 10
Sklearn-genetic-opt version: 0.6.0
Scikit-learn version: 0.24.1
Python version: 3.8

Describe the bug
When a Continuous parameter range is defined, the generated values do not stay within the specified range. This becomes evident with algorithms whose hyperparameters only accept values inside a fixed interval (e.g. between 0 and 1). The example below uses a RandomForestRegressor, whose min_weight_fraction_leaf parameter is limited to [0, 0.5].

To Reproduce

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Integer

#generate input 
X = np.random.normal(75, 10, (1000, 2))
y = np.random.normal(200, 20, 1000)
cv = KFold(n_splits=5, random_state=42, shuffle=True)

#parameters
params = {"max_depth": Integer(1, 10), 'min_weight_fraction_leaf': Continuous(0.45, 0.49)}

#genetic optimization
evolved = GASearchCV(estimator=RandomForestRegressor(n_estimators=1),
                     cv=cv,
                     population_size=30,
                     generations=40,
                     tournament_size=5,
                     elitism=True,
                     crossover_probability=0.85,
                     mutation_probability=0.15,
                     param_grid=params,
                     criteria='max',
                     scoring="neg_mean_absolute_error",
                     algorithm='eaMuPlusLambda',
                     error_score="raise",
                     n_jobs=-1,
                     verbose=True,
                     keep_top_k=10)
evolved.fit(X, y)

Expected behavior
The code above raises ValueError: min_weight_fraction_leaf must in [0, 0.5]. This means that Continuous(0.45, 0.49) generates values outside [0, 0.5], even though I specified that they should lie between 0.45 and 0.49. The problem also occurs with other algorithms, such as XGBRegressor with the subsample parameter (valid interval between 0 and 1). In that case I specified Continuous(0, 1) and got errors reporting subsample values of 1.2 or even higher.

Screenshots
Full error log for the analysis with RandomForestRegressor:

Traceback (most recent call last):
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py", line 431, in _process_worker
    r = call_item()
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 387, in fit
    trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 169, in _parallel_build_trees
    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 1252, in fit
    super().fit(
  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 285, in fit
    raise ValueError("min_weight_fraction_leaf must in [0, 0.5]")
ValueError: min_weight_fraction_leaf must in [0, 0.5]
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-1-855014dfa2b2>", line 34, in <module>
    evolved.fit(X, y)

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py", line 120, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn_genetic\genetic_search.py", line 455, in fit
    pop, log, n_gen = self._select_algorithm(

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn_genetic\genetic_search.py", line 547, in _select_algorithm
    pop, log, gen = eaMuPlusLambda(

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn_genetic\algorithms.py", line 265, in eaMuPlusLambda
    for ind, fit in zip(invalid_ind, fitnesses):

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn_genetic\genetic_search.py", line 377, in evaluate
    cv_results = cross_validate(

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)

  File "C:\Users\andre\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 250, in cross_validate
    results = parallel(

  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 1054, in __call__
    self.retrieve()

  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))

  File "C:\Users\andre\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)

  File "C:\Users\andre\anaconda3\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()

  File "C:\Users\andre\anaconda3\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception

ValueError: min_weight_fraction_leaf must in [0, 0.5]

rodrigo-arenas commented 2 years ago

Hi, thanks for the report! I've checked this, and it happens because scipy samples from a uniform distribution with limits [lower, lower + upper] instead of [lower, upper], which is something I missed. I'll fix this and release it in 0.6.1 this week.
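
For readers following along, here is a minimal sketch of the kind of mix-up described above. It is not the library's actual code. It assumes a (loc, scale) parameterization like the one scipy.stats.uniform uses, where samples fall in [loc, loc + scale]; the helper sample_uniform is hypothetical and uses only the standard library to imitate that convention.

```python
import random


def sample_uniform(loc, scale, rng):
    """Hypothetical helper: draw from a uniform distribution on [loc, loc + scale].

    This imitates a (loc, scale) parameterization; it is an illustration,
    not code taken from sklearn-genetic-opt or scipy.
    """
    return loc + scale * rng.random()


lower, upper = 0.45, 0.49
rng = random.Random(0)

# Passing the upper bound as the scale yields samples in [lower, lower + upper],
# i.e. [0.45, 0.94] here, which can exceed the 0.5 limit of min_weight_fraction_leaf.
buggy = [sample_uniform(lower, upper, rng) for _ in range(10_000)]

# Passing the width of the interval as the scale keeps samples in [lower, upper].
fixed = [sample_uniform(lower, upper - lower, rng) for _ in range(10_000)]

print("buggy range:", min(buggy), max(buggy))
print("fixed range:", min(fixed), max(fixed))
```

Under this assumption, Continuous(0.45, 0.49) would be sampled from roughly [0.45, 0.94], which is consistent with the ValueError in the report.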

poroc300 commented 2 years ago

Thank you for addressing this.