Closed iblasi closed 6 years ago
Hi @trevorstephens
Looking at the source code, I understand why the previously reported situation happens, and I have a recommendation to improve SymbolicTransformer, if you would consider it. To extract complex added-value features, SymbolicTransformer performs the following steps in genetic.py.
You sort by fitness on line 479:
hall_of_fame = fitness.argsort()[:self.hall_of_fame]
and then take the number of components to keep as _best_programs on line 500:
self._best_programs = [self._programs[-1][i] for i in hall_of_fame[components]]
The _best_programs are then used on line 1013, inside the transform method of class SymbolicTransformer, to execute those best programs.
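Condensed into a toy sketch (pure NumPy; fitness, programs, and the sizes are made-up stand-ins for gplearn's internals, not its actual code), the two steps look like this:

```python
import numpy as np

# Toy stand-ins for gplearn's internal state (illustrative only)
fitness = np.array([0.2, 0.9, 0.5, 0.7])   # one value per program
programs = ['p0', 'p1', 'p2', 'p3']        # final generation
hall_of_fame_size, n_components = 3, 2

# Line 479: ascending argsort, keep the first hall_of_fame entries.
# For a metric where greater is better, ascending order puts the
# lowest-fitness programs first.
hall_of_fame = fitness.argsort()[:hall_of_fame_size]   # [0, 2, 3]

# Line 500: pick n_components indices out of the hall of fame
components = list(range(n_components))
best_programs = [programs[i] for i in hall_of_fame[components]]
# ['p0', 'p2'] -- the two *worst* programs in this toy case
```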
To be honest, I checked all the code on GitHub and the calculations there seem correct. However, if I do the transform by hand I get the expected result, so I am leaving this here to report it and to help fix the code. I installed gplearn through pip and I have version 0.2, which may differ from the GitHub code; that could be why I cannot find the error there. With the following example you can see perfectly what I am talking about:
# Linear model - Original features
# (X, y, X_train, etc. come from the full dataset setup shown further below)
est_lin = linear_model.Lars()
est_lin.fit(X_train, y_train)
print 'Lars(Orig features): ', est_lin.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_lin.predict(X_test))
# Output: Lars(Orig features): 0.775865035322 764.451209695
# Linear model - Transformed features
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Orig code): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Orig code): 0.838022410672 600.687128977
# Linear model - Hand made transform fitness_
index_fitness = np.argsort([prog.fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_fitness[i]] for i in range(gp.n_components)]
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Hand made fitness_): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Hand made fitness_): 1.0 1.50051021564e-07
# Linear model - Hand made transform oob_fitness_
index_oob_fitness = np.argsort([prog.oob_fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_oob_fitness[i]] for i in range(gp.n_components)]
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Hand made oob_fitness_): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Hand made oob_fitness_): 1.0 3.90751682033e-07
# Linear model - "The feature"
newX = np.append(X, (np.min(X[:,:2],axis=1)*X[:,2]).reshape(-1,1), axis=1)
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - The Feature): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - The Feature): 1.0 7.2901684689e-13
So, it is clear that everything works except the selection of the _best_programs.
Also, it is strange that oob_fitness_ gives the same result as "The feature". Looking at the code, I think it might be due to the max_samples variable, but that is just a guess without any testing.
Hope it helps to improve the code.
Just out of curiosity, what does OOB mean? "Out of ..."? What is the difference between fitness_ and oob_fitness_, and what is the objective of each?
I ask because, using the Boston data from your documentation example and taking the first 300 rows as training, the results are the following (I used Ridge and Lars to be consistent with the documentation example and the scripts above):
Ridge(Orig features): -0.260236502156 5.00110179084
Ridge(trans - Orig code): 0.465892381009 3.04110672459
Ridge(trans - Hand made fitness_): 0.252378627452 3.77678891869
Ridge(trans - Hand made oob_fitness_): 0.399317288976 3.3718350818
Lars(Orig features): -0.350599260885 5.16027432523
Lars(trans - Orig code): -14.8537565657 14.5151593774
Lars(trans - Hand made fitness_): -8384.20328411 402.904365315
Lars(trans - Hand made oob_fitness_): -5.06798482764 10.2869394223
Curious values as in boston works better the original transform code from the doc. example. However, using the dataset I used in the example, the results are (with Ridge and Lars):
Ridge(Orig features): 0.69342345079 676.190448752
Ridge(trans - Orig code): 0.75514334089 606.258605416
Ridge(trans - Hand made fitness_): 0.9999999997 0.0176059710462
Ridge(trans - Hand made oob_fitness_): 0.999999999605 0.019754171254
Ridge(trans - The Feature): 1.0 1.09972716373e-05
Lars(Orig features): 0.558940611669 757.375881816
Lars(trans - Orig code): 0.288037192098 824.847940324
Lars(trans - Hand made fitness_): 1.0 5.88329385209e-13
Lars(trans - Hand made oob_fitness_): 1.0 7.58141638357e-10
Lars(trans - The Feature): 1.0 5.88329385209e-13
Very curious... but I have no idea what exactly happens with _best_programs.
Looking at other issues, I have seen that this problem was already reported in issue #42. I leave the solution here anyway for users who want to fix it in the pip-installed version.
Hi @iblasi , thank you greatly for your in depth description of the issue! :+1:
You appear to be correct. I am working on fixing #42 in #60 ... I believe I came across the same issue as you when debugging that problem. If you have the time, are you able to use the code from #60 to see if it generates the features as you would hope?
FWIW, most raised issues have related to the regressor, so this bug may well have been present but not found for a while.
OOB means "out of bag" as we use "bagging" to subsample the rows to evaluate.
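A minimal sketch of that bagging idea (illustrative only, not gplearn's actual implementation; the variable names are mine):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, max_samples = 100, 0.9

# In-bag rows: the random subsample each program is scored on (fitness_)
n_in_bag = int(max_samples * n_samples)
in_bag = rng.choice(n_samples, size=n_in_bag, replace=False)

# Out-of-bag rows: the held-out remainder, scored as oob_fitness_
oob = np.setdiff1d(np.arange(n_samples), in_bag)
# len(in_bag) == 90, len(oob) == 10, and the two sets never overlap
```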
@trevorstephens, I noticed some errors in my first example code; they have now been fixed, in case you want to run exactly that code to see the bug more clearly.
I have tested your code and it still does not work properly.
In my opinion, although I understand what you are doing in fit with the correlation coefficients, removing the worst programs to keep only the _best_programs, I verified that this is where the problem comes from.
I am not sure why you perform so many operations instead of just doing, as I show in my example:
index_fitness = np.argsort([prog.fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_fitness[i]] for i in range(gp.n_components)]
I mean, you have already measured fitness and you just want to sort by that score. So just take the n_components defined by the user and return the programs with the best fitness achieved.
You can still keep the if self._metric.greater_is_better: statement, dropping the [::-1] in the minimizing branch, so the fitness order is not flipped the wrong way.
I replaced the original code in gplearn.py (lines 480-505) with the following, and it works perfectly.
if isinstance(self, TransformerMixin):
    # Find the best individuals in the final generation
    fitness = np.array(fitness)
    if self._metric.greater_is_better:
        index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1]
    else:
        index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])
    self._best_programs = [self._programs[-1][index_fitness[i]] for i in range(self.n_components)]
return self
I may be doing something wrong, so please check it, but the code works and gives the expected results.
@trevorstephens I now realize what I think you were trying to do with the correlation matrix.
If I am right, you build a hall of fame and then take n_components programs that differ from each other, to avoid generating dependent, equal, or very similar features.
In that case you should change gplearn.py (line 504) to take the least correlated programs/equations:
worst = worst[np.argmin(np.sum(correlations[worst, :], 1))]
(note the change to argmin).
That also worked for me, although the results are not perfect with Lars. That is logical: I am then creating the least correlated (most distinct) features rather than the top-fitness features, which may be the exact target (like the example shown).
EDITED: this is easier to see with the example mentioned.
import numpy as np
from sklearn import linear_model, metrics
from gplearn.genetic import SymbolicTransformer

X = np.random.uniform(0, 100, size=(100, 3))
y = np.min(X[:, :2], axis=1) * X[:, 2]
index = 80
X_train, y_train = X[:index, :], y[:index]
X_test, y_test = X[index:, :], y[index:]

# Create added-value features
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log', 'abs', 'neg', 'inv', 'max', 'min']
gp = SymbolicTransformer(generations=20, population_size=2000,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0, n_jobs=3)
gp.fit(X_train, y_train)
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index, :], y_train)
print 'Lars: ', est_lin.score(newX[index:, :], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:, :])), est_lin.coef_

# Best of Hall of Fame
for i in range(gp.n_components):
    print 'fitness: ', gp._best_programs[i].fitness_, ' - OOB_fitness: ', gp._best_programs[i].oob_fitness_, ' - Equation:', str(gp._best_programs[i])
Using the current code from gplearn-fix-second-issue-from-42, the output is:
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 14.62 0.341830473377 13 0.980737768321 0.923335544581 22.97s
1 16.7 0.586896359115 10 0.999958560964 0.999807398718 41.50s
2 24.63 0.729392599276 23 1.0 1.0 54.02s
Lars(trans - Orig code): -0.428482827341 1681.25844662 [ 6.53825764e+02 -5.82911477e+00 -3.30141921e+02 -1.28659452e+01
6.70911585e-02 5.78386284e-04 3.09191951e+00 1.79057224e-01
-8.23886970e-01 6.99084916e+01 3.42641835e-01 -5.64189101e+01
-1.68442146e-01]
fitness: 0.957459388729 - OOB_fitness: 0.943425675292 - Equation: mul(sqrt(neg(mul(min(X1, X2), min(X0, X1)))), add(log(sqrt(log(inv(-0.520)))), sqrt(sub(mul(sqrt(X0), min(X0, X2)), div(mul(-0.673, -0.937), mul(X1, X2))))))
fitness: 0.966664740558 - OOB_fitness: 0.971917258901 - Equation: add(div(abs(X0), inv(X2)), neg(mul(mul(X1, X0), sub(X2, 0.904))))
fitness: 0.952392215033 - OOB_fitness: 0.937331147986 - Equation: neg(sub(mul(sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2))), sub(neg(sqrt(-0.758)), min(div(X2, -0.008), inv(X1)))), sub(min(add(min(X1, X2), mul(-0.902, X2)), abs(inv(X2))), neg(mul(min(X0, 0.416), div(0.577, 0.194))))))
fitness: 0.97197765133 - OOB_fitness: 0.998886554151 - Equation: mul(min(X1, X2), min(X0, X2))
fitness: 0.956167302787 - OOB_fitness: 0.895024815031 - Equation: mul(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(mul(X0, 0.424)))), mul(sub(min(sub(X1, 0.570), log(0.341)), neg(sqrt(X1))), mul(max(mul(X0, X2), sqrt(X0)), min(sub(X1, 0.570), log(0.341))))), abs(abs(log(max(neg(X2), add(X1, X1))))))
fitness: 0.970261237422 - OOB_fitness: 0.987509757459 - Equation: mul(add(div(X2, mul(add(X1, X0), mul(0.211, sub(-0.603, 0.299)))), 0.107), sub(max(X2, X2), mul(X1, X0)))
fitness: 0.978940700006 - OOB_fitness: 0.990502720054 - Equation: add(div(abs(X0), log(0.901)), neg(mul(min(X0, min(X1, X1)), sub(X2, 0.904))))
fitness: 0.988439517202 - OOB_fitness: 1.0 - Equation: mul(neg(X2), min(min(X0, X1), X2))
fitness: 0.985085078049 - OOB_fitness: 0.997438798978 - Equation: sub(add(sub(div(X2, 0.309), neg(X2)), mul(neg(X2), sqrt(X2))), mul(min(X1, X0), abs(sub(X2, 0.504))))
fitness: 0.977089527904 - OOB_fitness: 0.999715095651 - Equation: neg(sub(mul(min(min(X0, X1), add(X2, min(X2, X0))), sub(neg(sqrt(-0.758)), min(div(X2, -0.008), inv(X1)))), sub(min(add(min(X1, X2), mul(X0, X2)), abs(inv(X2))), neg(mul(min(X0, 0.416), div(0.577, 0.194))))))
The MAE is 1681.25844662 (it depends on the random_state used, but it is always high). So clearly it is not correct.
Using the argmin change mentioned above, worst = worst[np.argmin(np.sum(correlations[worst, :], 1))], you may get good features, or even "the feature" in this case.
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 14.62 0.348067860081 13 0.962599606892 0.978903441836 22.29s
1 17.73 0.58764177602 6 1.0 1.0 36.75s
Lars: 1.0 3.77298192689e-13 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
fitness: 0.887486179235 - OOB_fitness: 0.939257217581 - Equation: mul(min(mul(sqrt(max(0.070, X2)), abs(log(X1))), min(abs(min(X0, X0)), max(abs(X1), div(-0.969, 0.919)))), neg(neg(X2)))
fitness: 0.896724908859 - OOB_fitness: 0.774177697706 - Equation: mul(sqrt(neg(add(min(X1, X0), neg(sqrt(X0))))), add(log(sqrt(log(inv(-0.520)))), sqrt(sub(mul(sqrt(X0), min(X0, X2)), div(mul(-0.673, -0.937), mul(X1, X2))))))
fitness: 0.90185742783 - OOB_fitness: 0.913124629422 - Equation: mul(min(mul(sqrt(max(0.070, X2)), abs(log(X1))), min(abs(min(X0, X0)), max(abs(X1), div(-0.969, 0.919)))), abs(neg(abs(-0.805))))
fitness: 0.905559461075 - OOB_fitness: 0.94344057166 - Equation: neg(mul(inv(add(inv(min(min(X0, X0), min(X2, X0))), div(mul(inv(X1), inv(X1)), inv(log(X2))))), X2))
fitness: 0.891892597392 - OOB_fitness: 0.741803381846 - Equation: div(min(X2, add(sub(neg(sub(0.096, abs(-0.575))), min(min(X2, X1), X2)), min(log(max(X1, X2)), abs(sqrt(X1))))), max(div(min(log(sqrt(X0)), X1), min(sub(sub(X2, X2), add(X0, X0)), max(div(X0, X0), abs(X2)))), min(min(-0.779, X0), sub(X0, div(min(X0, 0.657), neg(X1))))))
fitness: 0.916011339582 - OOB_fitness: 0.828693451632 - Equation: inv(add(inv(min(min(X0, X0), min(X2, X0))), div(mul(inv(X1), inv(X1)), inv(log(X2)))))
fitness: 0.928574574624 - OOB_fitness: 0.70923178817 - Equation: div(min(X2, X0), inv(X1))
fitness: 0.925215746576 - OOB_fitness: 0.813373952474 - Equation: mul(sqrt(div(X0, div(-0.100, X1))), mul(log(sqrt(X2)), min(abs(X1), div(X2, -0.385))))
fitness: 0.944269300313 - OOB_fitness: 0.761873869578 - Equation: mul(neg(mul(div(X2, -0.678), neg(X1))), div(inv(sqrt(X1)), abs(inv(X0))))
fitness: 0.997 - OOB_fitness: 1.0 - Equation: div(X2, inv(min(X0, X1)))
The MAE is 3.77298192689e-13, which is perfect. But you may end up taking other features, not the highest-fitness ones. If you run this option several times, it sometimes does not converge and the MAE is not ~zero.
Using my proposed index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1], most of the results are the same equation, so they are dependent/similar/identical as mentioned. The MAE is 4.13535872212e-13, let's say perfect, as expected. But that is only expected because the created feature is exactly the target equation. If we are looking for different added-value features, using correlation coefficients seems to make more sense.
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 14.62 0.358673286081 15 0.977149386266 0.923139960675 22.40s
1 15.4 0.626041566983 11 0.999943066381 0.999969695391 36.42s
2 23.91 0.737708119156 23 1.0 1.0 45.02s
Lars: 1.0 4.13535872212e-13 [ 0. 0. 0. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
fitness: 0.997 - OOB_fitness: 1.0 - Equation: mul(min(X1, X0), neg(X2))
fitness: 0.994449946246 - OOB_fitness: 0.999899881163 - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness: 0.994449639861 - OOB_fitness: 0.999905066915 - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness: 0.9944454337 - OOB_fitness: 0.999955978128 - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness: 0.9944379586 - OOB_fitness: 0.999983214347 - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness: 0.994433378141 - OOB_fitness: 0.999997279781 - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness: 0.994243226944 - OOB_fitness: 0.999897627664 - Equation: sub(X0, mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness: 0.993668918502 - OOB_fitness: 0.999729768531 - Equation: sub(X2, mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness: 0.993599015477 - OOB_fitness: 0.999705898121 - Equation: sub(log(X0), mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness: 0.993499415109 - OOB_fitness: 0.999999998 - Equation: sub(min(abs(inv(neg(X1))), X2), mul(min(X1, X0), neg(X2)))
In summary, I am not sure that using correlation coefficients is the best choice, as you may miss the highest-fitness features.
So you may be interested in a simple but sparse selection of the hall-of-fame _programs, such as:
index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1]
self._best_programs = [self._programs[-1][index_fitness[i]] for i in np.arange(0,self.hall_of_fame,1.*self.hall_of_fame/self.n_components).astype(np.int)]
which makes a linear selection over the hall_of_fame, always including the single highest-fitness result. Or something more elaborate, such as logarithmic spacing to give more weight to the highest fitness, or ...
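To illustrate the indexing (a standalone toy; plain int replaces the deprecated np.int), with hall_of_fame=100 and n_components=10 the linear selection picks these positions:

```python
import numpy as np

hall_of_fame, n_components = 100, 10

# Evenly spaced positions over the fitness-sorted hall of fame;
# position 0 (the single best program) is always included
picks = np.arange(0, hall_of_fame, 1. * hall_of_fame / n_components).astype(int)
# picks -> [ 0 10 20 30 40 50 60 70 80 90]
```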
Thanks again @iblasi , you are correct in that the algorithm tries to pick out the least correlated features of the hall_of_fame so that they can then be used in another model without as much collinearity in the features. I do see your point in terms of potential removal of key programs in the group. In order to maintain the idea of removal of correlated programs, but maintain the best programs where possible, I think that my latest change might tick all boxes.
Currently the code checks for the most correlated pair of programs from the hall of fame, and then removes the one that is also most correlated with all other programs left in the group. This could easily remove the top programs from the field as they are also most likely to have many other correlated "clones" in the generation.
Instead I now propose to find the most correlated pair as before, but then remove the one with the worst fitness of the two -- and then iterate until the group is reduced to the required number of components. Take a look at the current version of #60 and let me know what you think.
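A minimal sketch of that proposal (my own paraphrase, not the actual #60 code; outputs and fitness are hypothetical arrays standing in for the hall-of-fame programs' predictions and scores, assuming greater is better):

```python
import numpy as np

def reduce_by_correlation(outputs, fitness, n_components):
    """From the most correlated pair of programs, drop the one with the
    worse fitness; repeat until n_components programs remain."""
    keep = list(range(len(outputs)))
    while len(keep) > n_components:
        # Absolute correlations between the remaining programs' outputs
        corr = np.abs(np.corrcoef(outputs[keep]))
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        # Remove the member of the pair with the worse fitness
        worse = i if fitness[keep[i]] < fitness[keep[j]] else j
        keep.pop(worse)
    return keep

# Programs 0 and 1 are perfectly correlated; the worse one (1) is dropped
outputs = np.array([[1., 2., 3., 4.],
                    [2., 4., 6., 8.],
                    [1., 1., 2., 1.]])
fitness = np.array([0.9, 0.8, 0.5])
# reduce_by_correlation(outputs, fitness, 2) -> [0, 2]
```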
Really appreciate your taking the time to dig into this one! :+1:
@trevorstephens just perfect. Good job!
Just one comment that does not change the final result, and may not improve speed much: when you use unravel_index, you obtain the exact (row, col) index in the correlation matrix, whose rows/cols follow the hall_of_fame order in which fitness values were already sorted.
So, since the correlation matrix rows/cols are ordered by fitness, the larger index returned by unravel_index belongs to the worse program of the pair, which is the one to delete. I mean, you can write it in the following way:
[...]
while len(components) > self.n_components:
    most_correlated = np.unravel_index(np.argmax(correlations),
                                       correlations.shape)
    if self._metric.greater_is_better:
        worst = max(most_correlated)
    else:
        worst = min(most_correlated)
    components.pop(worst)
    indices.remove(worst)
    correlations = correlations[:, indices][indices, :]
    indices = list(range(len(components)))
self._best_programs = [self._programs[-1][i] for i in
                       hall_of_fame[components]]
return self
But as I said, it does not change the result, since it performs the same operations. Thank you for your good point of view and for finding a good solution.
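A tiny demo of that observation (a made-up 3x3 correlation matrix, assumed to be ordered best-fitness-first along rows and columns):

```python
import numpy as np

# Toy correlation matrix, rows/cols sorted best-fitness-first;
# diagonal zeroed so a program is never paired with itself
corr = np.array([[0.0, 0.3, 0.9],
                 [0.3, 0.0, 0.2],
                 [0.9, 0.2, 0.0]])

pair = np.unravel_index(np.argmax(corr), corr.shape)  # (0, 2)
# The larger index of the pair is the lower-fitness member
worst = max(pair)  # 2
```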
That's a great observation @iblasi, it simplifies the code a fair bit. And since the greater_is_better == False case already has the sequence reversed, it doesn't even need that conditional. Cheers :beers:
Hi @trevorstephens ,
I am not sure whether this is a bug or the documentation is misleading with regard to SymbolicTransformer. I have put together a showcase of how SymbolicRegressor works and correctly predicts the equation that represents the dataset, while SymbolicTransformer does not behave the same way.
Starting with SymbolicRegressor, I built an "easy" dataset to check whether SymbolicRegressor gives me the correct result and good metrics.
This example gives a perfect result, and the MAE metric is ~perfect, as the output shows.
However, with SymbolicTransformer, although the training works well, the transform does not. See the same example as before, but with SymbolicTransformer:
I use Lars from sklearn to avoid Ridge's sparse weights and to find the best solution quickly for this easy, exact example. As can be seen in the results of this code (below), the features generated by transform seem to be wrong, even though the fitness becomes perfect during fit. The problem does not come from Lars: the last Lars example shows that adding "the feature" (which is the target) gives perfect accuracy.
So I decided to inspect the features created during fit, and some of them are perfect. However, transform does not seem to use them correctly in the gp_features it creates. Is this a bug? I am doing the same thing as explained in the SymbolicTransformer example.