SymbolicTransformer does not create added value features as expected

Hi @trevorstephens ,

I am not sure if this is a bug, or the documentation is not correct focused refered to SymbolicTransformer. I have done a show case of how SymbolicRegressor works and predicts well the equation that represents the dataset, while SymbolicTransformer does not work in the same way.

Starting with SymbolicRegressor, I have done a "easy" dataset to check if SymbolicRegressor give me the correct result and good metrics.

from gplearn.genetic import SymbolicRegressor
from sklearn import metrics
import pandas as pd
import numpy as np

# Load data
X = np.random.uniform(0,100,size=(100,3))
y = np.min(X[:,:2],axis=1)*X[:,2]

index = 80
X_train , y_train = X[:index,:], y[:index]
X_test , y_test = X[index:,:], y[index:]

function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'max', 'min', 'sin', 'cos', 'tan']

est_gp = SymbolicRegressor(population_size=5000,
                           generations=20, stopping_criteria=0.001,
                           function_set=function_set,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           n_jobs=1,
                           parsimony_coefficient=0.01, random_state=0)
est_gp.fit(X_train, y_train)

print 'Score: ', est_gp.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_gp.predict(X_test))
print est_gp._program

This example give us a perfect result and the MAE metrics is ~perfect as shows the output:

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    11.81    8396.89543051       10    25.3022470326     26.608049431     35.35s
   1    12.36    8904.35549713        8    20.0284767508    19.0994923956     37.34s
   2    13.74     37263.312834        8 7.82583874247e-14 2.13162820728e-14     36.67s
Score:  1.0 5.71986902287e-14
abs(div(neg(X2), inv(min(X0, X1))))

However, SymbolicTransformer although the training works well, the transform does not work well. See next same example to previous one but with SymbolicTransformer:

from gplearn.genetic import SymbolicRegressor,SymbolicTransformer
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics

X = np.random.uniform(0,100,size=(100,3))
y = np.min(X[:,:2],axis=1)*X[:,2]

index = 80
X_train , y_train = X[:index,:], y[:index]
X_test , y_test = X[index:,:], y[index:]

# Linear model - Original features
est_lin = linear_model.Lars()
est_lin.fit(X_train, y_train)
print 'Lars(orig): ', est_lin.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_lin.predict(X_test))

# Create added value features
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'max', 'min']

gp = SymbolicTransformer(generations=20, population_size=2000,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0, n_jobs=3)

gp.fit(X_train, y_train)
gp_features = gp.transform(X)

# Linear model - Transformed features
newX = np.hstack((X, gp_features))
print 'newX: ', np.shape(newX)
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))

# Linear model - "The" feature
newX = np.append(X, (np.min(X[:,:2],axis=1)*X[:,2]).reshape(-1,1), axis=1)
print 'newX: ', newX.shape
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))

I use Lars from sklearn for avoid Ridge sparse weights, and find the best solution fast for this easy and exact example. As it can be seen on the results of this code (below), the features that are generated with transform, although during the fit fitness become perfect, the added transformed features seem to be worng. The problem does not come from Lars, as last example of Lars shows that adding "the feature" which is the target, the accuracy is perfetc.

X:  (100, 3)
y:  (100,)
Lars(orig):  0.850145084161 518.34496409
    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.62   0.349810294784        6   0.954248106272   0.939129495332     16.04s
   1    16.01   0.601354215127        6              1.0              1.0     25.56s
newX:  (100, 13)
Lars(trans):  0.83552794823 497.438879508
newX:  (100, 4)
Lars(trans):  1.0 1.60411683936e-12

So I decided to see the fitted features created during the fit and some of them are perfect, however, the transform seems not to use them correctly on gp_features created

>>>print 'Eq. of new features: ', gp.__str__()
 mul(mul(neg(sqrt(min(neg(mul(mul(X1, X0), add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0), sub(X2, 0.904)))))), X2))), sqrt(max(X2, X2))), X1),
 div(min(div(abs(X0), log(0.901)), log(max(X2, -0.222))), X0),
 mul(sub(X1, X0), mul(X1, X0)),
 mul(X2, inv(X2)),
 mul(mul(neg(sqrt(min(X0, X2))), add(neg(X0), min(X0, X2))), X1),
 div(abs(mul(X0, X2)), inv(mul(mul(neg(sqrt(min(X0, X2))), mul(neg(X2), max(X1, X1))), X1))),
 div(abs(mul(X0, X2)), inv(mul(0.640, mul(X1, X0)))),
 div(abs(mul(X0, X2)), inv(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(mul(X0, 0.424)))), mul(sub(min(sub(-0.603, 0.299), sub(0.063, X1)), neg(min(X1, -0.125))), mul(max(mul(X0, X2), sqrt(X0)), min(sub(X1, 0.570), log(0.341))))))),
 mul(neg(mul(div(X2, -0.678), neg(X1))), div(sqrt(max(X2, X2)), min(X1, X0)))]
>>>
>>>df = pd.DataFrame(columns=['Gen','OOB_fitness','Equation'])
>>>for idGen in range(len(gp._programs)):
>>>   for idPopulation in range(gp.population_size):
>>>      if(gp._programs[idGen][idPopulation] != None):
>>>         df = df.append({'fitness': value_fitness_, 'OOB_fitness': value_oobfitness_, 'Equation': str(gp._programs[-1][idPopulation])}, ignore_index=True)
>>>
>>>print 'Best of last Gen: '
>>>print df[df['Gen']==df['Gen'].max()].sort_values('OOB_fitness')
Best of last Gen: 
      Gen  OOB_fitness                                           Equation
1126  2.0     0.000000                            add(0.944, sub(X0, X0))
952   2.0     0.000000                      div(min(X2, X0), min(X2, X0))
1530  2.0     0.000000  min(inv(neg(abs(log(min(X1, 0.535))))), neg(su...
2146  2.0     0.000000  div(abs(mul(X0, X2)), inv(mul(mul(neg(sqrt(min...
2148  2.0     0.000000  div(min(add(-0.868, -0.285), X2), sqrt(sqrt(0....
2150  2.0     0.000000                                 sub(-0.603, 0.299)
2476  2.0     0.000000  min(min(max(X0, X2), add(-0.738, 0.612)), sqrt...
1601  2.0     0.000000                               neg(min(X1, -0.125))
1271  2.0     0.000000                                 add(-0.504, 0.058)
1742  2.0     0.000000  add(inv(log(abs(-0.575))), inv(log(abs(-0.575))))
733   2.0     0.000000                                        abs(-0.575)
1304  2.0     0.000000                                  abs(sqrt(-0.758))
1630  2.0     0.000000  div(abs(mul(X0, X2)), inv(mul(max(X2, X2), add...
652   2.0     0.000000                                         log(0.341)
1708  2.0     0.000000                                              0.904
2262  2.0     0.000000                                       sqrt(-0.715)
1338  2.0     0.000000                               mul(X2, sub(X1, X1))
826   2.0     0.000000  div(min(X2, add(sub(neg(sub(0.096, -0.886)), m...
1615  2.0     0.000000                             abs(add(0.640, 0.766))
2415  2.0     0.000000                                   log(abs(-0.575))
1670  2.0     0.000000                                     min(X0, 0.657)
1644  2.0     0.000000                               log(min(-0.524, X0))
2361  2.0     0.000000                                              0.944
785   2.0     0.000000  min(inv(log(abs(log(min(X1, 0.535))))), neg(mu...
2367  2.0     0.000000                                        abs(-0.911)
2249  2.0     0.000000                                              0.904
960   2.0     0.000000                                   inv(inv(-0.045))
955   2.0     0.000000                 div(add(X1, X2), inv(sub(X2, X2)))
2397  2.0     0.000000                                             -0.125
1878  2.0     0.000000  div(min(X2, add(sub(neg(sub(0.096, -0.886)), m...
...   ...          ...                                                ...
1103  2.0     0.997786        mul(X2, abs(sub(mul(X0, X1), add(X2, X0))))
2225  2.0     0.997790  mul(sub(min(log(div(X0, -0.717)), neg(sqrt(mul...
1890  2.0     0.998069  mul(sub(div(X2, 0.309), neg(X2)), sub(max(X2, ...
1704  2.0     0.998283  add(sub(log(min(add(0.769, X1), abs(X1))), sub...
1829  2.0     0.998284  add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0)...
700   2.0     0.998345  add(sub(log(min(add(0.769, X1), abs(X1))), sub...
1770  2.0     0.998638  mul(add(min(X0, min(X1, X1)), X2), sqrt(abs(ab...
2344  2.0     0.998692  div(min(X2, add(sub(neg(sub(0.096, abs(-0.575)...
985   2.0     0.998793  sub(min(mul(sub(min(sqrt(log(max(X1, X2))), ne...
1634  2.0     0.998815  add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0)...
1412  2.0     0.998945  mul(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(m...
855   2.0     0.998965  add(inv(log(abs(X1))), neg(mul(mul(X1, X0), su...
839   2.0     0.998996  add(inv(abs(add(min(X0, min(X1, X1)), X2))), n...
1528  2.0     0.999066  add(sub(log(min(add(0.769, X1), abs(X1))), sub...
690   2.0     0.999875  add(sub(log(min(add(0.769, X1), abs(X1))), sub...
2047  2.0     0.999895  sub(min(neg(X1), div(X1, X2)), sub(min(abs(X1)...
1951  2.0     0.999921  sub(min(min(X2, X0), X2), mul(min(X1, X0), neg...
1981  2.0     0.999954  mul(X2, neg(neg(min(add(0.448, X0), sub(X1, -0...
2349  2.0     0.999954   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2364  2.0     0.999960  add(inv(log(abs(-0.575))), mul(X2, neg(neg(min...
2487  2.0     0.999971   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2056  2.0     0.999975   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
1559  2.0     0.999976    mul(X2, neg(neg(min(add(0.448, X0), abs(X1)))))
975   2.0     0.999982   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2032  2.0     0.999992   sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
1288  2.0     1.000000  sub(min(div(-0.992, X2), X2), mul(min(X1, X0),...
2482  2.0     1.000000  sub(min(abs(inv(neg(X1))), X2), mul(min(X1, X0...
1776  2.0     1.000000  mul(min(mul(add(X0, X0), abs(log(X1))), min(ab...
2392  2.0     1.000000  mul(neg(X2), max(div(0.933, X0), min(X0, min(X...
1329  2.0     1.000000                          mul(min(X1, X0), neg(X2))

[2000 rows x 3 columns]

Is this a bug? I am doing the same thing as explained on SymbolicTransformer example

Hi @trevorstephens

Looking to all the source code I understand why happens the previously commented situation, and I have a recomendation to improve the SymbolicTransformer if you consider. To use SymbolicTransformer, you are doing the following steps on code genetic.py to extract complex added value features.

You sort by fitness on line 479 hall_of_fame = fitness.argsort()[:self.hall_of_fame] and then take the number of components to fix them as _best_programs on line 500 self._best_programs = [self._programs[-1][i] for i in hall_of_fame[components]]. The _best_programs are the used in line 1013 inside class SymbolicTransformer on method transform to execute those best programs.

Being honest, I checked all the code in github and it seems that the calculations are correct. However if I do it by hand I am able to fix the error, so I live here to report and to fix code. I installed gplearn through pip and I have version 0.2, but they may be different code and for that reason I am not able to find the erro in github. With the following example you can see perfectly what I am talking about

# Linear model - Original features
est_lin = linear_model.Lars()
est_lin.fit(X_train, y_train)
print 'Lars(Orig features): ', est_lin.score(X_test, y_test), metrics.mean_absolute_error(y_test, est_lin.predict(X_test))
# Output: Lars(Orig features):  0.775865035322 764.451209695

# Linear model - Transformed features
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Orig code): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Orig code):  0.838022410672 600.687128977

# Linear model - Hand made transform fitness_
index_fitness = np.argsort([prog.fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_fitness[i]] for i in range(gp.n_components)]
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Hand made fitness_): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Hand made fitness_):  1.0 1.50051021564e-07

# Linear model - Hand made transform oob_fitness_
index_oob_fitness = np.argsort([prog.oob_fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_oob_fitness[i]] for i in range(gp.n_components)]
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - Hand made fitness_): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - Hand made oob_fitness_):  1.0 3.90751682033e-07

# Linear model - "The feature"
newX = np.append(X, (np.min(X[:,:2],axis=1)*X[:,2]).reshape(-1,1), axis=1)
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print 'Lars(trans - The Feature): ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:]))
# Output: Lars(trans - The Feature):  1.0 7.2901684689e-13

So, it is clear that everything works except the selection of the _best_programs. Also, it is strange that OOB_fitness_ gives the same result that "The feature". Looking the code I think that it might be dur to max_samples variable, but it is just a guess without any testing.

Hope it helps to improve the code. Just for curiosity, what does OOB mean? "Out of ..."? Which is the difference between fitness_ and oob_fitness_ and the objective of them?

I ask it because using the boston data of your example (of the documentation), taking the first 300 as training, the resutls are the following ones (I used Ridge and Lars to be consistent with doc. example and scrips above):

Ridge(Orig features):  -0.260236502156 5.00110179084
Ridge(trans - Orig code):  0.465892381009 3.04110672459
Ridge(trans - Hand made fitness_):  0.252378627452 3.77678891869
Ridge(trans - Hand made oob_fitness_):  0.399317288976 3.3718350818

Lars(Orig features):  -0.350599260885 5.16027432523
Lars(trans - Orig code):  -14.8537565657 14.5151593774
Lars(trans - Hand made fitness_):  -8384.20328411 402.904365315
Lars(trans - Hand made oob_fitness_):  -5.06798482764 10.2869394223

Curious values as in boston works better the original transform code from the doc. example. However, using the dataset I used in the example, the results are (with Ridge and Lars):

Ridge(Orig features):  0.69342345079 676.190448752
Ridge(trans - Orig code): 0.75514334089 606.258605416
Ridge(trans - Hand made fitness_):  0.9999999997 0.0176059710462
Ridge(trans - Hand made oob_fitness_):  0.999999999605 0.019754171254
Ridge(trans - The Feature):  1.0 1.09972716373e-05

Lars(Orig features):  0.558940611669 757.375881816
Lars(trans - Orig code):  0.288037192098 824.847940324
Lars(trans - Hand made fitness_):  1.0 5.88329385209e-13
Lars(trans - Hand made oob_fitness_):  1.0 7.58141638357e-10
Lars(trans - The Feature):  1.0 5.88329385209e-13

Very curious... But no idea of what happens exactly with best_programs_

Looking to other issues, I have seen that this problem was reported by issue #42 I leave the solution any way for user that want to solve it with pip installed version.

Hi @iblasi , thank you greatly for your in depth description of the issue! :+1:

You appear to be correct. I am working on fixing #42 in #60 ... I believe I came across the same issue as you when debugging that problem. If you have the time, are you able to use the code from #60 to see if it generates the features as you would hope?

FWIW, most raised issues have related to the regressor, so this bug may well have been present but not found for a while.

OOB means "out of bag" as we use "bagging" to subsample the rows to evaluate.

@trevorstephens, I have seen that I have some errors on my first example code that they have been fixed in case you want to test exactly the code to see more clearly the bug.

I have tested your code and it still does not work properly. In my opinion, although I understand what you are doing on fitwith the correlation coefficients to remove worst to take only the _best_programs, I checked that there is where the problem comes. I am not sure why you make so many operations instead of doing, as I show in my example, just:

index_fitness = np.argsort([prog.fitness_ for prog in gp._programs[-1]])[::-1]
gp._best_programs = [gp._programs[-1][index_fitness[i]] for i in range(gp.n_components)]

I mean, you have already measured fitness and you just want to sort through that score value. So just take the n_componentsdefined by user and return the best fitness achieved.

You can still maintain the if self._metric.greater_is_better:statement removing [::-1]to avoid upside-down of fitness. I modified the original code on gplearn.py (lines 480-505) with the following one and it works perfectly.

        if isinstance(self, TransformerMixin):
            # Find the best individuals in the final generation
            fitness = np.array(fitness)
            if self._metric.greater_is_better:
              index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1]
            else:
              index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])

            self._best_programs = [self._programs[-1][index_fitness[i]] for i in range(self.n_components)]

        return self

I may be doing something wrong, so please check it, but the code works and gives the expected results.

@trevorstephens I already realized what I think you were trying to do with correlation matrix. If I am right, you were making a hall of fameand then ask your self to take n_components of all of them that they were different between each other to avoid dependent, equal or very similar features generated. In that case you should change in gplearn.py (line 504) to take the less correlated programs/equation worst = worst[np.argmin(np.sum(correlations[worst, :], 1))] (note the change to argmin).

That worked for me also, although the results are not perfect with LARS, but that's logical as I am creating features less correlated (most distiguished) and not the top fitness features that may be the exact features (like the example shown).

EDITED See better with example mentioned.

X = np.random.uniform(0,100,size=(100,3))
y = np.min(X[:,:2],axis=1)*X[:,2]
index = 80
X_train , y_train = X[:index,:], y[:index]
X_test , y_test = X[index:,:], y[index:]

# Create added value features
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log', 'abs', 'neg', 'inv', 'max', 'min']

gp = SymbolicTransformer(generations=20, population_size=2000,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0, n_jobs=3)

gp.fit(X_train, y_train)
gp_features = gp.transform(X)
newX = np.hstack((X, gp_features))
est_lin.fit(newX[:index,:], y_train)
print 'Lars: ', est_lin.score(newX[index:,:], y_test), metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:,:])), est_lin.coef_

# Best of Hall of Fame
for i in range(gp.n_components):
        print 'fitness: ', gp._best_programs[i].fitness_, ' - OOB_fitness: ',  gp._best_programs[i].oob_fitness_, ' - Equation:',  str(gp._best_programs[i])

Using actual code gplearn-fix-second-issue-from-42, the output:

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.62   0.341830473377       13   0.980737768321   0.923335544581     22.97s
   1     16.7   0.586896359115       10   0.999958560964   0.999807398718     41.50s
   2    24.63   0.729392599276       23              1.0              1.0     54.02s
Lars(trans - Orig code):  -0.428482827341 1681.25844662 [  6.53825764e+02  -5.82911477e+00  -3.30141921e+02  -1.28659452e+01
   6.70911585e-02   5.78386284e-04   3.09191951e+00   1.79057224e-01
  -8.23886970e-01   6.99084916e+01   3.42641835e-01  -5.64189101e+01
  -1.68442146e-01]
fitness:  0.957459388729  - OOB_fitness:  0.943425675292  - Equation: mul(sqrt(neg(mul(min(X1, X2), min(X0, X1)))), add(log(sqrt(log(inv(-0.520)))), sqrt(sub(mul(sqrt(X0), min(X0, X2)), div(mul(-0.673, -0.937), mul(X1, X2))))))
fitness:  0.966664740558  - OOB_fitness:  0.971917258901  - Equation: add(div(abs(X0), inv(X2)), neg(mul(mul(X1, X0), sub(X2, 0.904))))
fitness:  0.952392215033  - OOB_fitness:  0.937331147986  - Equation: neg(sub(mul(sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2))), sub(neg(sqrt(-0.758)), min(div(X2, -0.008), inv(X1)))), sub(min(add(min(X1, X2), mul(-0.902, X2)), abs(inv(X2))), neg(mul(min(X0, 0.416), div(0.577, 0.194))))))
fitness:  0.97197765133  - OOB_fitness:  0.998886554151  - Equation: mul(min(X1, X2), min(X0, X2))
fitness:  0.956167302787  - OOB_fitness:  0.895024815031  - Equation: mul(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(mul(X0, 0.424)))), mul(sub(min(sub(X1, 0.570), log(0.341)), neg(sqrt(X1))), mul(max(mul(X0, X2), sqrt(X0)), min(sub(X1, 0.570), log(0.341))))), abs(abs(log(max(neg(X2), add(X1, X1))))))
fitness:  0.970261237422  - OOB_fitness:  0.987509757459  - Equation: mul(add(div(X2, mul(add(X1, X0), mul(0.211, sub(-0.603, 0.299)))), 0.107), sub(max(X2, X2), mul(X1, X0)))
fitness:  0.978940700006  - OOB_fitness:  0.990502720054  - Equation: add(div(abs(X0), log(0.901)), neg(mul(min(X0, min(X1, X1)), sub(X2, 0.904))))
fitness:  0.988439517202  - OOB_fitness:  1.0  - Equation: mul(neg(X2), min(min(X0, X1), X2))
fitness:  0.985085078049  - OOB_fitness:  0.997438798978  - Equation: sub(add(sub(div(X2, 0.309), neg(X2)), mul(neg(X2), sqrt(X2))), mul(min(X1, X0), abs(sub(X2, 0.504))))
fitness:  0.977089527904  - OOB_fitness:  0.999715095651  - Equation: neg(sub(mul(min(min(X0, X1), add(X2, min(X2, X0))), sub(neg(sqrt(-0.758)), min(div(X2, -0.008), inv(X1)))), sub(min(add(min(X1, X2), mul(X0, X2)), abs(inv(X2))), neg(mul(min(X0, 0.416), div(0.577, 0.194))))))

MAE is 1681.25844662 (depends on random_state used but it is always high). So clearly it is not correct.

Using argmin mentioned as worst = worst[np.argmin(np.sum(correlations[worst, :], 1))] you may take good features or "the feature" in this case.

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.62   0.348067860081       13   0.962599606892   0.978903441836     22.29s
   1    17.73    0.58764177602        6              1.0              1.0     36.75s
Lars:  1.0 3.77298192689e-13 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
fitness:  0.887486179235  - OOB_fitness:  0.939257217581  - Equation: mul(min(mul(sqrt(max(0.070, X2)), abs(log(X1))), min(abs(min(X0, X0)), max(abs(X1), div(-0.969, 0.919)))), neg(neg(X2)))
fitness:  0.896724908859  - OOB_fitness:  0.774177697706  - Equation: mul(sqrt(neg(add(min(X1, X0), neg(sqrt(X0))))), add(log(sqrt(log(inv(-0.520)))), sqrt(sub(mul(sqrt(X0), min(X0, X2)), div(mul(-0.673, -0.937), mul(X1, X2))))))
fitness:  0.90185742783  - OOB_fitness:  0.913124629422  - Equation: mul(min(mul(sqrt(max(0.070, X2)), abs(log(X1))), min(abs(min(X0, X0)), max(abs(X1), div(-0.969, 0.919)))), abs(neg(abs(-0.805))))
fitness:  0.905559461075  - OOB_fitness:  0.94344057166  - Equation: neg(mul(inv(add(inv(min(min(X0, X0), min(X2, X0))), div(mul(inv(X1), inv(X1)), inv(log(X2))))), X2))
fitness:  0.891892597392  - OOB_fitness:  0.741803381846  - Equation: div(min(X2, add(sub(neg(sub(0.096, abs(-0.575))), min(min(X2, X1), X2)), min(log(max(X1, X2)), abs(sqrt(X1))))), max(div(min(log(sqrt(X0)), X1), min(sub(sub(X2, X2), add(X0, X0)), max(div(X0, X0), abs(X2)))), min(min(-0.779, X0), sub(X0, div(min(X0, 0.657), neg(X1))))))
fitness:  0.916011339582  - OOB_fitness:  0.828693451632  - Equation: inv(add(inv(min(min(X0, X0), min(X2, X0))), div(mul(inv(X1), inv(X1)), inv(log(X2)))))
fitness:  0.928574574624  - OOB_fitness:  0.70923178817  - Equation: div(min(X2, X0), inv(X1))
fitness:  0.925215746576  - OOB_fitness:  0.813373952474  - Equation: mul(sqrt(div(X0, div(-0.100, X1))), mul(log(sqrt(X2)), min(abs(X1), div(X2, -0.385))))
fitness:  0.944269300313  - OOB_fitness:  0.761873869578  - Equation: mul(neg(mul(div(X2, -0.678), neg(X1))), div(inv(sqrt(X1)), abs(inv(X0))))
fitness:  0.997  - OOB_fitness:  1.0  - Equation: div(X2, inv(min(X0, X1)))

The MAE is 3.77298192689e-13 which is perfect. But you make take other features and not the highest fitness. If you run several times this option, sometimes does not converge and MAE is not ~zero.

Using my proposed index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1], most of the results are the same equation so they are dependent/similar/same as mentioned. The MAE is 4.13535872212e-13, let's say perfect as expected. But just expected because the feature created is exactly the target equation. If we are looking for different added value features, using correlation coefficients seems to have more sense.

    |    Population Average   |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.62   0.358673286081       15   0.977149386266   0.923139960675     22.40s
   1     15.4   0.626041566983       11   0.999943066381   0.999969695391     36.42s
   2    23.91   0.737708119156       23              1.0              1.0     45.02s
Lars:  1.0 4.13535872212e-13 [ 0.  0.  0. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
fitness:  0.997  - OOB_fitness:  1.0  - Equation: mul(min(X1, X0), neg(X2))
fitness:  0.994449946246  - OOB_fitness:  0.999899881163  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.994449639861  - OOB_fitness:  0.999905066915  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.9944454337  - OOB_fitness:  0.999955978128  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.9944379586  - OOB_fitness:  0.999983214347  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.994433378141  - OOB_fitness:  0.999997279781  - Equation: sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
fitness:  0.994243226944  - OOB_fitness:  0.999897627664  - Equation: sub(X0, mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness:  0.993668918502  - OOB_fitness:  0.999729768531  - Equation: sub(X2, mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness:  0.993599015477  - OOB_fitness:  0.999705898121  - Equation: sub(log(X0), mul(min(X1, X0), add(abs(X2), sqrt(X2))))
fitness:  0.993499415109  - OOB_fitness:  0.999999998  - Equation: sub(min(abs(inv(neg(X1))), X2), mul(min(X1, X0), neg(X2)))

As summary, I am not sure that using correlation coefficients is the best choice as you may miss the highest fitness features. So you may be interested on using a simple but sparse selection of hall of fame _programs such as:

index_fitness = np.argsort([prog.fitness_ for prog in self._programs[-1]])[::-1]
self._best_programs = [self._programs[-1][index_fitness[i]] for i in np.arange(0,self.hall_of_fame,1.*self.hall_of_fame/self.n_components).astype(np.int)]

that uses a linear selection of hall_of_fame including always the first highest 'fitness' result. Or more complicated one, using logarithmic to give more importance to highest fitness, or ...

Thanks again @iblasi , you are correct in that the algorithm tries to pick out the least correlated features of the hall_of_fame so that they can then be used in another model without as much collinearity in the features. I do see your point in terms of potential removal of key programs in the group. In order to maintain the idea of removal of correlated programs, but maintain the best programs where possible, I think that my latest change might tick all boxes.

Currently the code checks for the most correlated pair of programs from the hall of fame, and then removes the one that is also most correlated with all other programs left in the group. This could easily remove the top programs from the field as they are also most likely to have many other correlated "clones" in the generation.

Instead I now propose to find the most correlated pair as before, but then remove the one with the worst fitness of the two -- and then iterate until the group is reduced to the required number of components. Take a look at the current version of #60 and let me know what you think.

Really appreciate your taking the time to dig into this one! :+1:

@trevorstephens just perfect. Good job!

Just one comment that does not apply to final result, and may not improve too much speed, but when you use unravel_index, you are finding the exact index (row,col) in correlation matrix, which it is the hall_of_fame variable where fitness values were sorted previously. So as the correlation matrix rows/cols are ordered based on fitness, the maximum value of unravel_index is the worst that has to be used to delete. I mean, you can write it in the following way:

            [...]
            while len(components) > self.n_components:
                most_correlated = np.unravel_index(np.argmax(correlations),
                                                   correlations.shape)
                if self._metric.greater_is_better:
                    worst = max(most_correlated)
                else:
                    worst = min(most_correlated)
                components.pop(worst)
                indices.remove(worst)
                correlations = correlations[:, indices][indices, :]
                indices = list(range(len(components)))
            self._best_programs = [self._programs[-1][i] for i in
                                   hall_of_fame[components]]

        return self

But as I said, it does not improve the result as it makes the same operations. Thank you for your good point of view and for finding a good solution.

That's a great observation @iblasi , simplifies the code a fair bit. And as the greater_is_better == False has the seq reversed, it doesn't even need that logic. Cheers :beers:

trevorstephens / gplearn

SymbolicTransformer does not create added value features as expected #50