hengzhe-zhang / EvolutionaryForest

An open source python library for automated feature engineering based on Genetic Programming
GNU Lesser General Public License v3.0

get_feature_importance failing when there are more features - likely issue with latex parsing #83

Closed minghao51 closed 1 year ago

minghao51 commented 1 year ago

Description

Once I have more than a certain number of features, the LaTeX parsing inside get_feature_importance typically fails,

while `get_feature_importance(r, simple_version=True)` still works.

There are several types of errors, though (screenshots of what I got are listed here).

image

image

It seems to be an issue with parsing the lambda operations into math symbols: sometimes it misses a feature name, and sometimes it runs into issues with other lambda descriptions.

Is there a feature naming convention I should follow to avoid these?

Code

To reproduce it with example code (the tutorial code, modified to use more features):

import random
import string
import pandas as pd
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

from evolutionary_forest.forest import EvolutionaryForestRegressor

random.seed(0)
np.random.seed(0)

# Generate dataset
X, y = make_friedman1(n_samples=500, n_features=17, random_state=0)

# Convert numpy arrays to pandas dataframe
X = pd.DataFrame(X, columns=list(string.ascii_uppercase[:X.shape[1]]))
y = pd.DataFrame(y, columns=['Target'])

# Split dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train Evolutionary Forest
r = EvolutionaryForestRegressor(max_height=5, normalize=True, select='AutomaticLexicase',
                                gene_num=10, boost_size=100, n_gen=20, n_pop=200, cross_pb=1,
                                base_learner='Random-DT', verbose=True, n_process=1)
r.fit(x_train, y_train)

from evolutionary_forest.utils import get_feature_importance, plot_feature_importance

code_importance_dict = get_feature_importance(r)

hengzhe-zhang commented 1 year ago

It took me some time to identify the cause. The problem comes from how parse_expr in sympy treats certain feature names.

To illustrate, here is an example in which the first naming convention raises an exception, while the other two parse correctly and should be used instead.

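from sympy import parse_expr

# The first expression raises an exception; the other two parse correctly.
for expression in ['QQ*QD', 'Q1*Q2', 'X1*X2']:
    try:
        print(parse_expr(expression))
    except Exception as exc:
        # Report the failure instead of silently swallowing it
        print(f'{expression!r} failed: {exc!r}')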

At the moment, I am unsure how to avoid this problem, so please refrain from using the problematic naming convention. Thank you for your understanding.
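
One possible workaround, offered as a sketch and not as what the library does: by default parse_expr resolves names against the namespace produced by `from sympy import *`, so a feature called QQ, E, I, N, O, Q or S can be read as a SymPy object instead of a plain symbol. Passing a `local_dict` that maps every feature name to a `Symbol` should make those names parse as ordinary variables. The helper names below (`sympy_reserved_names`, `find_colliding_names`, `parse_with_feature_symbols`) are hypothetical and not part of EvolutionaryForest, and the approach assumes the feature names are valid Python identifiers.

```python
from sympy import Symbol, parse_expr


def sympy_reserved_names():
    """Public names defined by `from sympy import *` (parse_expr's default namespace)."""
    namespace = {}
    exec("from sympy import *", namespace)
    return {name for name in namespace if not name.startswith("_")}


def find_colliding_names(feature_names):
    """Return the feature names that SymPy might interpret as its own objects."""
    reserved = sympy_reserved_names()
    return [name for name in feature_names if name in reserved]


def parse_with_feature_symbols(expression, feature_names):
    """Parse an expression while forcing every feature name to be a plain Symbol."""
    # Entries in local_dict take precedence over SymPy's own names.
    local_dict = {name: Symbol(name) for name in feature_names}
    return parse_expr(expression, local_dict=local_dict)


features = ["QQ", "QD", "E", "I", "X1"]
print(find_colliding_names(features))

expr = parse_with_feature_symbols("QQ*QD + E*I", features)
print(expr.free_symbols)
```

The collision check is deliberately conservative: it flags any name that exists in SymPy's namespace, even if parse_expr would happen to treat it as a symbol anyway.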

minghao51 commented 1 year ago

I was wondering about what naming convention I should follow.

So, I guess it should be:

  • no double alphabets
  • no underscore/space etc

Would it make sense to have a check for it, or to parse the feature_names into compatible ones?

hengzhe-zhang commented 1 year ago

I was wondering about what naming convention I should follow.

So, I guess it should be:

  • no double alphabets
  • no underscore/space etc

Would it make sense to have a check for it, or to parse the feature_names into compatible ones?

That's great advice. I would definitely consider implementing a check in the future.
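
Until such a check exists, one way to sidestep the naming question is to rename the features to aliases of the form X0, X1, ... before fitting (the X1*X2 pattern shown earlier in this thread parses correctly) and to map the aliases back when reading the results. This is only a minimal sketch: `make_safe_aliases` and `restore_names` are hypothetical helpers, not functions of EvolutionaryForest, and the expression string in the example is made up for illustration.

```python
import re


def make_safe_aliases(feature_names):
    """Map each original feature name to an alias such as X0, X1, ..."""
    return {name: f"X{i}" for i, name in enumerate(feature_names)}


def restore_names(expression, aliases):
    """Replace aliases in an expression string with the original feature names."""
    reverse = {alias: name for name, alias in aliases.items()}
    # Match whole aliases only, so that X1 is not rewritten inside X10.
    pattern = re.compile(r"\bX\d+\b")
    return pattern.sub(lambda m: reverse.get(m.group(0), m.group(0)), expression)


original_names = ["E", "I", "QQ", "my feature"]
aliases = make_safe_aliases(original_names)
print(aliases)

# Hypothetical expression written in terms of the aliases
print(restore_names("X0*X2 + X3", aliases))
```

With a pandas DataFrame, the aliases could be applied through `X.rename(columns=aliases)` before calling `fit`.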