smarie / python-m5p

An implementation of M5 and model trees in python, compliant with scikit-learn.
https://smarie.github.io/python-m5p/
BSD 3-Clause "New" or "Revised" License

linear model equations missing in export_text_m5 output #18

Open rootsmusic opened 8 months ago

rootsmusic commented 8 months ago

I'm running the line export_text_m5(reg.best_estimator_, out_file=None, node_ids=True). In an older version of your package, leaves with params>1 included linear model equations (e.g. LM1, LM2). Why are the equations not showing in the latest version of your package? Thanks.

smarie commented 8 months ago

Thanks a lot @rootsmusic! Could it be related to a change in scikit-learn that is silently caught in export_text_m5?

If you have a bit of time to investigate, that would be greatly appreciated. Otherwise I'll aim for "best effort" in the upcoming weeks.

rootsmusic commented 8 months ago

(@smarie I'm unable to investigate, because I'm a Python novice.) You're probably right. I'm taking Professor Brooks' online course, which credited you. His notebook used scikit-learn 0.24.1, and his cell was:

# GridSearch comes in a cross validation variety, so let's import that
from sklearn.model_selection import GridSearchCV

# Now, let's set a few different hyperparameters the M5Prime class can work with
# I'm going to choose to explore a few different depths, a few minimum number of
# samples per leaf, and a few pruning options
parameters={'max_depth':(3,4,5,6), 
            'min_samples_leaf':(1,3,6),
            'use_pruning':[False,True],
            }

# Now we can just train our model as if it were a regression model directly. Be
# aware that this will take a bit of time to run
reg=GridSearchCV(estimator=M5Prime(use_smoothing=False), param_grid=parameters, cv=10, scoring='r2')
reg.fit(X_train.values,y_train.values)

# Ok, that was a lot to talk about. The tree is just part of the analysis though, we
# also have those regression equations at each leaf node. Recall that a regression
# equation is a bunch of coefficients, one for each feature, that are effectively a
# weighting which when summed together will produce a target value - in this case our
# percentage of votes. Now we can get these equations in a few ways, but Sylvain has nicely
# included a function which prints out the tree nodes and the linear model equations for
# us as well.

%run m5p.py
print(export_text_m5(reg.best_estimator_, out_file=None, node_ids=True))
for i,v in enumerate(X_train.columns):
    print(f"{i}: {v}")

I'm running his notebook with scikit-learn 1.3.1, and I've replaced the %run m5p.py line with %run export.py. The tree still prints, but the output is missing the linear model equations.
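For anyone trying to reproduce this, it may help to record the exact library versions alongside the output. Below is a minimal sketch using only the standard library. The helper name installed_versions is made up for illustration, and the distribution name "m5py" is an assumption about how the package is installed. When the code is loaded with %run from a local file instead, it is simply reported as not installed.

```python
from importlib import metadata


def installed_versions(packages=("scikit-learn", "m5py", "numpy")):
    """Map each distribution name to its installed version, or None if absent.

    The default names are assumptions; adjust them to match your environment.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed via pip/conda (e.g. loaded with %run from a file).
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output next to the export_text_m5 result makes it easier to tell whether the difference comes from the scikit-learn upgrade (0.24.1 to 1.3.1) or from the copy of the m5p code being run.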

smarie commented 8 months ago

Thanks a lot @rootsmusic! I'll use this to take a look when I've got a bit of time.