jjbrophy47 / tree_influence

Influence Estimation for Gradient-Boosted Decision Trees
Apache License 2.0

[bug] training xgboost doesn't work with DataFrame, only numpy array #1

Open Yarden234 opened 2 years ago

Yarden234 commented 2 years ago

Hello, and thank you for this package. I ran into a problem when using an XGBoost model that was trained on a DataFrame. Here is my code:

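from xgboost import XGBClassifier
from tree_influence.explainers import BoostIn

# load_csv is the reporter's own helper (not shown); it returns pandas objects
X_train, X_test = load_csv('X_train'), load_csv('X_test')
y_train, y_test = load_csv('y_train'), load_csv('y_test')

# NumPy copies of the same data
X_train_val, y_train_val = X_train.values, y_train.values.squeeze()
X_test_val, y_test_val = X_test.values, y_test.values.squeeze()

# train on the DataFrame (this is the failing case)
model = XGBClassifier(tree_method='hist')
model.fit(X_train, y_train)

# fit influence estimator
explainer = BoostIn().fit(model, X_train, y_train)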

This produces the following exception:

Traceback (most recent call last):
  File "/home/jupyter/owlytics-data-science/influence/influence.py", line 35, in <module>
    explainer = BoostIn().fit(model, X_train, y_train)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/tree_influence/explainers/boostin.py", line 44, in fit
    super().fit(model, X, y)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/tree_influence/explainers/base.py", line 31, in fit
    self.model_ = parse_model(model, X, y)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/tree_influence/explainers/parsers/__init__.py", line 33, in parse_model
    trees, params = parse_xgb_ensemble(model)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/tree_influence/explainers/parsers/parser_xgb.py", line 17, in parse_xgb_ensemble
    trees = np.array([_parse_xgb_tree(tree_str) for tree_str in string_data], dtype=np.dtype(object))
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/tree_influence/explainers/parsers/parser_xgb.py", line 17, in <listcomp>
    trees = np.array([_parse_xgb_tree(tree_str) for tree_str in string_data], dtype=np.dtype(object))
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/tree_influence/explainers/parsers/parser_xgb.py", line 88, in _parse_xgb_tree
    node_dict = _parse_line(line)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/tree_influence/explainers/parsers/parser_xgb.py", line 190, in _parse_line
    res['feature'], res['threshold'] = _parse_decision_node_line(line)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/tree_influence/explainers/parsers/parser_xgb.py", line 201, in _parse_decision_node_line
    feature_ndx = int(feature_str[1:])
ValueError: invalid literal for int() with base 10: 'ecent_beta_blockers_change'

However, training on X_train_val and y_train_val (which are NumPy arrays) works perfectly. It would be great if you could support training with a DataFrame as well. Thanks again!
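For context, the last frame of the traceback calls int(feature_str[1:]), which only works when the dumped trees use XGBoost's default feature names (f0, f1, ...). A model trained on a DataFrame dumps the column names instead, so stripping the first character leaves a non-numeric string. The sketch below is not the package's parser; parse_split and its feature_names argument are hypothetical names used only to illustrate one way both cases could be handled.

```python
import re

# Matches the split condition in a decision-node line of XGBoost's text dump,
# e.g. "0:[f2<0.5] yes=1,no=2,missing=1" or "0:[age<30] yes=1,no=2,missing=1".
_SPLIT_RE = re.compile(r"\[(?P<feature>[^<\]]+)<(?P<threshold>[^\]]+)\]")


def parse_split(line, feature_names=None):
    """Return (feature_index, threshold) for one decision-node line.

    Hypothetical helper: feature_names is the ordered list of column
    names the model was trained with, or None for default names.
    """
    match = _SPLIT_RE.search(line)
    if match is None:
        raise ValueError(f"not a decision-node line: {line!r}")
    feature = match.group("feature")
    threshold = float(match.group("threshold"))

    # Named columns: look the name up first, so a real column that
    # happens to be called "f2" is resolved by position, not by its digits.
    if feature_names is not None and feature in feature_names:
        return list(feature_names).index(feature), threshold

    # Default names: "f0", "f1", ... map directly to the column index.
    if re.fullmatch(r"f\d+", feature):
        return int(feature[1:]), threshold

    raise ValueError(f"unknown feature {feature!r}; pass feature_names")
```

With this sketch, parse_split("0:[age<30] yes=1,no=2,missing=1", ["height", "age"]) returns (1, 30.0), while the same line without feature_names raises a ValueError that names the offending feature.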

jjbrophy47 commented 2 years ago

Hi Yarden234! Thanks for bringing this up. I believe I've fixed this issue now in version 0.1.1. Please give it a try and feel free to open this issue back up if it's not working. Thanks again!

aclarkse commented 7 months ago

Hi there,

I am still encountering this error. Could you take another look? Thanks!
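Until this is confirmed fixed, the workaround from the original report still applies: train the model and fit the explainer on NumPy arrays rather than on the DataFrame. A minimal sketch, assuming pandas-style inputs; to_numpy here is a hypothetical helper, not part of tree_influence.

```python
import numpy as np


def to_numpy(data):
    """Convert a DataFrame/Series (or any array-like) to a NumPy array."""
    # pandas objects expose .to_numpy(); fall back to np.asarray otherwise.
    if hasattr(data, "to_numpy"):
        return data.to_numpy()
    return np.asarray(data)


# Hypothetical usage, mirroring the variable names in the report:
# X_train_val = to_numpy(X_train)
# y_train_val = to_numpy(y_train).squeeze()
# model.fit(X_train_val, y_train_val)
# explainer = BoostIn().fit(model, X_train_val, y_train_val)
```

Because the model never sees column names this way, the dumped trees use the default f0, f1, ... names that the parser expects.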

jjbrophy47 commented 5 months ago

Hi @aclarkse, can you provide a fully reproducible example, please?