micahmelling / auto-shap

MIT License

Errors while generating shapley values with classifiers #5

Open XavB64 opened 2 months ago

XavB64 commented 2 months ago

Hello,

I found that the first example with the ExtraTreesClassifier() doesn't work: "ValueError: Shape of passed values is (30, 2), indices imply (30, 30)"

It seems the library works for regression but not for classification.

Do you have any recommendations?

Regards

micahmelling commented 2 months ago

Hi there,

Thanks for reaching out. Sorry to hear you are having issues. I am able to get auto-shap to work with classification models, including the one in the documentation example.

Could you provide more details about the dataframe you are passing in and where the error is exactly occurring?

Best, Micah

XavB64 commented 2 months ago

Hello,

Thank you for your reply!

I'm using Python 3.9.13, auto-shap 0.3.2 (the latest version), and pandas 2.2.2.

I'm running the first documentation example:

>>> from auto_shap.auto_shap import generate_shap_values
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> x, y = load_breast_cancer(return_X_y=True, as_frame=True)
>>> model = ExtraTreesClassifier()
>>> model.fit(x, y)
>>> shap_values_df, shap_expected_value, global_shap_df = generate_shap_values(model, x)

Note that the second example with regression works without any issue.

The error in the first example occurs on the last line of code, when calling 'generate_shap_values'. Here is the complete error message:

{
    "name": "ValueError",
    "message": "Shape of passed values is (600, 2), indices imply (600, 30)",
    "stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\\AppData\\Local\\Temp\\ipykernel\\74634.py in <module>
      2 model = ExtraTreesClassifier()
      3 model.fit(x, y)
----> 4 shap_values_df, shap_expected_value, global_shap_df = generate_shap_values(model, x)

c:\\Users\\anaconda3\\lib\\site-packages\\auto_shap\\auto_shap.py in generate_shap_values(model, x_df, linear_model, tree_model, boosting_model, calibrated_model, regression_model, voting_or_stacking_model, use_agnostic, n_jobs, sample_size, k)
    312             voting_or_stacking_model
    313         )
--> 314     shap_values_df, shap_expected_value, global_shap_df = produce_raw_shap_values(
    315         model, x_df, use_agnostic, linear_model, tree_model, calibrated_model, boosting_model, regression_model,
    316         voting_or_stacking_model, n_jobs, sample_size, k

c:\\Users\\anaconda3\\lib\\site-packages\\auto_shap\\auto_shap.py in produce_raw_shap_values(model, x_df, use_agnostic, linear_model, tree_model, calibrated_model, boosting_model, regression_model, voting_or_stacking_model, n_jobs, sample_size, k)
    248         else:
    249             if tree_model:
--> 250                 return produce_shap_output_with_tree_explainer(model, x_df, boosting_model, regression_model, False,
    251                                                                n_jobs=n_jobs)
    252             elif linear_model:

c:\\Users\\anaconda3\\lib\\site-packages\\auto_shap\\auto_shap.py in produce_shap_output_with_tree_explainer(model, x_df, boosting_model, regression_model, linear_model, return_df, n_jobs)
    123     global_shap_df = generate_shap_global_values(shap_values, x_df)
    124     if return_df:
--> 125         shap_values_df = make_shap_df(shap_values, x_df)
    126         return shap_values_df, shap_expected_value, global_shap_df
    127     else:

c:\\Users\\anaconda3\\lib\\site-packages\\auto_shap\\utilities.py in make_shap_df(shap_values, x_df)
    152     :return: dataframe of SHAP values
    153     \"\"\"
--> 154     return pd.DataFrame(shap_values, columns=list(x_df))
    155 
    156 

c:\\Users\\anaconda3\\lib\\site-packages\\pandas\\core\\frame.py in __init__(self, data, index, columns, dtype, copy)
    825                 )
    826             else:
--> 827                 mgr = ndarray_to_mgr(
    828                     data,
    829                     index,

c:\\Users\\anaconda3\\lib\\site-packages\\pandas\\core\\internals\\construction.py in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    334     )
    335 
--> 336     _check_values_indices_shape_match(values, index, columns)
    337 
    338     if typ == \"array\":

c:\\Users\\anaconda3\\lib\\site-packages\\pandas\\core\\internals\\construction.py in _check_values_indices_shape_match(values, index, columns)
    418         passed = values.shape
    419         implied = (len(index), len(columns))
--> 420         raise ValueError(f\"Shape of passed values is {passed}, indices imply {implied}\")
    421 
    422 

ValueError: Shape of passed values is (600, 2), indices imply (600, 30)"
}

Thank you very much for your help :)

micahmelling commented 1 month ago

Sorry for the delay! I also had trouble under those package versions. I was able to get the example to run again with the libraries listed below.

I think the underlying issue is with changes in newer versions of numpy and the underlying SHAP library. I plan to address this in an upcoming release I have on the docket (which should also give better support for multiclass classification problems).
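One plausible reading of the traceback (an assumption, not something confirmed in this thread): older SHAP releases returned a list with one (n_samples, n_features) array per class for tree classifiers, while newer releases appear to return a single (n_samples, n_features, n_classes) array. Indexing that array as if it were a list selects one sample rather than one class, which would produce a (30, 2) block like the one in the first error message. The sketch below uses synthetic data in place of real SHAP output, and `select_class_shap_values` is a hypothetical helper, not part of auto-shap or SHAP.

```python
import numpy as np
import pandas as pd


def select_class_shap_values(shap_values, class_index=1):
    """Return a 2-D (n_samples, n_features) array for one class.

    Hypothetical helper for illustration; not part of auto-shap or SHAP.
    """
    if isinstance(shap_values, list):
        # Older layout: a list holding one (n_samples, n_features) array per class.
        return np.asarray(shap_values[class_index])
    values = np.asarray(shap_values)
    if values.ndim == 3:
        # Newer layout (assumed): (n_samples, n_features, n_classes).
        return values[:, :, class_index]
    # Already 2-D, e.g. a regression model.
    return values


# Synthetic stand-in for SHAP output: 5 samples, 30 features, 2 classes.
rng = np.random.default_rng(0)
columns = [f"feature_{i}" for i in range(30)]
stacked = rng.normal(size=(5, 30, 2))
per_class = [stacked[:, :, 0], stacked[:, :, 1]]

# Treating the stacked array like a list picks a sample, not a class.
wrong = stacked[1]  # shape (30, 2): cannot be labelled with 30 feature columns

# Selecting along the last axis restores the expected 2-D layout.
fixed = select_class_shap_values(stacked)
shap_df = pd.DataFrame(fixed, columns=columns)
```

Passing `wrong` to `pd.DataFrame(..., columns=columns)` raises a `ValueError` about mismatched shapes, whereas `fixed` has one row per sample and one column per feature. The helper returns the same result for the list layout (`per_class`), so a caller would not need to know which SHAP version produced the values.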

auto-shap==0.3.2
cloudpickle==3.0.0
contourpy==1.3.0
cycler==0.12.1
fonttools==4.54.1
importlib_resources==6.4.5
joblib==1.4.2
kiwisolver==1.4.7
llvmlite==0.43.0
matplotlib==3.9.2
numba==0.60.0
numpy==1.26.4
packaging==24.1
pandas==2.2.2
pillow==10.4.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
pytz==2024.2
scikit-learn==1.5.2
scipy==1.13.1
shap==0.44.0
six==1.16.0
slicer==0.0.7
threadpoolctl==3.5.0
tqdm==4.66.5
tzdata==2024.2
zipp==3.20.2
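
To see how a local environment differs from the working one listed above, a small check along these lines may help. It is only a sketch: the choice of which pins to compare is an assumption about which packages matter most here, and `installed_version` and `compare_to_known_good` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the pins from the working environment listed above.
# Which packages matter most is an assumption; extend as needed.
KNOWN_GOOD = {
    "auto-shap": "0.3.2",
    "numpy": "1.26.4",
    "pandas": "2.2.2",
    "scikit-learn": "1.5.2",
    "shap": "0.44.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def compare_to_known_good(pins=KNOWN_GOOD, lookup=installed_version):
    """Return {package: (installed, expected)} for every package that differs."""
    differences = {}
    for name, expected in pins.items():
        installed = lookup(name)
        if installed != expected:
            differences[name] = (installed, expected)
    return differences


if __name__ == "__main__":
    for name, (installed, expected) in compare_to_known_good().items():
        print(f"{name}: installed {installed}, known-good {expected}")
```

An empty result means the compared packages match the pinned versions; any printed line shows a package whose version differs or is missing.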