shap / shap

A game theoretic approach to explain the output of any machine learning model.
https://shap.readthedocs.io
MIT License

Error with Pyspark GBTClassifier #884

Open allard-jeff opened 4 years ago

allard-jeff commented 4 years ago

@QuentinAmbard

I just installed shap from PyPI (0.32.0), and running a version of your test still produces the same error, shown below. Is there something I am missing in the use of shap with a pyspark model?

import pyspark
print(pyspark.__version__)
import shap
print(shap.__version__)
import sklearn.datasets
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier, GBTClassifier
import numpy as np
import pandas as pd

iris_sk = sklearn.datasets.load_iris()
iris = pd.DataFrame(data= np.c_[iris_sk['data'], iris_sk['target']], columns= iris_sk['feature_names'] + ['target'])[:100]
spark = SparkSession.builder.config(conf=SparkConf().set("spark.master", "local[*]")).getOrCreate()

col = ["sepal_length","sepal_width","petal_length","petal_width","type"]
iris = spark.createDataFrame(iris, col)
iris = VectorAssembler(inputCols=col[:-1],outputCol="features").transform(iris)
iris = StringIndexer(inputCol="type", outputCol="label").fit(iris).transform(iris)

classifier = GBTClassifier(labelCol="label", featuresCol="features")
model = classifier.fit(iris)
explainer = shap.TreeExplainer(model)
X = pd.DataFrame(data=iris_sk.data, columns=iris_sk.feature_names)[:100] # pylint: disable=E1101
shap_values = explainer.shap_values(X)

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-31-f47b3a56c25f> in <module>
     23 explainer = shap.TreeExplainer(model)
     24 X = pd.DataFrame(data=iris_sk.data, columns=iris_sk.feature_names)[:100] # pylint: disable=E1101
---> 25 shap_values = explainer.shap_values(X)

/mnt1/anaconda3/lib/python3.7/site-packages/shap/explainers/tree.py in shap_values(self, X, y, tree_limit, approximate, check_additivity)
    283 
    284         if check_additivity and self.model_output == "margin":
--> 285             self.assert_additivity(out, self.model.predict(X))
    286 
    287         return out

/mnt1/anaconda3/lib/python3.7/site-packages/shap/explainers/tree.py in predict(self, X, y, output, tree_limit)
    785             import pyspark
    786             #TODO support predict for pyspark
--> 787             raise NotImplementedError("Predict with pyspark isn't implemented")
    788 
    789         # see if we have a default tree_limit in place.

NotImplementedError: Predict with pyspark isn't implemented
allard-jeff commented 4 years ago

@QuentinAmbard Has anyone else run this code successfully?

QuentinAmbard commented 4 years ago

That's almost the code from the unit test, so yes it should run without error. I'll try to debug that this week, maybe there is an issue with 0.32.0 ...

Ekkalak-T commented 4 years ago

This also happens to me. It used to work with RandomForest in version 0.30.2.

I'll try to revert and check again ..

caspiDoron commented 4 years ago

Hello, I'm having the same problem with shap version 0.32.1. I also tried previous versions, but I get the same error since the commit which added GBT and RF.

I copied the unit test function test_pyspark_classifier_decision_tree(), ran it in my environment, and got the same error. Could it be that the build was done without running the unit tests?

Ekkalak-T commented 4 years ago

> Hello, I'm having the same problem with shap version 0.32.1. I also tried previous versions, but I get the same error since the commit which added GBT and RF.
>
> I copied the unit test function test_pyspark_classifier_decision_tree(), ran it in my environment, and got the same error. Could it be that the build was done without running the unit tests?

@caspiDoron You may try version 0.30.2. it works for me.

caspiDoron commented 4 years ago

> > Hello, I'm having the same problem with shap version 0.32.1. I also tried previous versions, but I get the same error since the commit which added GBT and RF. I copied the unit test function test_pyspark_classifier_decision_tree(), ran it in my environment, and got the same error. Could it be that the build was done without running the unit tests?
>
> @caspiDoron You may try version 0.30.2. It works for me.

Thanks, but it seems to work only for DT; Random Forest is failing the tests with: AssertionError: SHAP values don't sum to model output for class0!

GBT is not supported, which is the one I use...

QuentinAmbard commented 4 years ago

I just re-ran the unit test and something is indeed broken. As a workaround, you can set check_additivity=False when computing the shap values.

It's a new check that has been added, and it calls the predict function. I suspect this wasn't caught in the unit tests because spark isn't in the env and the test is skipped in that case.
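
For reference, a minimal sketch of that workaround applied to the example at the top of this issue (assuming explainer and X are defined as in the first comment):

# check_additivity is what triggers the (unimplemented) pyspark predict call,
# so skipping it avoids the NotImplementedError.
shap_values = explainer.shap_values(X, check_additivity=False)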

caspiDoron commented 4 years ago

Thank you @QuentinAmbard, it is working with this workaround.

QuentinAmbard commented 4 years ago

Great! I suggest we do the following:

  1. Create a small fix to disable check_additivity for spark models (I'll commit that soon as a fix to this issue)
  2. Make sure the tests are launched with the spark lib in the env to prevent this kind of issue (will create a new issue to fix that)
  3. Longer term / more viable: implement prediction with spark (I'll create a new feature request too)

slundberg commented 4 years ago

Thanks for checking into this @QuentinAmbard! I just pushed an updated tolerance check for additivity to address #887, but I suspect this might be a true error that this new check uncovered. Happy to help work through it on the PR

ppakawatk commented 4 years ago

Hi. I can run the example code properly, but I don't fully understand how shap_values actually works. Can anyone please explain why shap_values takes 'X' as data with one feature per column (i.e. sepal_length, sepal_width, petal_length, petal_width, each in a separate column), while the GBTClassifier model actually takes the features in one column (named 'features')?

How can shap_values understand the difference between how the model was trained (features in one column) and how it is being explained?

Thank you sir.

QuentinAmbard commented 4 years ago

shap_values takes a pandas DataFrame containing one column per feature. GBTClassifier is a Spark classifier taking a Spark DataFrame to be trained. Spark works with one column containing an array of all the features you are using (that's what the VectorAssembler does).

Once the model is trained, shap will explain it using shap_values(...).

You have to convert your data into a pandas DataFrame to explain it. If your dataset is too big, you can easily create a Spark pandas UDF to run shap_values in a distributed fashion.
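
For illustration, here is a rough sketch of that distributed approach (my own example, not from this thread): it assumes Spark >= 3.0 for mapInPandas, that the column order matches the VectorAssembler inputCols, and that the explainer can be pickled and shipped to the executors (which, for spark models, is exactly the serialization issue discussed further down):

from typing import Iterator
import pandas as pd

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

def shap_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Runs on the executors: explain each batch of rows and return the SHAP values.
    for pdf in batches:
        values = explainer.shap_values(pdf[feature_cols], check_additivity=False)
        yield pd.DataFrame(values, columns=feature_cols)

# `iris` is the Spark DataFrame from the first comment.
schema = ", ".join(f"{c} double" for c in feature_cols)
shap_df = iris.select(feature_cols).mapInPandas(shap_batches, schema=schema)
shap_df.show()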

ppakawatk commented 4 years ago

> shap_values takes a pandas DataFrame containing one column per feature. GBTClassifier is a Spark classifier taking a Spark DataFrame to be trained. Spark works with one column containing an array of all the features you are using (that's what the VectorAssembler does).
>
> Once the model is trained, shap will explain it using shap_values(...).
>
> You have to convert your data into a pandas DataFrame to explain it. If your dataset is too big, you can easily create a Spark pandas UDF to run shap_values in a distributed fashion.

Thanks @QuentinAmbard. I still wonder how shap_values knows which column in the pandas DataFrame corresponds to which element of the Spark feature vector (from when the model was trained).

QuentinAmbard commented 4 years ago

I'm using the index of the features, so I assume the order of the pandas columns must be the same as the order of the features added in the VectorAssembler of your Spark DataFrame. Probably worth mentioning that in the documentation.

https://github.com/slundberg/shap/blob/master/shap/explainers/tree.py#L951
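
In other words (my own illustration, using the column names from the example at the top of this issue), something like:

# TreeExplainer matches features by position, so the pandas columns must be
# listed in the same order as the VectorAssembler inputCols.
assembler_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = iris.select(assembler_cols).toPandas()
shap_values = explainer.shap_values(X, check_additivity=False)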

sacmax commented 4 years ago

Hi all, is there any solution for NotImplementedError: "CategoricalSplit are not yet implemented" in pyspark?

amandolesi commented 4 years ago

@QuentinAmbard Using the iris example, I try to parallelize the shap values calculation this way:

from pyspark.sql import Row
import pandas as pd

iris_shap = iris.drop('type', 'features', 'label').repartition(10)
X_columns = iris_shap.columns
explainer = shap.TreeExplainer(model)

def calculate_shap(rows, X_columns, explainer):
    # Build a pandas DataFrame from the partition's rows and explain it.
    a = pd.DataFrame(rows, columns=X_columns)
    shap_values = explainer.shap_values(a)
    return [Row(*[float(f) for f in shap_values[i]]) for i in range(len(shap_values))]

iris_shap.rdd.mapPartitions(lambda j: calculate_shap(j, X_columns, explainer)).toDF(X_columns)

If model is an sklearn.ensemble.GradientBoostingClassifier there is no problem, but when it is a pyspark.ml.classification.GBTClassifier I get this error:

PicklingError: Could not serialize object: Py4JError: An error occurred while calling o135.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Any suggestion?

QuentinAmbard commented 4 years ago

The explainer can't be serialized, probably because we keep spark references inside it. I'll try to have a look. As a workaround, maybe you can recreate the explainer in each partition?

amandolesi commented 4 years ago

I tried passing the model to calculate_shap and computing the explainer inside the partition, but I get the same error.

annagarkar commented 4 years ago

@QuentinAmbard

I am trying to get shap to work with a pyspark GBT classifier. I got my features as a numpy array X and then tried (as in the example):

>>> model = pyspark.ml.classification.GBTClassificationModel.load("/path/to/trained/model")
>>> explainer = shap.TreeExplainer(model)
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
>>> sv = explainer.shap_values(X)

It gave the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 304, in shap_values
    assert self.model.fully_defined_weighting, "The background dataset you provided does not cover all the leaves in the model, " \
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".

I did not provide a background dataset, so I don't understand why it wants me to provide a larger one. Also, the matrix X contains my entire training dataset, so I don't understand how it could not cover all the leaves in the model. Am I doing something obviously wrong?

Then, when I tried using feature_perturbation="interventional", it gave a different error:

>>> explainer = shap.TreeExplainer(model, data=X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 151, in __init__
    self.expected_value = self.model.predict(self.data).mean(0)
  File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 972, in predict
    raise NotImplementedError("Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.")
NotImplementedError: Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.

Also, if running the predictions with spark is complicated to implement, it might be worth adding the ability for the user to supply the expected predictions for validation.

QuentinAmbard commented 4 years ago

You should get this error when your tree is built with a leaf that has no data inside. If you get this error, I assume you are using shap on a model built with a small dataset? Can you open another issue to track the implementation of predictions with spark, to make it work with interventional?

annagarkar commented 4 years ago

You mean that not all paths to the leaf are the same length, so some of what would otherwise be intermediate nodes have no children (leaving those phantom child nodes empty)?

Also, I created Issue #1192 to track spark predictions.

MatteoManzari commented 4 years ago

@QuentinAmbard

Is there any news about @amandolesi's error? Any new suggestions?

Thank you.

jennyivy commented 3 years ago

> Hi all, is there any solution for NotImplementedError: "CategoricalSplit are not yet implemented" in pyspark?

I run into the same problem, did you find out the solution to it?

QuentinAmbard commented 3 years ago

I haven't had time these last weeks to figure out exactly what's causing this. I suspect there is a reference to spark kept somewhere that breaks the serialization of the tree explainer with a spark model. I'll have a look when I get some time, but it shouldn't be a big deal, especially if serialisation is working with other models.

guidiandrea commented 3 years ago

@QuentinAmbard

Hello Quentin, to recap and also give you some additional feedback, I performed some tests using a local standalone instance of spark.

As you mentioned a serialization error, I tried pickling a pyspark.ml.classification.RandomForestClassificationModel object, basically a fitted pyspark random forest, and I got a Py4J error, the same one that @amandolesi reported above.

In explainers/tree.py, TreeExplainer class, line 695:

elif "pyspark.ml" in str(type(model)):
    assert_import("pyspark")
    self.original_model = model

So this serialization problem propagates. I tried commenting out "self.original_model = model" and I was then able to pickle the TreeExplainer object with a PySpark model. Of course it is a workaround, but predictions are not implemented with PySpark yet, so commenting out that line for the time being should not be an issue. What do you think about it?

QuentinAmbard commented 3 years ago

Thanks @guidiandrea! Absolutely, that's what I had in mind too, but I still haven't found time to do the change :/ The original_model was indeed kept in order to implement predictions (https://github.com/slundberg/shap/issues/1192), but I think we should find another way that avoids breaking serialisation with spark models. Would you like to do the PR?

guidiandrea commented 3 years ago

Here you are: https://github.com/slundberg/shap/pull/1307

Thank you @QuentinAmbard!

QuentinAmbard commented 3 years ago

@slundberg I think we can now close this issue as everything should be solved with #1313

antonwnk commented 2 years ago

Looks like this should be closed @allard-jeff

chengyin38 commented 2 years ago

I am still getting the NotImplementedError: "CategoricalSplit are not yet implemented" error. I am using shap==0.39.0 and Spark 3.

I also got the same error using decision trees.

Code:

from pyspark.ml import Pipeline
import shap

pipeline = Pipeline(stages=[string_indexer, vector_assembler, model])
pipeline_model = pipeline.fit(train_df)
explainer = shap.explainers.Tree(pipeline_model.stages[-1])

AllardJM commented 2 years ago

@chengyin38 The issue is that shap cannot handle categorical splits. So, in the PySpark pre-processing, you really need to drop the metadata from the DataFrame that PySpark implicitly uses to determine that a feature is a categorical variable. It seems this is done with df = df.rdd.toDF(); see the sketch below. However, string indexing without categorical splits might not be an optimal approach to the modeling.
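
For what it's worth, a rough sketch of that workaround applied to @chengyin38's snippet (my own reconstruction; string_indexer, vector_assembler, model and train_df are assumed from the snippet above):

# Run the pre-processing stages first, then round-trip through the RDD, which
# drops the ML attribute metadata PySpark uses to decide a feature is categorical;
# a GBT fitted on the stripped DataFrame should contain no CategoricalSplit nodes.
prepped = vector_assembler.transform(string_indexer.fit(train_df).transform(train_df))
prepped = prepped.rdd.toDF(prepped.schema.fieldNames())  # metadata is lost here
gbt_model = model.fit(prepped)
explainer = shap.explainers.Tree(gbt_model)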

AllardJM commented 2 years ago

I will also note that, in my experience, it was necessary to remove the vectors of the one-hot encoding: I broke them out into binary features (see the sketch below). After that (along with the step above), shap was able to run effectively on a PySpark tree model.
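
In case it helps, one way to do that split (my own sketch; assumes Spark >= 3.0 and an encoded column named ohe_vec whose size is known):

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

n_categories = 3  # size of the one-hot vector, an assumption for illustration
df = df.withColumn("ohe_arr", vector_to_array("ohe_vec"))
df = df.select(
    "*", *[F.col("ohe_arr")[i].alias(f"ohe_{i}") for i in range(n_categories)]
).drop("ohe_arr", "ohe_vec")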

AllardJM commented 2 years ago

@QuentinAmbard I am finding that with Shap 0.39.0 this error continues with pyspark GBT. The weird issue is that this error only seems to happen when a saved GBT is loaded. If I use the original model in memory, the error does not occur. Any ideas?

    assert self.model.fully_defined_weighting, "The background dataset you provided does " \
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".

AnastasiaProkaieva commented 2 years ago

Any idea how to fix this?

Model type not yet supported by TreeExplainer: <class 'sparkdl.xgboost.xgboost_core.XgboostRegressorModel'>

I am trying to run this type of code:

xgboost = XgboostRegressor(**params)
pipeline = Pipeline(stages=[stringIndexer, vecAssembler, xgboost])
pipelineModel = pipeline.fit(trainDF)
explainer = shap.TreeExplainer(pipelineModel.stages[-1])

Update: shap.TreeExplainer(pipelineModel.stages[-1].get_booster()) does the trick!

BDon-Tan commented 2 years ago

> @QuentinAmbard I am finding that with Shap 0.39.0 this error continues with pyspark GBT. The weird issue is that this error only seems to happen when a saved GBT is loaded. If I use the original model in memory, the error does not occur. Any ideas?
>
>     assert self.model.fully_defined_weighting, "The background dataset you provided does " \
> AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".

Any progress with this problem? @QuentinAmbard

github-actions[bot] commented 2 months ago

This issue has been inactive for two years, so it's been automatically marked as 'stale'.

We value your input! If this issue is still relevant, please leave a comment below. This will remove the 'stale' label and keep it open.

If there's no activity in the next 90 days the issue will be closed.