allard-jeff opened this issue 4 years ago
@QuentinAmbard Has anyone else run this code successfully?
That's almost the code from the unit test, so yes, it should run without error. I'll try to debug it this week; maybe there is an issue with 0.32.0...
This also happens to me. It used to work with RandomForest in version 0.30.2.
I'll try to revert and check again...
Hello, I'm having the same problem with shap 0.32.1. I also tried previous versions, but I get the same error starting from the commit that added GBT and RF support.
To check, I copied the whole unit test function test_pyspark_classifier_decision_tree() and ran it in my environment, and I got the same error. Could it be that the build was done without running the unit tests?
@caspiDoron You may try version 0.30.2; it works for me.
Thanks, but it seems to work only for DecisionTree; RandomForest fails the tests with: AssertionError: SHAP values don't sum to model output for class0!
And GBT, which is the one I use, is not supported there...
I just re-ran the unit test and something is broken indeed. As a workaround you can set check_additivity=False when computing the shap_values (see the sketch below).
It's a new check that has been added, and it calls the predict function. I suspect this wasn't caught in the unit tests because Spark isn't in the CI environment, so the test is skipped there.
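A minimal sketch of the workaround, assuming model is a fitted pyspark.ml tree model and X is a pandas DataFrame with one column per feature:

import shap

explainer = shap.TreeExplainer(model)
# Skip the additivity check, which calls predict() and fails for pyspark models
shap_values = explainer.shap_values(X, check_additivity=False)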
Thank you @QuentinAmbard, it works with this workaround.
Great! I suggest we do the following:
Thanks for checking into this @QuentinAmbard! I just pushed an updated tolerance check for additivity to address #887, but I suspect this might be a true error that this new check uncovered. Happy to help work through it on the PR
Hi. I can run the example code properly, but I don't fully understand how shap_values actually works. Can anyone please explain why shap_values takes X in the form of one feature per column (i.e. sepal_length, sepal_width, petal_length, petal_width, each in its own column), while the GBTClassifier model actually takes all features in one column (named 'features')?
How does shap_values bridge the difference between how the model was trained (features in one column) and the data it is asked to explain?
Thank you.
shap_values takes a pandas DataFrame containing one column per feature. GBTClassifier is a Spark classifier that takes a Spark DataFrame for training. Spark works with one column containing an array of all the features you are using (that's what the VectorAssembler produces).
Once the model is trained, shap will explain it using shap_values(...).
You have to convert your data into a pandas DataFrame to explain it. If your dataset is too big, you can easily create a Spark pandas UDF to run shap_values in a distributed fashion (rough sketch below).
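A rough sketch of that distributed approach using Spark 3's mapInPandas, assuming df is a Spark DataFrame with one column per feature (same names and order as the VectorAssembler inputs) and that the explainer can be pickled (see the serialization discussion further down); shap_udf and feature_cols are hypothetical names:

import pandas as pd
import shap

feature_cols = df.columns
explainer = shap.TreeExplainer(model)  # shipped to the executors, so it must be picklable

def shap_udf(batches):
    # batches is an iterator of pandas DataFrames, one per Arrow batch
    for pdf in batches:
        values = explainer.shap_values(pdf[feature_cols], check_additivity=False)
        if isinstance(values, list):  # classifiers return one array per class
            values = values[1]        # e.g. keep the positive class
        yield pd.DataFrame(values, columns=feature_cols)

shap_df = df.mapInPandas(shap_udf, schema=", ".join(f"{c} double" for c in feature_cols))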
Thanks @QuentinAmbard. I still wonder how shap_values knows which column of the pandas DataFrame corresponds to which element of the Spark feature vector the model was trained on.
I'm using the index of the features, so the order of the pandas columns must be the same as the order of the features added in the VectorAssembler of your Spark DataFrame (example below). Probably worth mentioning this in the documentation.
https://github.com/slundberg/shap/blob/master/shap/explainers/tree.py#L951
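For example, a small sketch of keeping the ordering consistent, assuming assembler is the VectorAssembler used at training time:

# select the columns in the same order the assembler consumed them
X = spark_df.select(assembler.getInputCols()).toPandas()
shap_values = explainer.shap_values(X)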
Hi all, is there any solution for NotImplementedError: CategoricalSplit are not yet implemented in pyspark?
@QuentinAmbard Using the iris example, I try to parallelize the shap values calculation this way:
import pandas as pd
from pyspark.sql import Row

iris_shap = iris.drop('type', 'features', 'label').repartition(10)
X_columns = iris_shap.columns
explainer = shap.TreeExplainer(model)

def calculate_shap(rows, X_columns, explainer):
    # collect the partition's rows into a pandas DataFrame and explain them
    a = pd.DataFrame(rows, columns=X_columns)
    shap_values = explainer.shap_values(a)
    return [Row(*[float(f) for f in shap_values[i]]) for i in range(len(shap_values))]

iris_shap.rdd.mapPartitions(lambda j: calculate_shap(j, X_columns, explainer)).toDF(X_columns)
If model is a sklearn.ensemble.GradientBoostingClassifier there is no problem, but when it is a pyspark.ml.classification.GBTClassifier I get this error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o135.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Any suggestions?
The explainer can't be serialized, probably because we keep Spark references inside it. I'll try to have a look. As a workaround, maybe you can recreate the explainer in each partition?
I tried passing the model to calculate_shap and building the explainer inside the partition, but I get the same error.
@QuentinAmbard
I am trying to get shap to work with a pyspark GBT classifier. I got my features as a numpy array X and then tried (as in the example):
>>> model = pyspark.ml.classification.GBTClassificationModel.load("/path/to/trained/model")
>>> explainer = shap.TreeExplainer(model)
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
>>> sv = explainer.shap_values(X)
It gave the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../lib/python3.7/site-packages/shap/explainers/tree.py", line 304, in shap_values
    assert self.model.fully_defined_weighting, "The background dataset you provided does not cover all the leaves in the model, " \
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".
I did not provide a background dataset, so I don't understand why it wants me to provide a larger one. Also, the matrix X contains my entire training dataset, so I don't understand how it could not cover all the leaves in the model. Am I doing something obviously wrong?
Then, when I tried using feature_perturbation="interventional", it gave a different error:
>>> explainer = shap.TreeExplainer(model, data=X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../lib/python3.7/site-packages/shap/explainers/tree.py", line 151, in __init__
    self.expected_value = self.model.predict(self.data).mean(0)
  File ".../lib/python3.7/site-packages/shap/explainers/tree.py", line 972, in predict
    raise NotImplementedError("Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.")
NotImplementedError: Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.
Also, if running the predictions with Spark is complicated to implement, it might be worth letting the user supply the expected predictions for validation.
You get this error when the tree was built with a leaf that has no data inside. If you get this error, I assume you are using shap on a model built with a small dataset? Can you open another issue to track the implementation of predictions with Spark, to make it work with 'interventional'?
You mean that not all paths to the leaf are the same length, so some of what would otherwise be intermediate nodes have no children (leaving those phantom child nodes empty)?
Also, I created Issue #1192 to track spark predictions.
@QuentinAmbard
Is there any news about @amandolesi's error? Any new suggestions?
Thank you.
I ran into the same problem (NotImplementedError: CategoricalSplit are not yet implemented). Did you find a solution?
I haven't had time these last weeks to find out exactly what's causing this. I suspect a reference to Spark is kept somewhere and it breaks the serialization of the TreeExplainer with a Spark model. I'll have a look when I get some time, but it shouldn't be a big deal, especially since serialization works with other models.
@QuentinAmbard
Hello Quentin, to recap and also give you some additional feedback, I ran some tests using a local standalone instance of Spark.
Since you mentioned a serialization error, I tried pickling a pyspark.ml.classification.RandomForestClassificationModel object, basically a fitted PySpark random forest, and I got a Py4J error, the same that @amandolesi reported above (see the repro sketch below).
In shap/explainers/tree.py, TreeExplainer class, line 695:

elif "pyspark.ml" in str(type(model)):
    assert_import("pyspark")
    self.original_model = model
so this serialization problem propagates. I tried commenting out "self.original_model = model" and I was then able to pickle the TreeExplainer object built on a PySpark model. Of course it is a workaround, but predictions are not implemented with PySpark yet, so commenting out that line for the time being should not be an issue. What do you think?
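For reference, a minimal repro of the pickling failure mentioned above, assuming rf_model is any fitted pyspark.ml tree model:

import pickle
import shap

# Pickling the Spark model itself fails because it wraps a Java object:
# Py4JError: Method __getstate__([]) does not exist
pickle.dumps(rf_model)

# ...so an explainer that keeps a reference to it fails the same way
explainer = shap.TreeExplainer(rf_model)
pickle.dumps(explainer)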
Thanks @guidiandrea! Absolutely, that's what I had in mind too, but I still haven't found time to make the change :/ The original_model was indeed kept in order to implement predictions (https://github.com/slundberg/shap/issues/1192), but I think we should find another way that avoids breaking serialization with Spark models. Would you like to open the PR?
Here you are: https://github.com/slundberg/shap/pull/1307
Thank you @QuentinAmbard!
@slundberg I think we can now close this issue as everything should be solved with #1313
Looks like this should be closed @allard-jeff
I am still getting the `NotImplementedError: CategoricalSplit are not yet implemented` error. I am using shap==0.39.0 and Spark 3.
I got the same error using decision trees as well.
Code:
pipeline = Pipeline(stages=[string_indexer, vector_assembler, model])
pipeline_model = pipeline.fit(train_df)
explainer = shap.explainers.Tree(pipeline_model.stages[-1])
@chengyin38 The issue is that Shap cannot handle categorical splits. So, in the PySpark pre-processing, you need to drop the metadata from the DataFrame that PySpark implicitly uses to decide that a feature is a categorical variable. It seems this can be done with df = df.rdd.toDF(). String indexing without categorical splits might not be an optimal modeling approach, however.
I will also note that, in my experience, it was necessary to remove the one-hot-encoding vectors as well: I broke them out into individual binary features. After that (along with the step above), Shap was able to run effectively on a PySpark tree model. A sketch of both steps follows.
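A minimal sketch of both steps; the column name ohe_vector and the category count n are hypothetical, and vector_to_array requires Spark 3:

from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as F

# 1. Round-trip through the RDD API: this drops the ML attribute metadata
#    that makes the tree model emit CategoricalSplit nodes
train_df = train_df.rdd.toDF()

# 2. Expand a one-hot encoded vector column into individual binary columns
train_df = train_df.withColumn("ohe_arr", vector_to_array("ohe_vector"))
train_df = train_df.select("*", *[F.col("ohe_arr")[i].alias(f"ohe_{i}") for i in range(n)])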
@QuentinAmbard I am finding that with Shap 0.39.0 this error continues with pyspark GBT. The weird issue is that this error only seems to happen when a saved GBT is loaded. If I use the original model in memory, the error does not occur. Any ideas?
assert self.model.fully_defined_weighting, "The background dataset you provided does " \
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".
Any idea how to fix this?
Model type not yet supported by TreeExplainer: <class 'sparkdl.xgboost.xgboost_core.XgboostRegressorModel'>
I am trying to run this type of code:
xgboost = XgboostRegressor(**params)
pipeline = Pipeline(stages=[stringIndexer, vecAssembler, xgboost])
pipelineModel = pipeline.fit(trainDF)
explainer = shap.TreeExplainer(pipelineModel.stages[-1])
Update:
shap.TreeExplainer(pipelineModel.stages[-1].get_booster())
does the trick!
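A short usage sketch of that trick, assuming X_pd is a hypothetical pandas DataFrame with the same columns, in the same order, as the VectorAssembler inputs:

# Unwrap the underlying xgboost Booster, which TreeExplainer supports directly
booster = pipelineModel.stages[-1].get_booster()
explainer = shap.TreeExplainer(booster)
shap_values = explainer.shap_values(X_pd)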
Any progress with this fully_defined_weighting assertion error when loading a saved GBT? @QuentinAmbard
This issue has been inactive for two years, so it's been automatically marked as 'stale'.
We value your input! If this issue is still relevant, please leave a comment below. This will remove the 'stale' label and keep it open.
If there's no activity in the next 90 days the issue will be closed.
@QuentinAmbard
I just installed Shap from PyPI (0.32.0), and running a version of your test still produces the same error, shown below. Is there something I am missing in the use of Shap with a pyspark model?