ozika opened this issue 1 year ago
Hi, thanks for getting in contact and offering help.
This is a good reference for ICE (and other methods): https://christophm.github.io/interpretable-ml-book/ice.html
`plot_dependence` has a `var_idx` argument that you can use to exclude variables by index. We may extend it to work with variable names instead.
We can use ArviZ and the returned InferenceData in general (e.g. for sampling diagnostics), but for the PDP and ICE plots we need to compute new predictions, and for that we need the fitted trees, which are stored in the BART variable and not in the InferenceData.
Thanks for your response!
I am not sure the `var_idx` filtering is what I mean. For example, one might want to plot temperature impact estimates separately for working days and weekends (to explore the `bike_rentals ~ temperature*workingday` interaction). Same with the example in the Interpretable ML book: it just shows us posterior samples of hypothetical individuals, but we don't know the properties of those individuals, which, in my mind, is what one often wants to know.
I will have a go at using ArviZ and InferenceData and post here.
In ArviZ, if you use `var_names` or `filter_vars` you are just selecting subsets of variables; the results still depend on all variables. So for a model like `y ~ a+b+c`, if you select `a` you are just omitting `b` and `c` from the display.
For PDP we plot `y` vs `a` by averaging over the effects of all the other variables, in this case `b` and `c`. ICE is similar, but we keep individual observations.
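The distinction above can be sketched in plain NumPy. This is a minimal illustration, not the PyMC-BART implementation: `predict` is a toy stand-in for a fitted model's (posterior-mean) prediction function over variables `(a, b, c)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a fitted model's prediction function y = f(a, b, c).
# In practice this would be posterior predictions from the fitted trees.
def predict(X):
    a, b, c = X[:, 0], X[:, 1], X[:, 2]
    return 2.0 * a + 0.5 * b - 1.0 * c

X = rng.normal(size=(200, 3))        # observed data, columns (a, b, c)
a_grid = np.linspace(-2, 2, 21)      # grid of values for the target variable

# ICE: for each observation, sweep `a` over the grid while keeping
# that observation's (b, c) fixed. One curve per observation.
ice = np.empty((X.shape[0], a_grid.size))
for i, a_val in enumerate(a_grid):
    X_mod = X.copy()
    X_mod[:, 0] = a_val
    ice[:, i] = predict(X_mod)

# PDP: the average of the ICE curves over observations.
pdp = ice.mean(axis=0)
```

Because the toy `predict` is linear with slope 2 in `a`, the resulting PDP rises by exactly 2 per unit of `a`, while the individual ICE curves are vertically shifted copies of it.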
I think what you want is to fit `y ~ a+b+c` but then approximate `y* ~ a+b`, i.e. as if we had never used `c` as part of the model. One possible approximation for that is to prune the trees, removing the branches that include variable `c`. We currently do this to estimate variable importance. But maybe we can extend `plot_dependence` to exclude variables, although this will need some empirical testing on real examples before we can trust that it gives reasonable results.
I think variable selection would be useful to determine the predictive power of a variable as a whole; what I am after is examining how predictors influence outcomes. In this example it would be `y* ~ a+b+(c==0)` vs `y* ~ a+b+(c==1)`.
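One way to read `y* ~ a+b+(c==0)` vs `y* ~ a+b+(c==1)` is as a *conditional* PDP: average the ICE curves only over observations in each subgroup of `c`. A minimal NumPy sketch (again with a toy `predict` standing in for a fitted model, here deliberately given an `a*c` interaction so the two curves differ):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy prediction function with an a*c interaction (c is binary):
# the slope of `a` is 1 when c == 0 and 3 when c == 1.
def predict(X):
    a, b, c = X[:, 0], X[:, 1], X[:, 2]
    return a + 0.3 * b + 2.0 * a * c

n = 300
X = np.column_stack([
    rng.normal(size=n),            # a
    rng.normal(size=n),            # b
    rng.integers(0, 2, size=n),    # c in {0, 1}
]).astype(float)

a_grid = np.linspace(-1, 1, 11)

def conditional_pdp(X, c_value):
    """PDP of `a`, averaging only over observations with c == c_value."""
    subset = X[X[:, 2] == c_value]
    curve = np.empty(a_grid.size)
    for i, a_val in enumerate(a_grid):
        X_mod = subset.copy()
        X_mod[:, 0] = a_val
        curve[i] = predict(X_mod).mean()
    return curve

pdp_c0 = conditional_pdp(X, 0.0)   # effect of a when c == 0
pdp_c1 = conditional_pdp(X, 1.0)   # effect of a when c == 1
```

Plotting `pdp_c0` and `pdp_c1` together would show the interaction directly: different slopes of `a` in the two groups, which a single pooled PDP would average away.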
I am realizing that the `plot_posterior` example was wrong: it only filters variables. It's been a while since I used PyMC. I think a better example would be this.
These are a few approaches to estimating how predictors influence outcomes using PyMC-BART:
Fitting only one model:
Fitting more than one model:
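The "more than one model" idea could be sketched as follows. This is only an illustration of the approach, using ordinary least squares as a cheap stand-in for BART: fit separate models on the `c == 0` and `c == 1` subsets and compare the estimated effect of `a` across them.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate data where the effect of `a` differs by group c.
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.integers(0, 2, size=n)
y = a + 2.0 * a * c + 0.5 * b + rng.normal(scale=0.1, size=n)

def fit_slope_of_a(mask):
    """Fit y ~ a + b + intercept on one subgroup; return the slope of `a`."""
    A = np.column_stack([a[mask], b[mask], np.ones(mask.sum())])
    coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    return coef[0]

slope_c0 = fit_slope_of_a(c == 0)   # ≈ 1
slope_c1 = fit_slope_of_a(c == 1)   # ≈ 3
```

With BART in place of the linear fit, the same logic applies: fit one model per subgroup and compare the dependence plots, at the cost of each model seeing less data.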
Another potential approach could be something like https://github.com/yannmclatchie/kulprit, but we don't have a theory for BART for that (variable importance is only loosely inspired by it).
Thank you for your response (and patience!).
PDP/ICE: PDP gives reasonable results when there is little interaction between variables. Why do you consider this insufficient for your problem?
Correct me if I am not getting it right, but I think it's because (at least in my field) one often wants to understand the interactions, not select the variables. Following the example in the tutorial, I plotted it using the ICE method.
Focusing on humidity, one can see that there is some variability (tree paths); however, it's not clear whether this is caused by the influence of `hour`, `temperature`, or `workingday`. Or is this a step that I am getting wrong?
Thank you for pointing out the kulprit package! I have actually been looking for exactly that in Python for a while :)
Focusing on humidity, we can see that the pattern is essentially the same for all instances; it's just shifted up or down from the mean. This shows that there are no interactions (or at least we cannot detect them). In other words, no matter at which values we fix the rest of the variables, the effect of humidity on bike rentals seems to be the same: flat (or a slightly negative slope) at the beginning, followed by a slightly steeper (and still negative) slope for humidity above ~0.6.
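The "same pattern, just shifted up or down" reading can be checked numerically: center each ICE curve on its own mean, and if the centered curves coincide, the model is additive in that variable (no detectable interaction). A small sketch with a toy additive `predict` function (a stand-in for model predictions, not PyMC-BART code):

```python
import numpy as np

rng = np.random.default_rng(3)

# Additive toy model: humidity enters without interactions, so the
# ICE curves for humidity should be parallel (identical once centered).
def predict(X):
    humidity, hour = X[:, 0], X[:, 1]
    return -1.5 * np.maximum(humidity - 0.6, 0.0) + np.sin(hour)

X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 24, 100)])
grid = np.linspace(0, 1, 25)

ice = np.empty((X.shape[0], grid.size))
for i, h in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = h
    ice[:, i] = predict(X_mod)

# Center each curve; the residual spread across curves at each grid
# point measures how much the shape (not just the level) varies,
# i.e. how strong the interactions are.
centered = ice - ice.mean(axis=1, keepdims=True)
interaction_spread = centered.std(axis=0).max()
```

For this additive model `interaction_spread` is zero up to floating-point error; with a real interaction it would be clearly positive.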
Understanding interactions is very relevant for us too. We have some ideas for making it more straightforward for users to do that, but unfortunately, we are still in the early development stage and we also need to test those ideas. Let me know if you are interested in testing those ideas on your own datasets and I will contact you when we have something ready.
Thanks!
Sure, I'd definitely be happy to try things on some of my datasets :)
Hi! Thanks for creating this great package :)
I think one important aspect of understanding models is the ability to explore conditional posteriors. In the tutorial you mention the `kind="ice"` option; however, it is unclear how this can be used to systematically understand the model posterior. In ArviZ I would, for example, use `plot_posterior()` with the `filter_vars` argument to explore interactions. Is there a similar way in `pmb.plot_dependence()`? Or can one easily use ArviZ with the estimated `InferenceData` object? I think this would be a very handy addition to the documentation. Once I understand it, I'd be happy to write an example.