scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
60.23k stars 25.43k forks source link

Sensitivity Analysis function #22453

Open tupui opened 2 years ago

tupui commented 2 years ago

Proposal

Add a Sensitivity Analysis (SA) function.

The function would compute Sobol' indices [1,2]. Consider a function f with parameters x1, x2 and x3. Hence y=f(x1,x2,x3). We are interested to know which parameter has the most impact, in terms of variance, on the value y.

I believe scikit-learn, and the wider scientific community, would greatly benefit to have such tool. I believe scikit-learn has something related with feature_importances_ in some regressors.

As an expert in the field, I would propose to develop the initial functionalities and provide long term support.

Background

Sobol’ indices are the cornerstone of SA and UQ and its application is not restricted to any field. It’s a non intrusive method which makes the only assumption that the variables are independent (this constraint can be alleviated).

Being able to compute sensitivity indices allows to reduce the dimensionality of a problem, better understand the importance of each factors and also see how parameters are interacting with each other. As such, it’s an important engineering tool. If you have 2 or more variables and only have the budget to improve your knowledge on one of them, this can help to make an informed choice. There are a lot of successful usage of SA in the literature and in real world applications. The EU (through the JRC), is now requiring to conduce uncertainty analysis when evaluating a system. They are recommending the use of Sobol’ indices.

Example

Let’s take an example with the Ishigami function:

y = sin(x1) + 7*sin(x2)^2 + 0.1*x3^4*sin(x1)

It’s not obvious to tell which variable would impact more y.

The Sobol’ indices are bounded from 0 to 1, with 1 meaning more important. Here they would be:

Variable First order Sobol’ Total Sobol’
x1 0.31 0.56
x2 0.41 0.44
x3 0.00 0.24

The difference between the first and total indices indicate an interaction between variables. The total indices allow to rank the variable by importance. x1 is the most important. Looking at the first orders, x3 by itself does not have an impact on the variance of the output. It’s its combination with another variable which makes it have a total impact of 0.24. x1 in this case also have a difference in first and total indices while x2 is the same. We can say that x1 and x2 have a second order interaction.

Computing the indices requires a large sample size, to alleviate this constraint, a common approach is to construct a surrogate model with Gaussian Process or Polynomial Chaos (to name the most used strategies). This allows to conduct a SA with a low computational budget (we see lots of engineering applications with expensive HPC codes taking advantage of this strategy).

Implementation

Here is an example implementation I did a while back:

https://gist.github.com/tupui/09f065d6afc923d4c2f5d6d430e11696

Note that this implementation lack a few things such as higher order indices, other methods, input validation, doc, tests, other wrapping, etc. This is just to show how it works.

Alternatives

Practitioners might be more familiarized with gradient based technics. Keywords include: gradient, adjoint. These methods are local sensitivity analysis methods. And as such they are only capturing the sensitivity of the model in specific areas of the parameter space.

Sobol’ indices are variance based indices. There are other indices using higher moments, namely: moment independant based sensitivity analysis. These methods are very attractive and provide lot of information while being simple to compute/analyse. They have been less studied but there is an increasing interest in the community. I believe this would make a good second step.

References

Upon request, I can provide more information. Following are some fundational references–first 2 cited a few thousand times.

For some examples of industrial applications (Part. 3), a visual explanation of some methods (Chap. 2.2) or more bibliography, you can also see my thesis:

https://github.com/tupui/PHD-Thesis/blob/master/phd_thesis_pamphile_roy.pdf

lorentzenchr commented 2 years ago

@tupui Thanks for proposing this functionality. I would see it as part of the inspection module, another way of looking at feature importance.

I‘d be interested in the connection to other feature importance measures and whether it is possible to generalize it from "variance explained" to "metric/score explained", e.g., consider explaining a classifier.

tupui commented 2 years ago

Thank you for having a look @lorentzenchr

@tupui Thanks for proposing this functionality. I would see it as part of the inspection module, another way of looking at feature importance.

That sounds sensible.

I‘d be interested in the connection to other feature importance measures and whether it is possible to generalize it from "variance explained" to "metric/score explained", e.g., consider explaining a classifier.

I am not sure how to make the bridge. Although there is a direct link with sklearn.metrics.r2_score. If you make a model and you get a R2 (Q2 in my field) of 0.8, then your model explains 80% of the variance. If on top you compute Sobol' indices and it says one variable is responsible for 50% of the variance. It in fact can be said that this variable explains up to 50% of 80% of the model's variance.

tupui commented 2 years ago

@lorentzenchr I was wondering about the status here.

Also, I was thinking about your question a bit more and found this in the process https://hal.archives-ouvertes.fr/hal-03151611

Correct me if I am wrong, but feature importance is computed by looking at the impact a parameter has on the output metric of a model (improving R2 or a classification metric). Then we can say that a given parameter has more impact than another, but just for this modelling. This way a modeller can focus on a given parameter while tuning a model, etc. But choosing another kind of model might lead to a different hierarchy/importance since the inner working would be different. In practice, maybe it does not matter much.

On the other hand, sensitivity analysis does not care about modelling an only take into account the outcome of a system-or model in this case. It would not matter which type of model is used. This metric is useful to inform user, policy makers, etc. to know which parameter is important and they might want to focus their attention on.

TL;DR to me it's a modeller vs user of the model difference. Both are complementary, the modeller seek to improve its model focusing on some parameters, while the user want to understand which parameter impact the system itself.

lorentzenchr commented 2 years ago

I was wondering about the status here.

I think it needs a bit more discussion, in particular from domain experts. @glemaitre @thomasjpfan @GaelVaroquaux @adrinjalali @ogrisel may have a view w.r.t to the inspection module.

A short notebook with an example would help me a lot in understanding.

Note that there are different kinds of feature importances: Those measuring the drop in some performance/accuracy measure like R2. And those computing feature attribution to the predicted value(s) like SHAP. Global methods do it for a data sample, local ones for single observations. IIUC, sensitivity analysis can be viewed as a global method measuring drop in R2.

adrinjalali commented 2 years ago

The only thing I'm wary of, is that it assumes features are independent, and they pretty much never are. So in reality, this is only useful if we do alleviate those issues.

On the other hand, this can be said about other inspection tools we have I think.

This would make sense, if we could also give users a nice easy guide on how to choose the way they should be inspecting their model.

tupui commented 2 years ago

IIUC, sensitivity analysis can be viewed as a global method measuring drop in R2.

Considering y = f(x1, x2), sensitivity analysis is looking at whether x1 or x2 matters by looking at var(y|x_i)/var(y).

Which is different than measuring a variation on R2. Depending on your model, one parameter could matter more for R2 than it actually matter for var(f).

The only thing I'm wary of, is that it assumes features are independent, and they pretty much never are. So in reality, this is only useful if we do alleviate those issues.

It depends what you are talking about. Sobol' indices do capture parameters interactions. But it assumes that parameters are independently draws/distributed. So x1 varies in function of x2.

There are ways to deal with parameter dependence with Sobol'. But other methods might be more appropriate/easy. E.g. Shapley values or moment independent methods.

In my experience though, parameter independence does not represent the majority of cases.

This would make sense, if we could also give users a nice easy guide on how to choose the way they should be inspecting their model.

100% on more tutorials/doc on feature importance/sensitivity analysis.

tupui commented 1 year ago

I will close the issue. Meanwhile, I got this into SciPy: https://scipy.github.io/devdocs/reference/generated/scipy.stats.sobol_indices.html

I am bit sad no decision was made in such a long time (I am not blaming anyone here, we have the same issue in SciPy I am afraid). Thanks for the discussions though 😃

glemaitre commented 1 year ago

I would still let the issue open since the discussion happen. I personally did not get time to dedicate to catching up with the topic.

We usually have predefined conditions for inclusion: a couple of hundred citations and a 5+ years old research. Up to now, I don't have the impression that it was used mainstream so we would need to ease the inclusion criterion and thus be sure of what are the limitations of the method.

I will start to have a look at the SobolMDA for trees since it alleviates one of the caveats of the MDA.

API-wise, we are facing some legacy choices and we should probably go back to https://github.com/scikit-learn/scikit-learn/issues/9606. It could ease the inclusion of new "feature importance" methods.

tupui commented 1 year ago

Thanks @glemaitre

SobolMDA is a specific application of Sobol' to random forest (AFAIU, never looked into it). To be clear, here I was proposing the general Sobol' method.

For Sobol' indices, we are way passed the inclusion criteria. Books like Global sensitivity analysis: the primer have thousands of citations (this one mostly talk about Sobol' indices). And the original paper on Sobol' indices also has a few thousands citations.

You also have: international conferences/groups like SAMO, MASCOT-NUM, MCM ; Journals like IJUQ ; summer schools (SAMO) ; several labs groups ; a whole JRC group. The common denominator of all that is global sensitivity analysis and the most used method is Sobol' indices.

Real life example are all over the place. E.g. at EDF for nuclear safety, dam analysis or even flood modelling (LNHE, Schapi, ...). The JRC group I mentioned above also used this a lot to drive policies (CO2MPAS was improved with Sobol' indices).

On tools, you have multiple options: SALib , OpenTURNS (from EDF and used quite a bit in France), Dakota, uncertainpy, MUQ, tons of R packages (e.g. sensobol), UQLab (Matlab framework). And now SciPy.

This article is a good read about the state of SA https://doi.org/10.1016/j.envsoft.2020.104954

lorentzenchr commented 1 year ago

A short notebook with an example would help me a lot in understanding.

This would still be a first step in my opinion and should not be hard with sobol indices now in scipy. A simple dataset with a simple ML model „sensitivity analysed“ (or extending an existing example).

lorentzenchr commented 1 year ago

@tupui It is really not a lack of interest from my side, just limited knowledge and time.

tupui commented 1 year ago

@lorentzenchr no worries.

I am planning to do a tutorial at some point for SciPy. Meanwhile, I can suggest having a look at existing material such as:

https://salib.readthedocs.io/en/latest/ http://openturns.github.io/openturns/latest/auto_reliability_sensitivity/sensitivity_analysis/plot_sensitivity_sobol.html

And also the example in SciPy's doc.

If you have more questions, I would gladly try to answer them.