rstudio / vetiver-r

Version, share, deploy, and monitor models
https://rstudio.github.io/vetiver-r/

Support for out-of-distribution detection with one class SVMs #222

Open aminadibi opened 1 year ago

aminadibi commented 1 year ago

One use case for vetiver would be monitoring whether new data that the model encounters in the real world belongs to the same distribution as the data seen during model training, or whether there is covariate shift over time. I wonder if this would be a useful addition to vetiver.

We could potentially use one-class support vector machines for novelty detection, as proposed by Schölkopf et al. The idea is to define a frontier around the training points in the p-dimensional feature space and check whether new observations fall inside or outside this frontier. Schölkopf et al. suggest a one-class SVM to find the tightest hypersphere around the data. Probabilistic outputs from one-class SVMs can quantify the probability that newly observed data belongs to the distribution of the training data.

In terms of implementation, LIBSVM provides the go-to implementation of one-class SVMs, and since version 3.31 (Feb. 2023) supports probabilistic outputs for one-class SVMs.

For vetiver-python, we could potentially use libsvm-official, which supports one-class probabilistic outputs. There is also sklearn.svm.OneClassSVM which seems to be a separate implementation inspired by libsvm, but I am not sure if it supports probabilistic outcomes.
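As a rough illustration of the frontier idea on the Python side, here is a minimal sketch using sklearn.svm.OneClassSVM. The synthetic Gaussian training data, the two test points, and the `nu` value are assumptions chosen for illustration only. Note that `decision_function` returns a signed distance to the frontier rather than a probability, and the estimator does not expose a `predict_proba` method.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for training features: 500 points from a standard 2-D Gaussian.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# nu upper-bounds the fraction of training points treated as outliers (illustrative value).
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# New observations: one near the training mass, one far away from it.
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])

labels = detector.predict(X_new)            # +1 = inside the frontier, -1 = outside
scores = detector.decision_function(X_new)  # signed distance to the frontier; negative = outside

for point, label, score in zip(X_new, labels, scores):
    status = "in-distribution" if label == 1 else "out-of-distribution"
    print(point, status, round(float(score), 3))
```

The signed score can be tracked over time as a drift signal, but it would need a separate calibration step before being read as a probability.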

For vetiver-r, the e1071 package provides an interface to LIBSVM but currently supports LIBSVM 3.23, which does not include probabilistic outputs for one-class SVMs. Another possibility is to use kernlab with type="one-svc", but to the best of my knowledge, kernlab does not produce probabilistic outputs, and prediction with a one-class SVM via kernlab and tidymodels is currently buggy: https://github.com/tidymodels/parsnip/issues/974

What do you think @juliasilge?

juliasilge commented 1 year ago

This is pretty similar to the type of question that the applicable package addresses, and is definitely something that we want to show how to integrate with deploying a model. I noted the need to document this in #68, and that should probably move up the priority list sometime soon.

I think it's likely that we won't need to implement new features per se, but rather to show how to use existing functionality. For example, we can add a new article to "Learn more" at https://vetiver.rstudio.com/, showing how to add a new endpoint to a vetiver model API, returning a score from a one-class SVM or applicability model.
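To make the extra-endpoint idea concrete, here is a hedged sketch in Python of what such an endpoint's handler might compute. The names `make_ood_handler` and `handler`, the response fields, and the synthetic training data are all hypothetical. Registering the handler with a vetiver API is deliberately left out, since the thread does not specify how that would be done.

```python
import json

import numpy as np
from sklearn.svm import OneClassSVM


def make_ood_handler(X_train, nu=0.05):
    """Fit a one-class SVM on training features and return a scoring function.

    Hypothetical helper: the returned function is what a custom endpoint could call.
    """
    detector = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_train)

    def handler(records):
        # records: list of feature rows, e.g. parsed from a JSON request body.
        X = np.asarray(records, dtype=float)
        scores = detector.decision_function(X)
        # Plain floats and bools keep the response JSON-serializable.
        return [
            {"score": float(s), "in_distribution": bool(s >= 0)}
            for s in scores
        ]

    return handler


# Illustrative usage with synthetic training data (an assumption, not real model data).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 2))
handler = make_ood_handler(X_train)
response = handler([[0.0, 0.0], [9.0, 9.0]])
print(json.dumps(response, indent=2))
```

Keeping the detector separate from the deployed model means the prediction endpoint is unchanged, and clients can opt in to the out-of-distribution check by calling the second endpoint.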