maciejkula / rustlearn

Machine learning crate for Rust
Apache License 2.0

Support conformal prediction #13

Closed: UserAB1236872 closed this issue 8 years ago

UserAB1236872 commented 8 years ago

My grad work deals a lot with conformal prediction, and there are few, if any, open source libraries that deal with it. I've been doing some work on it in Rust and, if it's okay, I'd like to add code to get some conformal prediction implementations out in the wild.

I realize there are more pressing things to consider, and much more widely used things that you probably want to support first, but I'd like to put it out there.

Some of the library would need to be restructured to support it, unfortunately, but it's not too intrusive. For a classifier to work with CP, it needs to be able to provide a "conformal score", which could be part of a trait. While theoretically any classifier can be used with CP, the scores are classifier-specific. Logistic regression is fine, and just requires the weights.

LibSVM isn't sufficient to implement CP over an SVM since you need to gather the support vector coefficients from a solved QP problem, but it's not a big deal if not all classifiers we have support it.
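
The trait idea could look something like the sketch below. This is purely illustrative: the trait and method names are hypothetical, not rustlearn API, and the toy model just hand-rolls the logistic-regression score 1 + exp(∓w·x) mentioned above.

```rust
// Hypothetical sketch only: the trait name and method are illustrative,
// not part of the rustlearn API.
pub trait NonconformityScorer {
    /// Scores how poorly a labelled example conforms to the fitted model;
    /// larger scores mean "stranger".
    fn conformal_score(&self, x: &[f64], y: u8) -> f64;
}

/// Toy fitted logistic regression: the score is derived from the
/// linear predictor w.x.
struct LogisticModel {
    weights: Vec<f64>,
}

impl NonconformityScorer for LogisticModel {
    fn conformal_score(&self, x: &[f64], y: u8) -> f64 {
        let dot: f64 = self.weights.iter().zip(x).map(|(w, xi)| w * xi).sum();
        // 1 + exp(-w.x) for the positive class, 1 + exp(w.x) otherwise
        if y == 1 {
            1.0 + (-dot).exp()
        } else {
            1.0 + dot.exp()
        }
    }
}

fn main() {
    let model = LogisticModel { weights: vec![1.0, -2.0] };
    // For x = [1, 0], w.x = 1: the positive label conforms better here,
    // so it gets the smaller score.
    let s_pos = model.conformal_score(&[1.0, 0.0], 1);
    let s_neg = model.conformal_score(&[1.0, 0.0], 0);
    assert!(s_pos < s_neg);
    println!("{:.4} {:.4}", s_pos, s_neg); // 1.3679 3.7183
}
```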

maciejkula commented 8 years ago

I must admit I have never heard about conformal prediction before. Could you sketch out how an implementation for logistic regression would work, and why it works?

In principle there is no reason why this couldn't be implemented as an additional trait, but I want to understand how this works first (and why it is useful).

UserAB1236872 commented 8 years ago

I recommend skimming the paper I linked, but Conformal Prediction is a region predictor. You can implement a conformal predictor over pretty much any classifier. Instead of returning a single label, conformal predictors are given a threshold and return a set of labels. The threshold is the P-value the label has to meet to be in the set.

For instance, a 95% (.05 threshold) conformal predictor will return a set such that there is a 95% chance that the true label is in the returned set. Obviously a "safe" conformal predictor can always trivially meet its criterion by returning every label. Generally speaking, the better the underlying classifier and model, the fewer "just to be safe" labels the returned set includes. This makes for safer, more conservative prediction than traditional classifiers, at the cost that you don't know the true label for sure, just that it's one of the labels in the set.

I'd likely implement a Cross-Conformal predictor because it has a good accuracy/speed tradeoff, but I'll describe a traditional one as an example, just because it's easier:

Given some training set of points [(data1, label1), ...] and some new point data_new, you can calculate its conformal set with logistic regression like this (pardon the half-Rust, half-pseudocode):

let mut labels = Vec::new();

for label in [0, 1] { // logistic regression only works with CP in binary cases
    // add (data_new, label) to training_data
    let mut alphas = Vec::new();

    for (data, label) in training_data {
        // withhold (data, label) from training_data
        // train a logistic regression classifier on the remaining points
        let exponent = classifier.weights().dot(data);
        let alpha = 1.0 + if label == 1 { (-exponent).exp() } else { exponent.exp() };
        alphas.push(alpha);
    }

    // calculate the p-value of (data_new, label): the fraction of alphas
    // at least as large as the new example's alpha (which comes last)
    let alpha_n = alphas[alphas.len() - 1];
    let at_least = alphas.iter().filter(|&&alpha| alpha >= alpha_n).count();
    let p_val = at_least as f64 / alphas.len() as f64;

    if p_val >= threshold {
        labels.push(label);
    }
}

labels
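
The p-value step in isolation can be written as runnable Rust. A small sketch, assuming the alpha scores have already been computed and the new example's alpha is the last element:

```rust
/// Fraction of nonconformity scores at least as large as the new
/// example's score, which is assumed to be the last element of `alphas`.
fn p_value(alphas: &[f64]) -> f64 {
    let alpha_n = alphas[alphas.len() - 1];
    let at_least = alphas.iter().filter(|&&a| a >= alpha_n).count();
    at_least as f64 / alphas.len() as f64
}

fn main() {
    // The new example's alpha (2.0) is larger than most calibration
    // alphas, so it is fairly "strange": only 2 of 5 scores are >= 2.0.
    let alphas = [1.0, 1.5, 3.0, 1.2, 2.0];
    let p = p_value(&alphas);
    println!("{}", p); // 0.4
    assert!((p - 0.4).abs() < 1e-12);
}
```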
maciejkula commented 8 years ago

So in the example above alpha_n denotes the alpha of the newly added example? And you maintain a distribution of the alpha scores for each class, and then if alpha_n is an unlikely draw from that distribution you add it to a prediction set?

You seem to train the classifier in the inner loop of the algorithm above, is that necessary?

UserAB1236872 commented 8 years ago

You actually have to retrain the classifier every time because the new example is part of the training set for the classifier. For every data point in your "real training set", you withhold it, train the classifier, and calculate its alpha score.

Yes, this is extremely slow. There are better methods: Inductive Conformal Prediction (fast, but less accurate) and Cross-Conformal Prediction (better precision than ICP, faster than CP, overall the best). Vanilla CP was just the easiest to describe. I'd probably implement Cross-Conformal Prediction.

maciejkula commented 8 years ago

OK, so this is my current understanding.

For training:

  1. Train the model and calibrate it using k-fold splits for the train/calibration sets.
  2. From each calibration split, maintain the distribution of prediction errors (the alphas).
  3. Combine the quantiles of these distributions by averaging them.

For prediction:

  1. Compute the prediction error (alpha) of the new example (from a model trained on all the data?) with respect to a given label y.
  2. Compare this error with the known calibration quantiles.
  3. If the error is lower than the epsilon quantile of the calibrated error distribution, return class y.

Does that sound like a reasonable understanding? If yes, this sounds good to me! We could even make the number of folds used for calibration a hyperparameter, so that the model would reduce to inductive conformal prediction for k_folds = 1.
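
The fold-based calibration described above can be sketched in self-contained Rust. To keep it runnable, the sketch swaps the real classifier for a stand-in "model" (the mean of the training folds' points, with absolute distance as the nonconformity score); every name here is illustrative, not rustlearn API, and the +1 smoothing is the usual convention of counting the new example itself:

```rust
// Stand-in nonconformity score: distance from the fold's fitted "model".
fn nonconformity(center: f64, x: f64) -> f64 {
    (x - center).abs()
}

/// Cross-conformal p-value of `x_new`: for each of k folds, fit on the
/// other folds, score the held-out fold, and pool the counts of
/// calibration alphas at least as large as the new example's alpha.
fn cross_conformal_p(data: &[f64], k: usize, x_new: f64) -> f64 {
    let n = data.len();
    let fold_size = (n + k - 1) / k;
    let (mut at_least, mut total) = (0usize, 0usize);
    for f in 0..k {
        let lo = f * fold_size;
        let hi = ((f + 1) * fold_size).min(n);
        if lo >= hi {
            continue;
        }
        // "Train" on everything outside the fold: here just the mean.
        let train: Vec<f64> = data[..lo].iter().chain(data[hi..].iter()).copied().collect();
        let center = train.iter().sum::<f64>() / train.len() as f64;
        let alpha_new = nonconformity(center, x_new);
        for &x in &data[lo..hi] {
            total += 1;
            if nonconformity(center, x) >= alpha_new {
                at_least += 1;
            }
        }
    }
    // +1 counts the new example itself among the scores.
    (at_least + 1) as f64 / (total + 1) as f64
}

fn main() {
    let data = vec![0.9, 1.1, 1.0, 0.95, 1.05, 1.2, 0.8, 1.0];
    // A typical point should get a much higher p-value than an outlier,
    // so a thresholded label set would keep the former and drop the latter.
    let p_typical = cross_conformal_p(&data, 4, 1.0);
    let p_outlier = cross_conformal_p(&data, 4, 5.0);
    assert!(p_typical > p_outlier);
    println!("{:.3} {:.3}", p_typical, p_outlier);
}
```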

Do you think this could be extended to multiclass classifiers using the one-vs-rest scheme?

(The above is based mostly on http://arxiv.org/pdf/1208.0806.pdf)

maciejkula commented 8 years ago

Are you still keen to add this?

UserAB1236872 commented 8 years ago

Yes, sorry. I've been busy, but I'll get to work soon.


maciejkula commented 8 years ago

Great, looking forward to it!

UserAB1236872 commented 8 years ago

Is there a reason why outputs/labels generally use Array? I got started working, and it may make more sense to just constrain outputs to be Copy+PartialEq, which would allow integral, bool etc labels as well.
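
For illustration, a hypothetical generic bound along those lines (not the current rustlearn signatures): constraining labels to Copy + PartialEq is enough to admit integral or bool labels in functions that only compare them.

```rust
// Hypothetical sketch: a label-generic helper constrained to
// Copy + PartialEq, so integral, bool, etc. labels all work.
fn count_distinct<L: Copy + PartialEq>(labels: &[L]) -> usize {
    let mut seen: Vec<L> = Vec::new();
    for &l in labels {
        // PartialEq is all we need to test membership.
        if !seen.iter().any(|&s| s == l) {
            seen.push(l);
        }
    }
    seen.len()
}

fn main() {
    assert_eq!(count_distinct(&[true, false, true]), 2);
    assert_eq!(count_distinct(&[1u8, 2, 2, 3]), 3);
    println!("ok");
}
```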

maciejkula commented 8 years ago

I think the long-term plan is to have Array<T> which would at the very least allow integral outputs. I'm hoping that will be accomplished by integrating https://github.com/bluss/rust-ndarray.