Default values for nu and tol in OneClassSVM #12249

Open albertcthomas opened 5 years ago

albertcthomas commented 5 years ago

I think the default value for nu in the OneClassSVM should be 0.1 and not 0.5. As nu roughly corresponds to the fraction of outliers, it makes more sense to set it to a lower value.
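
For illustration, here is a minimal sketch (not part of the original report; the toy data and parameter choices are made up) showing how nu translates into the fraction of training points that OneClassSVM flags as outliers:

```python
# Sketch: on clean training data, the fraction of points that OneClassSVM
# flags as outliers roughly tracks nu, so nu=0.5 rejects about half of a
# clean training set while nu=0.1 rejects about 10%.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = rng.randn(1000, 2)  # toy inlier-only data, for illustration

for nu in (0.5, 0.1):
    clf = OneClassSVM(nu=nu, gamma="scale").fit(X_train)
    outlier_frac = np.mean(clf.predict(X_train) == -1)  # predict() returns -1 for outliers
    print(f"nu={nu}: fraction of training points flagged as outliers = {outlier_frac:.2f}")
```

By the nu-property of the one-class SVM, this fraction is expected to land close to nu, which is why a 0.5 default effectively treats about half of a clean training set as outliers.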

amueller commented 5 years ago

Why 0.1, and why was 0.5 chosen in the first place? I assume we took the default from libsvm. I would rather not change it unless there's significant evidence that the current value doesn't make sense.

albertcthomas commented 5 years ago

For outlier detection I think that we always assume the proportion of outliers to be small, say 5 or 10% rather than 50%. For novelty detection, nu is the false positive rate, which we also generally want to be small. The scikit-learn examples involving the OneClassSVM use the following values for nu: 0.1, 0.1, 0.15, 0.261.

albertcthomas commented 5 years ago

The other outlier detection estimators were all using a contamination parameter with a default value of 0.1. IsolationForest and LOF now use 'auto'.
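
As a side-by-side sketch (not from the thread; the data and settings are made up), contamination plays the analogous role for IsolationForest and LocalOutlierFactor, and 'auto' can be compared against an explicit 0.1:

```python
# Sketch: contamination sets the expected fraction of outliers for the other
# detectors, much like nu does for OneClassSVM. 'auto' is now their default.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # toy data, for illustration

for contamination in ("auto", 0.1):
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    lof = LocalOutlierFactor(contamination=contamination)
    frac_iso = np.mean(iso.predict(X) == -1)       # fraction flagged by IsolationForest
    frac_lof = np.mean(lof.fit_predict(X) == -1)   # fraction flagged by LOF
    print(f"contamination={contamination}: IsolationForest {frac_iso:.2f}, LOF {frac_lof:.2f}")
```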

amueller commented 5 years ago

libsvm indeed has

-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)

Maybe they just wanted to keep it consistent between the three estimators (nu-SVC, one-class SVM, nu-SVR), which is indeed not a good reason.
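
For reference, the same default is visible on the scikit-learn side; a quick check (not from the thread):

```python
# The three nu-parameterized wrappers around libsvm all currently expose
# libsvm's nu=0.5 default.
from sklearn.svm import NuSVC, NuSVR, OneClassSVM

for est in (NuSVC(), NuSVR(), OneClassSVM()):
    print(type(est).__name__, est.nu)  # prints 0.5 for each
```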

Relevant statement from http://users.cecs.anu.edu.au/~williams/papers/P132.pdf: [screenshot of the relevant passage from the paper]

From the discussion in the paper it looks like 0.1 would indeed make more sense if we want to be consistent with the other estimators.

varunpillai commented 5 years ago

What would be the default value when OneClassSVM is used for novelty detection rather than outlier detection?

albertcthomas commented 5 years ago

For novelty detection nu corresponds to the false positive rate, hence 0.1 or 0.05 is a usual default value.
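
A minimal sketch of that interpretation (not from the thread; the synthetic data is for illustration only): train on clean data and measure the fraction of new, genuinely normal observations that get rejected:

```python
# Sketch: in the novelty-detection setting, nu acts as a target false positive
# rate -- the fraction of genuinely normal held-out points the model rejects.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(2000, 2)       # clean training data (toy)
X_new_normal = rng.randn(2000, 2)  # new observations from the same distribution

for nu in (0.05, 0.1):
    clf = OneClassSVM(nu=nu, gamma="scale").fit(X_train)
    fpr = np.mean(clf.predict(X_new_normal) == -1)  # normal points rejected as novelties
    print(f"nu={nu}: empirical false positive rate = {fpr:.3f}")
```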

varunpillai commented 5 years ago

I guess I am a bit confused by the usage of One-Class SVM for novelty detection. The documentation says "The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier." That would mean the model considers the entire dataset I provide for training as one class without outliers. However, in reality the model classifies each record as either an inlier or an outlier (+1 or -1). Is this the correct behavior, or am I doing something wrong?

albertcthomas commented 5 years ago

This is the correct behavior. The model will classify some records as being outliers even if they are in fact normal records (false positives).

glemaitre commented 2 years ago

I am leaving this issue open since it could make sense to change the default values of OneClassSVM. I closed the associated PR since we need more investigation regarding the stability of the algorithm with a smaller default value for nu.
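
One way such an investigation could start (a rough sketch, not from the thread; the synthetic data and the nu grid are made up) is to track fit time and the number of support vectors as nu shrinks, since nu also lower-bounds the fraction of support vectors:

```python
# Sketch: watch how fit time and support-vector count react as nu decreases.
import time
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(5000, 10)  # toy data, for illustration only

for nu in (0.5, 0.1, 0.05, 0.01):
    tic = time.perf_counter()
    clf = OneClassSVM(nu=nu, gamma="scale").fit(X)
    elapsed = time.perf_counter() - tic
    print(f"nu={nu}: {len(clf.support_)} support vectors, fit in {elapsed:.2f}s")
```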