tmadl / semisup-learn

Semi-supervised learning frameworks for python, which allow fitting scikit-learn classifiers to partially labeled data
MIT License

Different score after running the example #1

Closed ouceduxzk closed 8 years ago

ouceduxzk commented 8 years ago

This is what I got after running the heart dataset example; it seems the semi-supervised learning does not help. I did not edit anything, so I'm not sure where this inconsistency comes from.

supervised log.reg. score 0.555555555556
self-learning log.reg. score 0.377777777778
..n....n....n.n max_iter exceeded.
CPLE semi-supervised log.reg. score 0.555555555556
..nn..n...n..n. max_iter exceeded.
CPLE semi-supervised RBF SVM score 0.555555555556

tmadl commented 8 years ago

Hi,

Thanks for pointing this out. "heart" is actually not the best dataset for demonstration. There are also still issues with my optimization procedure (I can't always find good minima).

scikit-learn's logistic regression starts with random initial parameters and then looks for a local minimum using newton-cg/lbfgs. Some random initial parameters work much better than others (they find better minima), leading to higher accuracies.
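
The effect described above can be illustrated with a toy example (not this repo's code): a deliberately non-convex one-dimensional objective where the local minimum a gradient-based optimizer finds depends entirely on the random starting point.

```python
import numpy as np
from scipy.optimize import minimize

def objective(w):
    # Two basins: a local minimum near w = 3 and a (tilted, deeper) one near w = -2
    return (w[0] - 3) ** 2 * (w[0] + 2) ** 2 + 0.5 * w[0]

rng = np.random.RandomState(0)
minima = set()
for _ in range(10):
    w0 = rng.uniform(-5, 5, size=1)        # random initialization
    res = minimize(objective, w0, method="L-BFGS-B")
    minima.add(round(float(res.x[0]), 1))  # round to group nearby solutions

print(sorted(minima))  # two distinct local minima, near -2 and 3
```

Which minimum is reached depends only on which basin the random start falls into; the same mechanism makes a classifier's accuracy vary run-to-run when its objective is non-convex.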

Unfortunately, the optimizer (DIRECT) I am currently using for the semi-supervised version (CPLE) is not guaranteed to find a global minimum of its objective function either. For this reason, it sometimes performs worse than the supervised version (whenever its random initialization gets it stuck in a bad minimum while logistic regression got lucky).

I am looking into this, and plan to improve the optimization once I have time to do so. It is a tricky issue, as the landscape of the objective function is very rough and far from convex. But the theoretical guarantee of CPLE (accuracy no worse than the supervised alternative) only holds if the actual global minimum is found. If it isn't, the supervised solution can unfortunately be better than the semi-supervised one.

For now, the thing to try in order to increase the accuracy of the semi-supervised result is to increase the "max_iter" parameter beyond its default of 3000, or to run CPLE multiple times with multiple initializations.
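
The multiple-initializations suggestion is a generic "best of N restarts" pattern; a sketch is below. The repo's CPLE model could be dropped into the loop; an `MLPClassifier` stands in here purely because it also depends on random initialization, so the code is self-contained and runnable with scikit-learn alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

best_score, best_model = -np.inf, None
for seed in range(5):                      # 5 random restarts
    model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                          random_state=seed)
    score = cross_val_score(model, X, y, cv=3).mean()  # evaluate this restart
    if score > best_score:
        best_score, best_model = score, model

best_model.fit(X, y)                       # refit the best restart on all data
print(round(best_score, 3))
```

Keeping the restart with the best cross-validated score (rather than the best training objective) also guards against restarts that overfit a bad minimum.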

(It's also important to choose an appropriate base model for the kind of data one has. For example, if the dataset can be approximated by normals (e.g. by applying Mardia's test of multivariate normality), then WQDA often beats other base classifiers, as it is based on normal distributions.)
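
A minimal sketch of the normality check mentioned above, implementing Mardia's multivariate *skewness* test (one of Mardia's two statistics) from its textbook formula; this is an illustration written for this note, not code from the repo:

```python
import numpy as np
from scipy.stats import chi2

def mardia_skewness_pvalue(X):
    """P-value of Mardia's skewness test; small values reject normality."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
    D = Xc @ S_inv @ Xc.T            # pairwise Mahalanobis inner products
    b1p = (D ** 3).sum() / n ** 2    # Mardia's skewness statistic
    stat = n * b1p / 6.0             # ~ chi2 with p(p+1)(p+2)/6 df under H0
    df = p * (p + 1) * (p + 2) / 6.0
    return chi2.sf(stat, df)

rng = np.random.RandomState(0)
gaussian = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500)
skewed = np.exp(gaussian)            # log-normal sample: clearly non-normal

# The log-normal sample should yield a p-value near zero (normality rejected);
# the Gaussian sample should typically not be rejected.
print("gaussian p-value:  ", mardia_skewness_pvalue(gaussian))
print("log-normal p-value:", mardia_skewness_pvalue(skewed))
```

If the test does not reject normality, a normality-based base classifier such as WQDA is a reasonable choice; otherwise a less restrictive base model may be safer.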

I've changed the demo to a more appropriate dataset for now (the Lung Cancer Ontario dataset). Unlike the heart dataset, it doesn't have a clear low-density region around a linear decision boundary, which decreases the likelihood of a simple method such as logistic regression performing well with a lucky initialization. I am still looking into finding better minima for the CPLE objective function.

tmadl commented 8 years ago

P.S.: Due to the random initializations, getting slightly different scores from the ones mentioned in the example is perfectly normal.