Most classification algorithms will only perform optimally when the number of samples of each class is roughly the same.

scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

https://imbalanced-learn.org

MIT License


Most classification algorithms will only perform optimally when the number of samples of each class is roughly the same. #103

Closed: soodoku closed this issue 8 years ago

soodoku commented 8 years ago

It would be great to cite that claim.

(To the best of my knowledge, and after thinking carefully about the point, my sense is that the claim is incorrect. The impact of imbalance on performance depends on the variance within each class, the amount of data, and other factors, not just the split.)
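To make the point about within-class variance concrete, here is a minimal sketch under simplifying assumptions that are not from the thread: two one-dimensional Gaussian classes with equal standard deviation, classified with the accuracy-optimal (Bayes) threshold. The function names, the class means, and the example priors are all hypothetical. It computes minority-class recall in closed form, so the same 95/5 split can be compared at low and high class overlap.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def minority_recall(prior, sigma, mu_major=0.0, mu_minor=1.0):
    """Minority-class recall under the accuracy-optimal threshold.

    Assumptions (illustrative only): both classes are 1-D Gaussians with
    the same standard deviation `sigma`, the minority class has prior
    probability `prior`, and mu_minor > mu_major.
    """
    gap = mu_minor - mu_major
    # Points above this threshold are assigned to the minority class.
    # The second term pushes the threshold away from the majority class
    # as the imbalance or the variance grows.
    threshold = (mu_major + mu_minor) / 2.0 + sigma ** 2 * math.log((1.0 - prior) / prior) / gap
    return 1.0 - normal_cdf((threshold - mu_minor) / sigma)


if __name__ == "__main__":
    for prior, sigma in [(0.5, 1.0), (0.05, 1.0), (0.05, 0.1)]:
        recall = minority_recall(prior, sigma)
        print(f"minority prior={prior:<5} sigma={sigma:<4} recall={recall:.4f}")
```

Under these assumptions, a 95/5 split barely affects minority recall when the classes are well separated (small `sigma`), while the same split drives recall close to zero when the classes overlap heavily. That is consistent with the claim that the split alone does not determine the damage.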

dvro commented 8 years ago

@soodoku you are right, there are several factors that affect the learning process.

In fact, for some algorithms, imbalanced datasets undermine performance only when the data form small disjunct clusters and/or the classes overlap.

But I believe what is meant by this statement is that, in general, the class imbalance problem has to be handled by preprocessing (under-/oversampling), algorithm modification (changing the algorithm), or cost-sensitive learning. And the preprocessing approach is a broad generalization that usually works for any classifier.

Let's see what others have to say. If that's the case, we can change the statement to:

Most standard learning algorithms consider a balanced training set; this may generate suboptimal classification models when learning from imbalanced data [1]
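Two of the options listed above can be sketched in a few lines of plain Python. This is an illustration rather than imbalanced-learn's implementation: `random_oversample` and `balanced_class_weights` are hypothetical helper names, the toy data is made up, and the weighting formula `n_samples / (n_classes * count)` is one common inverse-frequency heuristic for cost-sensitive learning.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Preprocessing: duplicate random minority samples until every class
    has as many samples as the largest class. Originals are kept first."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == cls]
        # Draw with replacement from this class to fill the gap.
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


def balanced_class_weights(labels):
    """Cost-sensitive learning: weight each class inversely to its
    frequency, so errors on rare classes cost more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy data (made up): 8 majority samples, 2 minority samples.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [2.0], [2.1]]
y = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X, y)
weights = balanced_class_weights(y)

if __name__ == "__main__":
    print(Counter(y_res))
    print(weights)
```

Oversampling changes the training set and leaves the classifier untouched, which is why it applies to almost any model. Class weights leave the data untouched but require a learner that accepts per-class or per-sample costs.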

[1] http://www.sciencedirect.com/science/article/pii/S0020025513005124

chkoar commented 8 years ago

Class-imbalanced datasets are challenging to analyze, as the performance of common classification models (for example decision trees) on class-imbalanced datasets tends to be suboptimal [1]

[1] Taft L.M., Evans R.S., Shyu C.R., Egger M.J., Chawla N., Mitchell J.A., Thornton S.N., Bray B., Varner M. Countering imbalanced datasets to improve adverse drug event predictive models in labor and delivery. J. Biomed. Inform. 2009;42:356–364.

soodoku commented 8 years ago

Thanks @dvro

The word "most" in the context of the sentence means greater than 50%, so that is certainly unwarranted. "only perform optimally" is also wrong, afaik. Logistic regression has a small-sample bias (for a 101 kind of paper plus a proposed cure, see http://gking.harvard.edu/files/0s.pdf), and there may be other issues, but I would be surprised if the statement made sense even when charitably edited.

But no point fussing too much, I suppose. The ML motto is: do whatever works, as long as it gives you good out-of-sample performance.