soodoku closed 8 years ago
@soodoku you are right, there are several factors that affect the learning process.
In fact, some algorithms have their performance undermined by imbalanced datasets only when the data forms small disjunct clusters and/or the classes overlap.
But I believe what is meant by this statement is that, in general, the class imbalance problem has to be handled by either preprocessing (under/over sampling), algorithm modification (changing the algorithm), or cost-sensitive learning. And the preprocessing approach is a broad generalization that usually works for any classifier.
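As a rough illustration of the preprocessing route mentioned above, here is a minimal sketch of random oversampling using only the Python standard library. The function name `random_oversample` and the toy data are made up for this example; it simply duplicates randomly chosen minority rows until every class matches the largest one, and it is not meant to represent any particular library's implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Toy data: four majority rows (label 0) and one minority row (label 1).
X = [[0.1], [0.2], [0.3], [0.4], [5.0]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
```

Because the resampling happens before training, the result can be fed to any classifier unchanged, which is the sense in which the preprocessing approach is classifier-agnostic.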
Let's see what others have to say. If that's the case, we can change the statement to:
Most standard learning algorithms assume a balanced training set; this may produce suboptimal classification models when learning from imbalanced data [1]
[1] http://www.sciencedirect.com/science/article/pii/S0020025513005124
Class-imbalanced datasets are challenging to analyze, as the performance of common classification models (for example decision trees) on class-imbalanced datasets tends to be suboptimal [1]
[1] Taft L.M., Evans R.S., Shyu C.R., Egger M.J., Chawla N., Mitchell J.A., Thornton S.N., Bray B., Varner M. Countering imbalanced datasets to improve adverse drug event predictive models in labor and delivery. J. Biomed. Inform. 2009;42:356–364.
Thanks @dvro
The word "most" in the context of the sentence means greater than 50%, so that is certainly unwarranted. "only perform optimally" is also wrong, afaik. Logistic regression has a small-sample bias (for a 101 kind of paper plus a proposed cure, see http://gking.harvard.edu/files/0s.pdf), and there may be other issues, but I would be surprised if the statement, even charitably edited, made sense.
But no point fussing too much, I suppose. The ML motto is: do whatever, as long as it gives you good out-of-sample performance.
It would be great to cite that claim.
(To the best of my knowledge, and after carefully thinking about the point, my sense is that the claim is incorrect. The impact of imbalance on performance depends on the variance within the class(es), the amount of data, and so on, not just the split.)
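To make the point about within-class variance concrete, here is a small sketch under strong simplifying assumptions: two one-dimensional Gaussian classes with equal variance, known parameters, and the error-minimizing (Bayes) decision threshold. The function name `minority_recall` and all numbers are illustrative choices, not taken from the thread or any cited paper. The class split is held at 1:99 in both cases; only the within-class standard deviation changes.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def minority_recall(mu_major, mu_minor, sigma, p_minor):
    """Recall on the minority class under the error-minimizing threshold
    for two equal-variance 1-D Gaussians (assumes mu_minor > mu_major)."""
    # Threshold where the prior-weighted class densities are equal.
    threshold = (mu_major + mu_minor) / 2.0 + (
        sigma ** 2 / (mu_minor - mu_major)
    ) * math.log((1.0 - p_minor) / p_minor)
    # Minority points are predicted as minority when x > threshold.
    return 1.0 - normal_cdf((threshold - mu_minor) / sigma)


# Same 1:99 split, same class means, different within-class spread.
recall_tight = minority_recall(0.0, 3.0, sigma=0.5, p_minor=0.01)
recall_wide = minority_recall(0.0, 3.0, sigma=3.0, p_minor=0.01)
print(f"low variance:  {recall_tight:.3f}")
print(f"high variance: {recall_wide:.5f}")
```

Under these assumptions the low-variance case recovers nearly all minority points while the high-variance case recovers almost none, even though the split is identical, which is consistent with the argument that the split alone does not determine how much imbalance hurts.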