QUAC ("quantitative analysis of chatter" or any related acronym you like) is a package for acquiring and analyzing social Internet content. Docs are online at http://reidpr.github.io/quac.
Apache License 2.0
68
stars
28
forks
source link
related work on active learning with high class imbalance #70
"Guided Learning:" user searches for an equal distribution of positive and negative examples
This is often much better than active learning...
_...but the human cost may be greater (must search for _and* label examples)
A hybrid strategy can work well:
Given a predefined cost factor (e.g., guided learning is 8 times costlier than active learning), compute a cost-benefit tradeoff to determine when to switch from guided to active learning
"The certainty of the underlying model that the as-of-yet unexplored disjuncts are of the wrong class hinders
the success of the active learning ". Translation: when the minority class consists of multiple clusters, finding those different clusters quickly is important. Here's an interesting plot of this, where different clusters of blue dots represent subspaces of the minority class. Bumps in accuracy seem to coincide with discovering these subspaces.