sgoldenlab / simba

SimBA (Simple Behavioral Analysis), a pipeline and GUI for developing supervised behavioral classifiers
https://simba-uw-tf-dev.readthedocs.io/
GNU General Public License v3.0

Building a model with few frames containing the behavior of interest #185

Open carlitomu opened 2 years ago

carlitomu commented 2 years ago

Hi, I'm building an analysis model with only one classified behavior. The model is based on 24 videos, in which this behavior is present in about 1.4% of all labeled frames on average. The ratio (1.4%) is the same in other videos that I have not incorporated into the model.

I've read your papers (1 - 2) and I was wondering whether this (the low presence of the behavior in the videos) can negatively affect the model, because as you wrote "One weakness of random forests is their inability to natively support biased datasets. These datasets are common in behavioral videos, in which most frames do not contain the behavior of interest" (or "Random forest classifiers - as other classification techniques - are sensitive to class imbalances").

If this is the case, should I change any hyperparameters in the settings? And how? I see (and read in the papers) "Under sample setting" and "Under sample ratio" parameters: do I have to change their values?

Thanks!

sronilsson commented 2 years ago

Hi @carlitomu! Yes, this imbalance (98.6% no behavior vs 1.4% behavior) is typical, and it is one of the main hurdles to getting an accurate classifier up and running. The issue is that the classifier can reach 98.6% accuracy simply by predicting "no behavior" on every frame, and we need to stop that from happening.
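To make the accuracy problem concrete, here is a minimal sketch in plain Python. It is not SimBA code; the function name `baseline_metrics` and the frame count of 100,000 are assumptions chosen for illustration. It scores a "classifier" that predicts "no behavior" on every frame when 1.4% of frames contain the behavior.

```python
def baseline_metrics(n_frames: int, behavior_fraction: float) -> dict:
    """Score a hypothetical classifier that predicts "no behavior" on every frame."""
    n_behavior = round(n_frames * behavior_fraction)
    n_no_behavior = n_frames - n_behavior

    # Every "no behavior" frame is a true negative; every behavior frame is missed.
    true_positives = 0
    true_negatives = n_no_behavior

    accuracy = (true_positives + true_negatives) / n_frames
    # Recall: the share of real behavior frames that the classifier finds.
    recall = true_positives / n_behavior if n_behavior else 0.0
    return {"accuracy": accuracy, "recall": recall}


# Illustrative numbers only: 100,000 frames, 1.4% of them with the behavior.
metrics = baseline_metrics(n_frames=100_000, behavior_fraction=0.014)
print(metrics)
```

The accuracy comes out at 98.6% while the recall is zero, which is why accuracy alone says little on data like this and why it helps to look at precision and recall for the behavior class.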

There are a few ways to solve this. In SimBA, I recommend trying random undersampling first. To do this, set the Under sample setting entry box to Random undersample and set the Under sample ratio entry box to 1.0. This number means that you will enter all of your behavior frames into the algorithm, together with an equal number of "no behavior" frames. E.g., if you have 1000 frames annotated with the behavior present and enter 1.0 in the Under sample ratio box, SimBA will use those together with 1000 randomly selected non-behavior frames.

If you enter 0.8 in the Under sample ratio entry box, SimBA will use the 1000 annotated frames together with 800 randomly selected non-behavior frames.

If you enter 1.2 in the Under sample ratio entry box, SimBA will use the 1000 annotated frames together with 1200 randomly selected non-behavior frames.
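The arithmetic above can be sketched in a few lines of plain Python. This only illustrates the idea of random undersampling and is not SimBA's actual implementation; the function `random_undersample`, its `seed` argument, and the toy label list are all assumptions made for the example.

```python
import random


def random_undersample(labels, ratio, seed=0):
    """Keep every behavior frame plus (ratio * behavior count) random non-behavior frames.

    labels: sequence of 0/1 frame annotations (1 = behavior present).
    Returns the sorted indices of the frames that would be used for training.
    """
    behavior_idx = [i for i, y in enumerate(labels) if y == 1]
    no_behavior_idx = [i for i, y in enumerate(labels) if y == 0]

    n_keep = round(len(behavior_idx) * ratio)
    if n_keep > len(no_behavior_idx):
        raise ValueError("ratio asks for more non-behavior frames than exist")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept_no_behavior = rng.sample(no_behavior_idx, n_keep)
    return sorted(behavior_idx + kept_no_behavior)


# Toy data: 1000 behavior frames and 70,000 non-behavior frames (about 1.4%).
labels = [1] * 1000 + [0] * 70_000
for ratio in (0.8, 1.0, 1.2):
    used = random_undersample(labels, ratio)
    n_non_behavior = sum(1 for i in used if labels[i] == 0)
    print(ratio, len(used) - n_non_behavior, n_non_behavior)
```

With these toy labels, the three ratios keep all 1000 behavior frames alongside 800, 1000 and 1200 non-behavior frames respectively, matching the examples above.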

The best value for the Under sample ratio is empirical and depends on your behavior and videos, but 1.0 is usually a good place to start! If you see that the classifier is over-classifying frames as containing your behavior, increase the Under sample ratio so that it sees more non-behavior frames during training. If the classifier is under-classifying frames as containing your behavior, try decreasing the Under sample ratio.