sgoldenlab / simba

SimBA (Simple Behavioral Analysis), a pipeline and GUI for developing supervised behavioral classifiers
https://simba-uw-tf-dev.readthedocs.io/
GNU General Public License v3.0

How to improve f1-score? #229

Open Urimons opened 1 year ago

Urimons commented 1 year ago

Hello, I'm using SimBA to classify self-grooming behavior in mice. I have gone through all the stages of the GUI and also "played" with the training hyperparameters, but was not able to change the F1 score (I ran the training multiple times and the results were quite similar). Attached here is the output of the grooming classification report. [Attached image: Grooming_classification_report]

What should I do to increase the F1 score? Preferably by improving the recall score, which is the lower of the two. Thanks a lot, Uri

sronilsson commented 1 year ago

Hi @Urimons!

See the answer I just posted for a similar question: https://github.com/sgoldenlab/simba/issues/227. I can see a fairly large imbalance in your image between the number of frames with grooming present and grooming absent. Did you try any random undersampling?
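If you want to sanity-check the imbalance yourself, a minimal sketch (not SimBA's internal code) is to count annotated frames per class; the file path and the 0/1 column name "Grooming" here are hypothetical and depend on your project:

```python
# Count annotated frames per class to gauge imbalance (hypothetical path/column).
import pandas as pd

df = pd.read_csv("project_folder/csv/targets_inserted/Video1.csv")  # hypothetical path
counts = df["Grooming"].value_counts()   # frames with grooming absent (0) vs present (1)
print(counts)
print(counts.min() / counts.max())       # rough imbalance ratio between the two classes
```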

Urimons commented 1 year ago

No, I had not tried that before, but I did now. The results are interesting: they seem to be the opposite, as the recall is now high and the precision is low (attached image). I think that is because I don't really understand the meaning of the random undersampling and the sample ratio. What does an under sample ratio of "1" actually mean? And is there another way to do random sampling? I just wrote "random undersample" in the under sample setting.

[Attached image: Grooming_classification_report]

sronilsson commented 1 year ago

1 means that you take all of your grooming annotations (say 1000), get an equal number N (1000) of random non-grooming annotations, and train the classifier on that data. If you put 1.2, then you will train on all of your 1000 grooming annotations and 1200 random non-grooming annotations. See if you can find a ratio that produces the best-balanced F1.
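To make the ratio concrete, here is a minimal sketch of that logic (not SimBA's internal implementation): keep every grooming frame and draw ratio * N random non-grooming frames. The 0/1 column name "Grooming" is an assumption.

```python
# Sketch of ratio-based random undersampling: keep all present-class frames,
# sample ratio * N frames from the absent class, then shuffle.
import pandas as pd

def undersample(df: pd.DataFrame, target_col: str = "Grooming",
                ratio: float = 1.0, seed: int = 0) -> pd.DataFrame:
    present = df[df[target_col] == 1]                        # all grooming annotations
    absent = df[df[target_col] == 0]                         # all non-grooming frames
    n_absent = min(len(absent), int(len(present) * ratio))   # e.g. ratio 1.2 -> 1200 if 1000 present
    sampled = absent.sample(n=n_absent, random_state=seed)   # random draw of non-grooming frames
    return pd.concat([present, sampled]).sample(frac=1, random_state=seed)

# undersample(df, ratio=1.0) would train on a 1:1 grooming / non-grooming split.
```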

Urimons commented 1 year ago

Thanks!

Urimons commented 1 year ago

The precision and recall are changing, but the F1 score stays around the same value more or less. [Attached images: Grooming_classification_report x3] These are for under sample ratios of 1, 0.5, and 1.5, respectively.

sronilsson commented 1 year ago

Got it. You can check with learning curves (tick this box). Do you see any discrimination threshold that yields good enough precision and recall?
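Outside of the GUI, a minimal sketch of that kind of threshold sweep (not SimBA's built-in evaluation) could look like the following, assuming you have the true 0/1 labels as y_true and the classifier probabilities as y_prob:

```python
# Sweep discrimination thresholds and report precision, recall and F1 at each.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def threshold_sweep(y_true, y_prob, thresholds=np.arange(0.1, 1.0, 0.05)):
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)  # frames at or above threshold count as grooming
        print(f"threshold={t:.2f}  "
              f"precision={precision_score(y_true, y_pred, zero_division=0):.3f}  "
              f"recall={recall_score(y_true, y_pred, zero_division=0):.3f}  "
              f"f1={f1_score(y_true, y_pred, zero_division=0):.3f}")
```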

Urimons commented 1 year ago

The highest F1 score I can get is 0.607, with a discrimination threshold of 0.3. That's better than what I had before, but I wish I could get a better F1 score. Any ideas? Or is an F1 score of 0.6 considered good for behavioral classification?

sronilsson commented 1 year ago

I can't say if 0.6 is good; I don't know enough about the data. I hate to suggest "annotate more", as it can involve a lot of work, but I have pasted below an answer I gave on Gitter recently.

In general, it is helpful to be selective about the frames being annotated (which can also save time), rather than annotating indiscriminately: I wrote one tool HERE called pseudo-labelling that can help, and another called advanced labelling HERE. There is also a tool HERE to help understand what the classifier has trouble with.

E.g., the key to annotations is not so much quantity as quality. Is there some way of figuring out, through visualizations, whether there is a specific form of grooming event the classifier gets wrong, and adding those frames as correct annotations to the classifier? Are there shorter grooming bouts, or frames where the animal is angled in a specific way, that the classifier doesn't have many good examples of?