rodekruis/social-media-listening


Measure the quality of automated classification #161

Closed: jmargutt closed this issue 3 months ago

jmargutt commented 4 months ago

Context

We're currently trying to understand which version of SML is fit for scale. In the current version of SML, labels were generated manually. From discussion between product and technical, it is clear that the product would benefit from an automated definition of the output labels. The starting point will be the IFRC CEA Coding Framework (IFRC CF), to be complemented at a later stage with topic modeling.

Tasks

*Since our "test set" consists of messages labeled with our current labels, we can only test the model on the part of the IFRC CF that overlaps with those labels (see the sketch below).
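For illustration, a minimal sketch of how that overlap could be enforced during evaluation, assuming the messages live in a pandas DataFrame; `LABEL_MAP` and all label names below are hypothetical, not the actual framework categories:

```python
import pandas as pd

# Hypothetical mapping from current 510-Ukraine labels to IFRC CF labels;
# only categories present in both frameworks are listed.
LABEL_MAP = {
    "cash_assistance": "cash_and_vouchers",
    "health_services": "health",
    "shelter": "shelter_and_housing",
}

def overlap_test_set(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Keep only messages whose manual label maps to an IFRC CF label,
    and rewrite the label column into IFRC CF terms."""
    subset = df[df[label_col].isin(LABEL_MAP)].copy()
    subset[label_col] = subset[label_col].map(LABEL_MAP)
    return subset
```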

Next Steps

  1. Decide on a standard quality benchmark (e.g. a minimum accuracy) that serves as the "definition of good" for our model + labels; a possible baseline sketch follows this list
  2. Validate the quality metrics using the full IFRC CF
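One common floor for such a benchmark (an assumption on our side, not something decided in this issue) is that the model must beat the majority-class baseline, i.e. always predicting the most frequent label:

```python
from collections import Counter

def majority_baseline_accuracy(labels: list[str]) -> float:
    """Accuracy obtained by always predicting the most common label;
    any useful classifier should clearly exceed this."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)
```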
jmargutt commented 4 months ago

[Figure: classification accuracy vs. number of labeled training examples, 510-Ukraine coding framework]

[Figure: classification accuracy vs. number of labeled training examples, IFRC coding framework]

@ibadyal here's the classification accuracy for both our coding framework (510-Ukraine) and the IFRC one. Accuracy is defined as the percentage of messages classified with the correct label. Results are shown as a function of the number of manually labeled examples the model is allowed to learn from: with this information, we can decide how much manual work we (or the NS) will need to reach a given level of accuracy.
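For reference, a minimal sketch of this kind of learning-curve experiment, using a simple TF-IDF + logistic regression classifier as a stand-in; the actual SML model, data, and function names here are assumptions, not the code behind the plots above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def learning_curve(train_texts, train_labels, test_texts, test_labels,
                   sizes=(50, 100, 200, 500, 1000)):
    """Return {n_examples: accuracy}: train on the first n manually labeled
    examples, evaluate on a fixed held-out set, repeat for increasing n."""
    results = {}
    for n in sizes:
        if n > len(train_texts):
            break
        model = make_pipeline(TfidfVectorizer(),
                              LogisticRegression(max_iter=1000))
        model.fit(train_texts[:n], train_labels[:n])
        preds = model.predict(test_texts)
        results[n] = accuracy_score(test_labels, preds)
    return results
```

Plotting the returned dictionary (accuracy against n) gives curves of the same shape as the figures above, which is what lets us estimate the manual labeling effort needed to reach a target accuracy.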

Notes: