nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/
The Unlicense

Produce metrics to show which labels are being classified correctly and which aren't, and how they're being misclassified. #61

Closed · bruffridge closed this issue 3 years ago

bruffridge commented 3 years ago

Perhaps generating multiple confusion matrices would give us this information?

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.multilabel_confusion_matrix.html
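
A minimal sketch of how that function could be called on binary indicator arrays (the label names and arrays below are placeholders, not PeTaL data or the labeller's actual code):

# Per-label confusion matrices from binary indicator arrays.
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

label_names = ["attach", "capture_energy", "move"]  # placeholder labels

# y_true / y_pred: shape (n_papers, n_labels); 1 = label present
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 0, 0],
                   [1, 1, 0]])

# One 2x2 matrix per label; scikit-learn's layout is
# [[TN, FP],
#  [FN, TP]]
for name, matrix in zip(label_names, multilabel_confusion_matrix(y_true, y_pred)):
    print(name)
    print(matrix)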

elkong commented 3 years ago

Preliminary analysis at this point (MAG and MeSH labels are correctly being considered):

precision_vs_freq_20210720
recall_vs_freq_20210720

It seems that labels with more occurrences in the training data are classified correctly more often (or at least, frequency correlates positively with precision and recall), although there's a lot of noise because our dataset lacks many examples for most labels.
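
For reference, a rough sketch of the kind of script that could produce per-label precision/recall versus training frequency; everything below is random placeholder data, not the labeller's actual code or results:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score

# Placeholder data: binary indicator predictions for 131 labels.
rng = np.random.default_rng(0)
n_papers, n_labels = 100, 131
y_true = rng.integers(0, 2, size=(n_papers, n_labels))
y_pred = rng.integers(0, 2, size=(n_papers, n_labels))
train_label_counts = rng.integers(1, 200, size=n_labels)  # frequency in training set

# Per-label precision and recall (average=None keeps one score per label).
precision = precision_score(y_true, y_pred, average=None, zero_division=0)
recall = recall_score(y_true, y_pred, average=None, zero_division=0)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(train_label_counts, precision)
axes[0].set(xlabel="label frequency in training set", ylabel="precision")
axes[1].scatter(train_label_counts, recall)
axes[1].set(xlabel="label frequency in training set", ylabel="recall")
plt.tight_layout()
plt.show()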

bruffridge commented 3 years ago

Thanks @elkong . Can you generate something like this too if you think it will give us more information to improve the labeller? https://stackoverflow.com/questions/48872738/understanding-multi-label-classifier-using-confusion-matrix


elkong commented 3 years ago

@bruffridge Okay, I do have a plot like this! First, some background: my previous attempts two weeks ago with scikit-learn's multilabel confusion matrix function (the one you linked in the top-level comment) didn't give very much cross-label information. It turns out the function "binarizes multiclass data under a one-versus-rest transformation", i.e., it stops paying attention to which labels are being misclassified as which other labels and treats all labels as independent. In effect, it gave me 131 2x2 matrices, one for each label, like this:

CONFUSION MATRIX FORMAT:
[[true_negatives  false_negatives]
 [false_positives true_positives ]]
regulate_reproduction_or_growth
[[92  1]
 [ 7  0]]
capture_energy
[[99  0]
 [ 1  0]]
chemically_assemble_organic_compounds
[[92  2]
 [ 6  0]]
attach
[[79 13]
 [ 8  0]]
sense_send_or_process_information
[[66 13]
 [20  1]]
prevent_fracture/rupture
[[96  1]
 [ 2  1]] 

So today I went and calculated the 131-by-131 matrix without using that function. Here's a link to my confusion matrix, plotted (I hope it works):

Multilabel confusion matrix for MATCH on PeTaL data

In this confusion matrix, each cell shows the average probability of predicting a given label for test-set papers whose ground truth contains a given label. Bright spots indicate probabilities closer to 1, and darker spots indicate probabilities closer to 0.

The (slight) diagonal bright streak shows the probability of assigning a predicted label to a paper with that same ground truth label. Observe that some parent-label ("Level I") columns (e.g., move and sense_send_or_process_information) have lots of bright spots down the column. This is expected, since all of their leaf labels should correctly have their parent labels predicted for their papers as well.

Because there are 131 labels in this matrix, the diagram is very big and rather unwieldy. Future work on confusion matrix visualization may focus on a subset of labels (perhaps the most common ones).
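
A rough sketch of one way such a 131-by-131 matrix could be computed, based on the description above rather than the exact code used (the arrays below are random placeholders):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: ground truth indicators and predicted probabilities.
rng = np.random.default_rng(0)
n_papers, n_labels = 100, 131
y_true = rng.integers(0, 2, size=(n_papers, n_labels))
y_score = rng.random(size=(n_papers, n_labels))  # predicted probabilities

cross = np.zeros((n_labels, n_labels))
for i in range(n_labels):
    has_label = y_true[:, i] == 1  # test papers whose ground truth contains label i
    if has_label.any():
        # Row i: average predicted probability of every label over those papers.
        cross[i] = y_score[has_label].mean(axis=0)

plt.imshow(cross)  # bright = probability near 1, dark = near 0
plt.xlabel("predicted label")
plt.ylabel("ground truth label")
plt.colorbar()
plt.show()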

bruffridge commented 3 years ago

@elkong Very nice! Can you show only leaf labels, since those are really all we want our labeller to target?

elkong commented 3 years ago

@bruffridge yes! I have some further analysis in this vein (which I also shared at our meeting):

If you sort the labels by the frequency with which they appear in my training subset, you get this (the more frequent labels occur at the left and at the top): Multilabel Confusion Matrix, Labels Sorted by Frequency

If you filter out all the non-leaf labels so all you have are leaf labels, you get this: Multilabel Confusion Matrix, Leaf Labels Sorted by Frequency

A close-up of the top 20 leaf labels is here: Multilabel Confusion Matrix, Top 20 Leaf Labels Sorted by Frequency

Do note that I have not dropped any labels at the training stage, so all of these plots come from the same training/evaluation results. The filtering came at the plotting stage -- the last two plots are just filtered versions of the first plot.
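
A sketch of how that plotting-stage sorting and filtering might look, assuming a full label-by-label matrix, per-label training frequencies, and a leaf-label mask (all placeholders below, not the actual plotting code):

import numpy as np

# Placeholder inputs.
rng = np.random.default_rng(0)
n_labels = 131
cross = rng.random((n_labels, n_labels))                  # full label-by-label matrix
train_label_counts = rng.integers(1, 200, size=n_labels)  # frequency in training subset
is_leaf = rng.integers(0, 2, size=n_labels).astype(bool)  # True for leaf labels

# 1. Sort all labels by training frequency, most frequent first.
order = np.argsort(-train_label_counts)
sorted_matrix = cross[np.ix_(order, order)]

# 2. Keep only leaf labels, preserving the frequency order.
leaf_order = [i for i in order if is_leaf[i]]
leaf_matrix = cross[np.ix_(leaf_order, leaf_order)]

# 3. Close-up: the top 20 most frequent leaf labels.
top20 = leaf_order[:20]
top20_matrix = cross[np.ix_(top20, top20)]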