We need to include tests and a dataset in this notebook (or package). Perhaps I will use metaphor datasets for now.
I am presently fine-tuning a BERT model on a metaphor dataset from ACL. It adds masked probabilities to the standard BERT model for classification. I am only 3 epochs in, but you can already see some improvement, and I suspect we will want to prioritize precision over recall. Here are the classification reports for the first three epochs on the validation set.
Epoch 0
precision recall f1-score support
no_metaphor 0.99 0.59 0.74 80052
metaphor 0.26 0.97 0.41 11631
accuracy 0.64 91683
macro avg 0.63 0.78 0.58 91683
weighted avg 0.90 0.64 0.70 91683
Epoch 1
precision recall f1-score support
no_metaphor 0.99 0.66 0.79 80052
metaphor 0.29 0.96 0.44 11631
accuracy 0.70 91683
macro avg 0.64 0.81 0.62 91683
weighted avg 0.90 0.70 0.75 91683
Epoch 2
precision recall f1-score support
no_metaphor 0.99 0.72 0.83 80052
metaphor 0.33 0.94 0.48 11631
accuracy 0.75 91683
macro avg 0.66 0.83 0.66 91683
weighted avg 0.90 0.75 0.79 91683
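For reference, here is a minimal sketch of how per-epoch validation reports like the ones above can be generated with scikit-learn, and of how raising the decision threshold on the metaphor probability trades recall for precision (per the suggestion above). The synthetic probabilities, labels, and the 0.5/0.7 thresholds are placeholders for illustration, not values from this run.

```python
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Synthetic stand-ins for the fine-tuned model's per-token metaphor
# probabilities and the gold validation labels (1 = metaphor, 0 = no_metaphor).
# In the actual notebook these would come from the BERT model and the ACL
# metaphor dataset.
val_labels = rng.binomial(1, 0.13, size=10_000)
val_probs = val_labels * 0.4 + rng.random(10_000) * 0.6

def report(probs, labels, threshold=0.5):
    """Print a classification report at a given decision threshold."""
    preds = (probs >= threshold).astype(int)
    print(f"threshold = {threshold}")
    print(classification_report(labels, preds,
                                target_names=["no_metaphor", "metaphor"]))

# Default threshold, as in the per-epoch reports above.
report(val_probs, val_labels, threshold=0.5)

# A higher threshold trades recall for precision.
report(val_probs, val_labels, threshold=0.7)
```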
If we cannot improve upon it, we will use Baseline 3: BERT below (a very straightforward model that anyone can easily build). That baseline was built for detecting metaphors, but I think we can apply a similar approach to frame blending. Note that Baseline 3: BERT scores .712 precision and .725 recall: 71.2% of the items it labels as metaphors are actually metaphors (precision), and it correctly identifies 72.5% of all actual metaphors (recall).
The model I reference in the above comments is a variation on this baseline model. I also wonder if incorporating GloVe windows around each word might be a good addition to this model.
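As a rough sketch of that GloVe-window idea, the snippet below averages pre-trained GloVe vectors over a fixed window around each token to produce one extra feature vector per token. The file path, window size, and averaging strategy are assumptions for illustration, not part of the baseline.

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a plain-text file of 'word v1 v2 ...' lines."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def window_features(tokens, vectors, window=2, dim=100):
    """Average the GloVe vectors in a +/- `window` context around each token."""
    feats = []
    for i in range(len(tokens)):
        context = tokens[max(0, i - window): i + window + 1]
        vecs = [vectors[t.lower()] for t in context if t.lower() in vectors]
        feats.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32))
    return np.stack(feats)  # shape: (len(tokens), dim)

# Hypothetical usage; "glove.6B.100d.txt" is a placeholder path.
# glove = load_glove("glove.6B.100d.txt")
# feats = window_features("He tackled the problem head on".split(), glove)
# These per-token features could then be concatenated with BERT's hidden states
# before the classification layer.
```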
Rank Team P R F1
All POS
1 DeepMet .756 .783 .769
2 Go Figure! .721 .748 .734
3 illiniMet .746 .715 .730
4 rowanhm .727 .709 .718
5 Baseline 3: BERT .712 .725 .718
6 zhengchang .696 .729 .712
7 chasingkangaroos .702 .704 .703
8 Duke Data Science .662 .699 .680
9 Zenith .630 .716 .670
10 umd bilstm .733 .601 .660
11 atr2112 .599 .672 .633
12 PolyU-LLT .556 .660 .603
13 iiegn .601 .591 .596
14 UoB team .653 .548 .596
15 Baseline 2: bot.zen .612 .575 .593
16 Baseline 1: UL + .510 .696 .589
Table 4: VUA Dataset: Performance and ranking of the best system per team and baselines, for the All POS track. (The original table also includes a Verbs track panel, which is not reproduced above.)
See pages 34-37 of the full proceedings PDF (page numbers relative to the entire PDF; ignore the page numbers printed at the bottom of each page, which are relative to each paper) for more details:
https://www.aclweb.org/anthology/2020.figlang-1.pdf
The full architecture of the baseline model can be found on page 250 of the same document (again, relative to the entire PDF).
(A number of the other teams used ensemble methods, which are useful for hill climbing but are not always useful in production.)
@jobarber has definitely completed this issue. Now we are just working with the various models.
We have constructed a couple of quick toy models. We now need a notebook (or package) we can all share to try out different systems and identify the best-performing one.
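As a starting point for that shared notebook or package, something like the sketch below could give every candidate system the same evaluation entry point. The function name, the predict_fn interface, and the toy keyword-matching system in the usage example are all hypothetical.

```python
from typing import Callable, Sequence
from sklearn.metrics import precision_recall_fscore_support

def evaluate_system(predict_fn: Callable[[Sequence[str]], Sequence[int]],
                    texts: Sequence[str],
                    gold: Sequence[int]):
    """Run a candidate system on shared data and report metaphor-class P/R/F1.

    predict_fn maps a list of texts to 0/1 labels (1 = metaphor).
    """
    preds = predict_fn(texts)
    p, r, f1, _ = precision_recall_fscore_support(
        gold, preds, average="binary", pos_label=1, zero_division=0
    )
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
    return {"precision": p, "recall": r, "f1": f1}

# Toy example with a trivial keyword-based "system", just to show the interface.
texts = ["He tackled the problem", "She opened the door", "Time is money"]
gold = [1, 0, 1]
evaluate_system(lambda xs: [1 if ("tackled" in x or " is " in x) else 0 for x in xs],
                texts, gold)
```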