We need to include tests and a dataset in this notebook (or package). Perhaps I will use metaphor datasets for now.
I am presently fine-tuning a BERT model on a metaphor dataset from ACL. It adds masked probabilities to the standard BERT model for classification. I am only 3 epochs in, but you can already see some improvement, and I suspect we will want to prioritize precision over recall. Here are the classification reports for the first three epochs on the validation set.
Epoch 0
precision recall f1-score support
no_metaphor 0.99 0.59 0.74 80052
metaphor 0.26 0.97 0.41 11631
accuracy 0.64 91683
macro avg 0.63 0.78 0.58 91683
weighted avg 0.90 0.64 0.70 91683
Epoch 1
precision recall f1-score support
no_metaphor 0.99 0.66 0.79 80052
metaphor 0.29 0.96 0.44 11631
accuracy 0.70 91683
macro avg 0.64 0.81 0.62 91683
weighted avg 0.90 0.70 0.75 91683
Epoch 2
precision recall f1-score support
no_metaphor 0.99 0.72 0.83 80052
metaphor 0.33 0.94 0.48 11631
accuracy 0.75 91683
macro avg 0.66 0.83 0.66 91683
weighted avg 0.90 0.75 0.79 91683
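For reference, here is a minimal sketch of how per-epoch validation reports like the ones above can be generated with scikit-learn, and of how raising the decision threshold on the metaphor probability trades recall for precision (per the suggestion above). The synthetic probabilities, labels, and the 0.5/0.7 thresholds are placeholders for illustration, not values from this run.

```python
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Synthetic stand-ins for the fine-tuned model's per-token metaphor
# probabilities and the gold validation labels (1 = metaphor, 0 = no_metaphor).
# In the actual notebook these would come from the BERT model and the ACL
# metaphor dataset.
val_labels = rng.binomial(1, 0.13, size=10_000)
val_probs = val_labels * 0.4 + rng.random(10_000) * 0.6

def report(probs, labels, threshold=0.5):
    """Print a classification report at a given decision threshold."""
    preds = (probs >= threshold).astype(int)
    print(f"threshold = {threshold}")
    print(classification_report(labels, preds,
                                target_names=["no_metaphor", "metaphor"]))

# Default threshold, as in the per-epoch reports above.
report(val_probs, val_labels, threshold=0.5)

# A higher threshold trades recall for precision.
report(val_probs, val_labels, threshold=0.7)
```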
If we cannot improve upon it, we will use Baseline 3: BERT below (a very straightforward model that anyone can easily build). That baseline was built for detecting metaphors, but I think we can apply a similar approach to frame blending. Note that Baseline 3: BERT scores .712 precision and .725 recall: 71.2% of the items it labels as metaphors are actually metaphors (precision), and it correctly identifies 72.5% of all actual metaphors (recall).
The model I reference in the above comments is a variation on this baseline model. I also wonder if incorporating GloVe windows around each word might be a good addition to this model.
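As a rough sketch of that GloVe-window idea, the snippet below averages pre-trained GloVe vectors over a fixed window around each token to produce one extra feature vector per token. The file path, window size, and averaging strategy are assumptions for illustration, not part of the baseline.

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a plain-text file of 'word v1 v2 ...' lines."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def window_features(tokens, vectors, window=2, dim=100):
    """Average the GloVe vectors in a +/- `window` context around each token."""
    feats = []
    for i in range(len(tokens)):
        context = tokens[max(0, i - window): i + window + 1]
        vecs = [vectors[t.lower()] for t in context if t.lower() in vectors]
        feats.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32))
    return np.stack(feats)  # shape: (len(tokens), dim)

# Hypothetical usage; "glove.6B.100d.txt" is a placeholder path.
# glove = load_glove("glove.6B.100d.txt")
# feats = window_features("He tackled the problem head on".split(), glove)
# These per-token features could then be concatenated with BERT's hidden states
# before the classification layer.
```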
Rank Team P R F1
All POS
1 DeepMet .756 .783 .769
2 Go Figure! .721 .748 .734
3 illiniMet .746 .715 .730
4 rowanhm .727 .709 .718
5 Baseline 3: BERT .712 .725 .718
6 zhengchang .696 .729 .712
7 chasingkangaroos .702 .704 .703
8 Duke Data Science .662 .699 .680
9 Zenith .630 .716 .670
10 umd bilstm .733 .601 .660
11 atr2112 .599 .672 .633
12 PolyU-LLT .556 .660 .603
13 iiegn .601 .591 .596
14 UoB team .653 .548 .596
15 Baseline 2: bot.zen .612 .575 .593
16 Baseline 1: UL + .510 .696 .589
Table 4: VUA Dataset: Performance and ranking of the best system per team and baselines, for the All POS track. (The original table also includes a Verbs track panel, which is not reproduced above.)
See pages 34-37 of the full proceedings PDF (page numbers relative to the entire PDF; ignore the page numbers printed at the bottom of each page, which are relative to each paper) for more details:
https://www.aclweb.org/anthology/2020.figlang-1.pdf
The full architecture of the baseline model can be found on page 250 of the same document (again, relative to the entire PDF).
(A number of the other teams used ensemble methods, which are useful for hill climbing but are not always useful in production.)
@jobarber has definitely completed this issue. Now we are just working with the various models.
We have constructed a couple of quick toy models. We now need a notebook (or package) we can all share to try out different systems and identify the best-performing one.
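As a starting point for that shared notebook or package, something like the sketch below could give every candidate system the same evaluation entry point. The function name, the predict_fn interface, and the toy keyword-matching system in the usage example are all hypothetical.

```python
from typing import Callable, Sequence
from sklearn.metrics import precision_recall_fscore_support

def evaluate_system(predict_fn: Callable[[Sequence[str]], Sequence[int]],
                    texts: Sequence[str],
                    gold: Sequence[int]):
    """Run a candidate system on shared data and report metaphor-class P/R/F1.

    predict_fn maps a list of texts to 0/1 labels (1 = metaphor).
    """
    preds = predict_fn(texts)
    p, r, f1, _ = precision_recall_fscore_support(
        gold, preds, average="binary", pos_label=1, zero_division=0
    )
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
    return {"precision": p, "recall": r, "f1": f1}

# Toy example with a trivial keyword-based "system", just to show the interface.
texts = ["He tackled the problem", "She opened the door", "Time is money"]
gold = [1, 0, 1]
evaluate_system(lambda xs: [1 if ("tackled" in x or " is " in x) else 0 for x in xs],
                texts, gold)
```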