snipsco / snips-nlu

Snips Python library to extract meaning from text
https://snips-nlu.readthedocs.io
Apache License 2.0

[Slot Filling] Improve data augmentation to make sure possible tag transitions are well represented #728

Open ClemDoum opened 5 years ago

ClemDoum commented 5 years ago

Problem description

Short description

Under certain conditions, some CRF tag transitions can be missing or underrepresented after data augmentation. We must ensure that all possible tag transitions appear in the augmented dataset so that inference does not systematically fail on those examples.

Example

Given a dataset with 1 intent and 3 slots: slot_1, slot_2, slot_3

Suppose only 5% of the utterances in the dataset follow the pattern: bla bla [slot_1] [slot_2] bla bla, and only 5% of the slot_1 entity values have length 1 (the other 95% have length 2). Then, when augmenting the data, the probability that a generated utterance contains the transition B-slot-1 B-slot-2 is 0.05 × 0.05 = 0.0025, so this transition will probably be missing from the training data.
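To make the estimate above concrete, here is a minimal sketch of the arithmetic. It assumes the utterance pattern and the slot value are sampled independently during augmentation, and the augmented dataset size `n_augmented` is a hypothetical number chosen only for illustration; neither comes from the Snips NLU code.

```python
# Figures taken from the example above.
p_pattern = 0.05  # share of utterances where [slot_1] is directly followed by [slot_2]
p_len_1 = 0.05    # share of slot_1 entity values that are a single token

# Chance that one augmented utterance contains B-slot-1 B-slot-2,
# assuming the pattern and the slot value are sampled independently.
p_transition = p_pattern * p_len_1


def prob_transition_missing(n_utterances: int, p: float = p_transition) -> float:
    """Probability that none of n independently sampled utterances has the transition."""
    return (1.0 - p) ** n_utterances


# Hypothetical augmented dataset size, chosen only for illustration.
n_augmented = 200
p_missing = prob_transition_missing(n_augmented)

print(f"P(transition in one utterance) = {p_transition:.4f}")
print(f"P(transition missing from {n_augmented} utterances) = {p_missing:.2f}")
```

Under these assumptions, the transition is more likely to be absent from the augmented data than present.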

Say slot_1 has the value word_1 and slot_2 has the value word_2 word_3. If the CRF sees "word_1 word_2 word_3", it will tag it as "B-slot-1 I-slot-1 B-slot-2" instead of "B-slot-1 B-slot-2 I-slot-2" because it has never seen the B-slot-1 B-slot-2 transition in the training data.
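One way to detect this situation before training is to count the tag transitions present in the augmented data and compare them with the transitions we expect to need. The sketch below is illustrative only: `count_transitions`, `missing_transitions`, and the toy tag sequences are hypothetical helpers and data, not part of the Snips NLU API.

```python
from collections import Counter


def count_transitions(tag_sequences):
    """Count every (previous_tag, next_tag) pair across all tagged utterances."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(zip(tags, tags[1:]))
    return counts


def missing_transitions(tag_sequences, required):
    """Return the required transitions that never occur in the data."""
    counts = count_transitions(tag_sequences)
    return [transition for transition in required if counts[transition] == 0]


# Toy augmented data: slot_1 always has a length 2 value before slot_2.
augmented = [
    ["O", "O", "B-slot-1", "I-slot-1", "B-slot-2", "O"],
    ["O", "B-slot-1", "I-slot-1", "B-slot-2", "I-slot-2"],
    ["O", "B-slot-3", "O"],
]

# Transitions that can occur when slot_1 is directly followed by slot_2.
required = [("B-slot-1", "B-slot-2"), ("I-slot-1", "B-slot-2")]

missing = missing_transitions(augmented, required)
print(missing)
```

Any transition reported as missing is a candidate for targeted augmentation, for example by forcing a few utterances that use a length 1 value of slot_1.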

Now suppose that, unluckily, users pick the length 1 value of slot_1 95% of the time. The CRF will then systematically fail in 95% × 5% = 4.75% of cases, which is pretty high.

Potential solutions