Initial behavior
When training the NLU on an assistant that has an entity with a large number of values, we noticed that the training and inference time of the NLU could increase significantly.
There were two reasons for that:
first, the CustomEntityParser, which uses a snips_nlu_parser.GazetteerEntityParser under the hood, was not making use of the n_gazetteer_stop_words configuration parameter. Using stop words when matching values with the gazetteer parser can have a dramatic impact on performance
secondly, when validating the dataset, we generated far too many variations of the same entity value. On some gazetteers we were going from 50k initial values to 800k values. Generating a lot of variety in the entity values brings robustness but increases both training and inference time. Moreover, generating entity value variations when we already have a lot of values might have a limited effect on robustness
Work done
When building the snips_nlu_parser.GazetteerEntityParser we now use n_gazetteer_stop_words. We set n_gazetteer_stop_words = len(entity_voca) * 0.001, where len(entity_voca) is the number of tokens in the entity vocabulary. This number was chosen after benchmarking several values and several entity data regimes
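The formula above can be sketched as follows. This is a minimal illustration, not the actual snips-nlu code; the function name and the way the vocabulary is built are assumptions:

```python
def compute_n_gazetteer_stop_words(entity_values):
    """Number of stop words for the gazetteer parser, proportional
    to the size of the entity vocabulary (hypothetical helper).

    The vocabulary is assumed to be the set of distinct tokens
    appearing across all entity values.
    """
    entity_voca = {token for value in entity_values for token in value.split()}
    return int(len(entity_voca) * 0.001)

# Tiny gazetteer: vocabulary is far below 1000 tokens, so no stop words
small = ["new york times", "the daily news"]
print(compute_n_gazetteer_stop_words(small))  # 0

# Large gazetteer: 5000 distinct tokens -> 5 stop words
large = ["token_%d" % i for i in range(5000)]
print(compute_n_gazetteer_stop_words(large))  # 5
```

With the 0.001 factor, stop words only kick in once the entity vocabulary exceeds 1000 tokens, which matches the intent of only paying the stop-word cost on large gazetteers.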
We also now generate string variations differently depending on the data regime, with 3 tiers:
if we have fewer than 1000 entity values, we generate all string variations
if we have between 1000 and 10000 values, we generate all variations except the number variations (which are the longest to generate, since we have to run Rustling on all entity values)
if we have more than 10000 entity values, we only generate normalization variations
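The tiered logic above can be sketched like this. The thresholds come from this PR, but the function name and the variation labels are illustrative, not the actual snips-nlu identifiers:

```python
def variation_kinds(n_entity_values):
    """Pick which string variations to generate for a custom entity,
    based on how many values it has (illustrative sketch)."""
    if n_entity_values < 1000:
        # Small gazetteer: generate every kind of variation
        return ["normalization", "case", "punctuation", "number"]
    if n_entity_values <= 10000:
        # Medium gazetteer: skip number variations, which are the
        # most expensive since Rustling must run on every value
        return ["normalization", "case", "punctuation"]
    # Large gazetteer: keep only the cheap normalization variations
    return ["normalization"]

print(variation_kinds(500))    # all variation kinds
print(variation_kinds(5000))   # everything except number variations
print(variation_kinds(50000))  # ['normalization']
```

Capping the variation kinds on large gazetteers is what keeps the 50k-values case from exploding to 800k values at training time.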
Checklist:
[x] My PR is ready for code review
[x] I have added some tests, if applicable, and run the whole test suite, including linting tests
[x] I have updated the documentation, if applicable