Initial behavior
When training the NLU on an assistant that has an entity with a large number of values, we noticed that the training and inference time of the NLU could increase significantly.
There were two reasons for that:
first, the CustomEntityParser, which uses a snips_nlu_parser.GazetteerEntityParser under the hood, was not making use of the n_gazetteer_stop_words configuration parameter. Using stop words when matching values with the gazetteer parser can have a dramatic impact on performance
secondly, when validating the dataset, we generated far too many variations of the same entity value. On some gazetteers we were going from 50k initial values to 800k values. Generating a lot of variety in the entity values brings robustness but increases both training and inference time. Moreover, generating entity value variations when we already have a lot of values might have a limited effect on robustness
Work done
When building the snips_nlu_parser.GazetteerEntityParser we now use n_gazetteer_stop_words. We set n_gazetteer_stop_words = len(entity_voca) * 0.001, where len(entity_voca) is the number of tokens in the entity vocabulary. This number was chosen after benchmarking several values and several entity data regimes
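The formula above can be sketched as follows. This is a minimal illustration, not the actual snips-nlu code; the function name and the way the vocabulary is built are assumptions:

```python
def compute_n_gazetteer_stop_words(entity_values):
    """Number of stop words for the gazetteer parser, proportional
    to the size of the entity vocabulary (hypothetical helper).

    The vocabulary is assumed to be the set of distinct tokens
    appearing across all entity values.
    """
    entity_voca = {token for value in entity_values for token in value.split()}
    return int(len(entity_voca) * 0.001)

# Tiny gazetteer: vocabulary is far below 1000 tokens, so no stop words
small = ["new york times", "the daily news"]
print(compute_n_gazetteer_stop_words(small))  # 0

# Large gazetteer: 5000 distinct tokens -> 5 stop words
large = ["token_%d" % i for i in range(5000)]
print(compute_n_gazetteer_stop_words(large))  # 5
```

With the 0.001 factor, stop words only kick in once the entity vocabulary exceeds 1000 tokens, which matches the intent of only paying the stop-word cost on large gazetteers.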
We also now generate string variations differently depending on the data regime, with 3 tiers:
if we have fewer than 1000 entity values, we generate all string variations
if we have between 1000 and 10000 values, we generate all variations except the number variations (which are the longest to generate, since we have to run Rustling on all entity values)
if we have more than 10000 entity values, we only generate normalization variations
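The tiered logic above can be sketched like this. The thresholds come from this PR, but the function name and the variation labels are illustrative, not the actual snips-nlu identifiers:

```python
def variation_kinds(n_entity_values):
    """Pick which string variations to generate for a custom entity,
    based on how many values it has (illustrative sketch)."""
    if n_entity_values < 1000:
        # Small gazetteer: generate every kind of variation
        return ["normalization", "case", "punctuation", "number"]
    if n_entity_values <= 10000:
        # Medium gazetteer: skip number variations, which are the
        # most expensive since Rustling must run on every value
        return ["normalization", "case", "punctuation"]
    # Large gazetteer: keep only the cheap normalization variations
    return ["normalization"]

print(variation_kinds(500))    # all variation kinds
print(variation_kinds(5000))   # everything except number variations
print(variation_kinds(50000))  # ['normalization']
```

Capping the variation kinds on large gazetteers is what keeps the 50k-values case from exploding to 800k values at training time.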
Checklist:
[x] My PR is ready for code review
[x] I have added some tests, if applicable, and run the whole test suite, including linting tests
[x] I have updated the documentation, if applicable