snipsco / snips-nlu

Snips Python library to extract meaning from text
https://snips-nlu.readthedocs.io
Apache License 2.0
3.89k stars, 513 forks

expand builtin entities #863

Closed: Garstig closed this issue 4 years ago

Garstig commented 4 years ago

Question

Hi,

is it possible to expand builtin entities? I want to expand the snips/datetime entity in German so that it can detect words like "gerade". Also, the parser sometimes fails to detect slots that are really common in the utterances. For example, the word "morgen" is just ignored.

Are the builtin entity parsers somehow trained on the utterances, or are they just the gazetteers? Gazetteers are just some sort of fancy rules that rely on dictionaries, aren't they?

I am also really confused by the gazetteer file "/gazetteers/top_200000_words_stemmed.txt" in the language resource directory. To me it looks like just random words. Is there a mapping somewhere that assigns the lines to categories?

Best, Garstig

adrienball commented 4 years ago

Hi @Garstig, the builtin entities cannot be expanded. As a workaround to catch values like "gerade", you can create a custom datetime entity (on top of the builtin snips/datetime) and add some logic to handle the custom entity values.
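A minimal sketch of that workaround, under some assumptions: the entity name `custom_datetime`, its values, and the helper `resolve_custom_datetime` are hypothetical, the entity block follows the JSON dataset layout as I understand it, and the slot dictionary mirrors the usual shape of a parsed custom slot. Treat it as an illustration of the idea rather than the library's own mechanism.

```python
from datetime import datetime, timedelta

# Hypothetical "entities" section of a dataset: the builtin entity is kept,
# and a custom entity is declared next to it for the values it misses.
DATASET_ENTITIES = {
    "snips/datetime": {},
    "custom_datetime": {
        "data": [
            {"value": "now", "synonyms": ["gerade"]},
            {"value": "tomorrow", "synonyms": ["morgen"]},
        ],
        "use_synonyms": True,
        "automatically_extensible": False,
        "matching_strictness": 1.0,
    },
}

# Hypothetical application-side logic: map each resolved custom value
# to an offset relative to the current time.
CUSTOM_OFFSETS = {
    "now": timedelta(0),
    "tomorrow": timedelta(days=1),
}


def resolve_custom_datetime(slot, now):
    """Return a datetime for a custom_datetime slot, or None if unhandled."""
    if slot.get("entity") != "custom_datetime":
        return None
    offset = CUSTOM_OFFSETS.get(slot["value"]["value"])
    return None if offset is None else now + offset
```

Slots filled by the builtin snips/datetime entity would keep going through the normal path; only slots tagged with the custom entity need this extra step.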

The builtin entity parser can parse two types of builtin entities, which we call Grammar Entities and Gazetteer Entities. Under the hood, grammar entities are handled by the rustling library. As their name suggests, they are based on grammars. A gazetteer entity, on the other hand, is just a simple list of values that we have curated.
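To make the distinction concrete, here is a toy contrast in Python. It is not how rustling or the real gazetteers are implemented: the city list, the time pattern, and both function names are made up for illustration. The gazetteer side is a lookup in a fixed list, while the grammar side is a rule that builds a structured value from a pattern.

```python
import re

# Toy gazetteer: a curated list of values (illustrative only).
CITY_GAZETTEER = {"berlin", "paris", "new york"}

# Toy grammar rule: a pattern whose pieces compose into a structured value.
TIME_RULE = re.compile(r"\b(\d{1,2}):(\d{2})\b")


def match_gazetteer(text):
    """Return the gazetteer values that appear as whole words in the text."""
    lowered = text.lower()
    return sorted(
        value
        for value in CITY_GAZETTEER
        if re.search(r"\b" + re.escape(value) + r"\b", lowered)
    )


def match_grammar(text):
    """Return (hour, minute) pairs for every HH:MM expression in the text."""
    return [(int(h), int(m)) for h, m in TIME_RULE.findall(text)]
```

A value that is missing from the list can never match on the gazetteer side, whereas the grammar side accepts any input that fits the rule.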

The gazetteer files that you see in the language resource directory are not linked to builtin entities. snips-nlu uses these files mainly to identify uncommon words. This makes it possible to build relevant features for the machine learning models that classify intents and extract slots.

Best
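As a rough illustration of how a frequent-word list can be turned into a feature, the sketch below flags tokens that are absent from the list. The function names and the `is_rare` key are hypothetical, and the sketch skips the stemming step that a stemmed word list would require, so it does not reproduce the actual snips-nlu feature code.

```python
def load_common_words(lines):
    """Build a set of frequent words from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}


def rare_word_features(tokens, common_words):
    """Tag each token with whether it is missing from the frequent-word set."""
    return [
        {"token": token, "is_rare": token.lower() not in common_words}
        for token in tokens
    ]
```

With a real file, the lines could come from something like `open(path, encoding="utf8")`, where `path` points at the word list in the language resources.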

Garstig commented 4 years ago

@adrienball Thank you very much for your response!