Simple notebooks that load several intent classifiers and conditional random fields (CRFs) for entity extraction, similar to a voice assistant's NLU engine.
Overview:
We want developers to be able to learn easily how NLU engines work: what the basic components of an NLU engine are, and how to use them to perform basic tasks.
Strangely, we couldn't find an example of building a basic intent matcher of the kind used in an NLU engine for voice assistants, where utterances (i.e. user commands or questions) are classified by their intent category. So we decided to make a simple one for everyone out there.
NLU_Intent_matching.ipynb explores the basic idea of intent matching on utterances and entity extraction. It is intended mostly for learning purposes. Using the NLU Evaluation Dataset as a basis, both a word embedding approach (using Word2Vec) and a TF-IDF approach are explored, using the following algorithms to train intent classifiers:
More may be added in the future.
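To make the TF-IDF approach concrete, here is a minimal sketch using only the Python standard library. It is not the notebook's code: the utterances and intent names are made up, and a simple nearest-centroid match by cosine similarity stands in for the trained classifiers the notebook actually explores.

```python
import math
from collections import Counter, defaultdict

# Made-up utterances and intent labels, for illustration only.
TRAIN = [
    ("set an alarm for seven", "alarm_set"),
    ("wake me up at six tomorrow", "alarm_set"),
    ("play some jazz music", "play_music"),
    ("play my favourite song", "play_music"),
    ("what is the weather today", "weather_query"),
    ("will it rain tomorrow", "weather_query"),
]

def tokenize(text):
    return text.lower().split()

def fit_idf(documents):
    """Smoothed inverse document frequency for every word in the training set."""
    n_docs = len(documents)
    doc_freq = Counter(w for doc in documents for w in set(tokenize(doc)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen words are dropped."""
    counts = Counter(w for w in tokenize(text) if w in idf)
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}

def train(examples):
    """Sum the TF-IDF vectors of each intent's utterances into one centroid."""
    idf = fit_idf([text for text, _ in examples])
    centroids = defaultdict(Counter)
    for text, intent in examples:
        centroids[intent].update(tfidf(text, idf))
    return idf, dict(centroids)

def predict(text, idf, centroids):
    """Return the intent whose centroid has the highest cosine similarity."""
    vec = tfidf(text, idf)

    def score(intent):
        centroid = centroids[intent]
        dot = sum(v * centroid.get(w, 0.0) for w, v in vec.items())
        norm = math.sqrt(sum(c * c for c in centroid.values()))
        return dot / norm if norm else 0.0

    return max(centroids, key=score)

idf, centroids = train(TRAIN)
print(predict("play a song", idf, centroids))
```

In practice a library vectoriser and a trained classifier would replace the hand-rolled parts; the sketch only shows how utterances become weighted word vectors that can be compared per intent.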
This notebook uses conditional random fields (CRFs) for entity extraction.
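As a rough illustration of what a CRF works with, the sketch below builds per-token feature dictionaries and turns a tag sequence back into entities. The feature set, the BIO tagging scheme, and the entity names (time, date) are assumptions chosen for the example, not necessarily what the notebook uses, and no CRF library is called here.

```python
def token_features(tokens, i):
    """Features for the token at position i, as a dict of name -> value."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],
    }
    # Context features: neighbouring words, or sentence-boundary markers.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features

def utterance_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

def bio_to_entities(tokens, tags):
    """Collapse BIO tags into (entity_type, text) pairs."""
    entities, current_type, current_words = [], None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [token]
        elif tag.startswith("I-"):
            current_words.append(token)
        else:  # "O": outside any entity
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

# Made-up utterance with hand-written tags standing in for CRF output.
tokens = "wake me up at seven am tomorrow".split()
tags = ["O", "O", "O", "O", "B-time", "I-time", "B-date"]
X = utterance_features(tokens)
print(X[4])
print(bio_to_entities(tokens, tags))
```

A CRF is trained on sequences of such feature dictionaries paired with tag sequences, so the choice of features here is exactly the part worth experimenting with.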
clean_dataset.ipynb is a notebook to clean the NLU dataset. The NLU Evaluation Dataset is the only broad voice assistant dataset we could find. However, it does have issues, so it is important to clean and refine it. We couldn't find a satisfactory existing solution, so we built a prototype to clean the dataset. You don't have to run this yourself, as the dataset has already been cleaned and is in data/NLU-Data-Home-Domain-Annotated-All-Cleaned.csv.
Based on the teaching example from NLU_Intent_matching.ipynb, we built an NLU engine located in utils/. It is a simple engine that uses intent classifiers and CRFs for entity extraction. By default it uses TF-IDF to encode the utterances and CRFs to extract entities. Personally, I am satisfied with the performance of the intent classifier; however, I think the features of the CRFs are worth exploring further. Perhaps Brown clustering could be used to improve the performance of the CRFs.
TODO: improve the performance of the CRFs, and write a report on feature usage to determine which features perform best and how well they perform.
An example of using the engine is shown in the notebook example_of_intent_and_entity_classification_with_NLU_engine_class.ipynb.
I am sure other improvements could be made to the engine; we are open to PRs.
The notebook Macro_NLU_Data_Refinement.ipynb is a prototype notebook to refine the NLU dataset. It uses the basic NLU engine described above, with Logistic Regression as the intent classifier, to spot issues. It creates reports on the data and the classifier performance, and it allows a user to go through the different domains (skills, scenarios, whatever you want to call them) and refine the data using IPYsheets, which are like Excel spreadsheets embedded in the notebook. A user can refine both intents and entities.
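The idea of using a classifier to spot issues can be sketched as follows: any utterance whose annotated intent disagrees with the classifier's prediction is flagged for review, grouped by domain. This is an assumption about the general approach, not the notebook's implementation; the column names (domain, utterance, intent), the rows, and the keyword-based stand-in classifier are all hypothetical.

```python
from collections import defaultdict

def flag_disagreements(rows, predict):
    """Group rows whose annotated intent differs from the prediction, by domain."""
    flagged = defaultdict(list)
    for row in rows:
        predicted = predict(row["utterance"])
        if predicted != row["intent"]:
            flagged[row["domain"]].append({**row, "predicted": predicted})
    return dict(flagged)

def keyword_predict(utterance):
    """Toy stand-in for a trained intent classifier (hypothetical rules)."""
    words = utterance.lower().split()
    if "play" in words:
        return "play_music"
    if "alarm" in words:
        return "alarm_set"
    return "unknown"

# Made-up rows; the second one is deliberately mislabelled.
rows = [
    {"domain": "alarm", "utterance": "set an alarm for six", "intent": "alarm_set"},
    {"domain": "alarm", "utterance": "play some jazz", "intent": "alarm_set"},
    {"domain": "music", "utterance": "play my playlist", "intent": "play_music"},
]

flagged = flag_disagreements(rows, keyword_predict)
for domain, items in flagged.items():
    for item in items:
        print(domain, item["utterance"], item["intent"], "->", item["predicted"])
```

A disagreement does not prove the annotation is wrong, since the classifier may be the one making the mistake, which is why the flagged rows go to a human for review rather than being corrected automatically.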
Once the dataset has been refined, we can use a more advanced NLU engine to train a model. Snips is a good example of a low-resource engine, while the DistilBERT joint classifier we are building is a good example of a state-of-the-art engine. The models will be publicly released in the future.
TODO: