snipsco / snips-nlu

Snips Python library to extract meaning from text
https://snips-nlu.readthedocs.io
Apache License 2.0
3.9k stars, 513 forks

Add word cooccurrence as intent classification features #737

Closed ClemDoum closed 5 years ago

ClemDoum commented 5 years ago

Goal

In experiments aimed at improving intent classification, adding word cooccurrences as features alongside tf-idf improved classification performance in cases where word transitions and/or word order are meaningful.

For instance, a model using only bag-of-unigram tf-idf features struggles to differentiate the following utterances: "turn on the light" (TurnLightsOn) and "is the light turned on" (CheckLightsStatus).

Adding ordered cooccurrence features can help in such cases.
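To make the example concrete, here is a minimal pure-Python sketch (not the code from this PR) comparing the two utterances. The helper `ordered_cooccurrences` is a hypothetical name used only for illustration; it lists every pair of words in which the first word appears before the second.

```python
from itertools import combinations


def ordered_cooccurrences(tokens):
    """Hypothetical helper: set of (w1, w2) pairs where w1 occurs before w2."""
    return {
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
    }


turn_on = "turn on the light".split()
check_status = "is the light turned on".split()

# Unigrams shared by both utterances: a bag-of-words model sees these
# as identical evidence for both intents.
shared_unigrams = set(turn_on) & set(check_status)

pairs_turn_on = ordered_cooccurrences(turn_on)
pairs_check_status = ordered_cooccurrences(check_status)

# Ordered pairs that appear in one utterance only carry the order signal.
only_turn_on = pairs_turn_on - pairs_check_status
only_check_status = pairs_check_status - pairs_turn_on
```

In this sketch the pair `("on", "light")` only shows up for "turn on the light", while `("light", "on")` only shows up for "is the light turned on", even though "on", "the" and "light" are unigrams common to both.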

Work done

1. Add optional cooccurrence features to the intent classification

Transformed the Featurizer object into a ProcessingUnit. The featurizer now relies on a TfidfVectorizer ProcessingUnit (wrapping the sklearn TfidfVectorizer) and optionally on a CooccurrenceVectorizer.

The feature extraction works as follows:

  1. tf-idf features are extracted from the dataset
  2. feature selection is applied to keep the best features
  3. if the added_cooccurrence_feature_ratio configuration parameter is > 0, cooccurrence features are added. The CooccurrenceVectorizer is fitted on all non-null utterances, and the cooccurrence features are then ranked using the same feature selection as for the tf-idf features. Only the top-k cooccurrence features are kept, where k is derived from added_cooccurrence_feature_ratio: with n tf-idf features, at most int(n * added_cooccurrence_feature_ratio) cooccurrence features are added to the feature matrix
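The top-k rule in step 3 can be sketched as follows. This is an illustration under assumptions, not the implementation in this PR: the function names are hypothetical, and the scores are taken as given, since the exact feature selection criterion is not spelled out here.

```python
def max_added_cooccurrence_features(n_tfidf_features,
                                    added_cooccurrence_feature_ratio):
    """Upper bound k on added cooccurrence features: int(n * ratio)."""
    return int(n_tfidf_features * added_cooccurrence_feature_ratio)


def select_top_k(scores, k):
    """Hypothetical helper: keep the k best-scored features.

    `scores` maps a feature (here a word pair) to a relevance score
    produced by whichever feature selection method is in use.
    Ties are broken alphabetically to keep the result deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [feature for feature, _ in ranked[:k]]


# Made-up scores, for illustration only.
cooccurrence_scores = {
    ("turn", "on"): 0.9,
    ("light", "on"): 0.8,
    ("the", "light"): 0.1,
}

# With n = 10 selected tf-idf features and a ratio of 0.25, k = int(2.5) = 2.
k = max_added_cooccurrence_features(10, 0.25)
selected = select_top_k(cooccurrence_scores, k)
```

With a ratio of 0, k is 0 and no cooccurrence feature is added, which matches the "optional" behaviour described above.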

Note: word order is preserved when computing word cooccurrences. Also note that the user can restrict the size of the window in which cooccurrences are searched for.
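A rough sketch of how a window restriction changes the extracted pairs is shown below. The function name and the exact window semantics (here, the maximum number of positions separating the two words) are assumptions made for illustration; the CooccurrenceVectorizer may define its window differently.

```python
def windowed_cooccurrences(tokens, window_size=None):
    """Ordered word pairs whose positions are at most `window_size` apart.

    `window_size=None` means no restriction: every later word is paired
    with the current one. This semantics is an assumption of the sketch.
    """
    pairs = set()
    for i, first in enumerate(tokens):
        if window_size is None:
            end = len(tokens)
        else:
            end = min(len(tokens), i + 1 + window_size)
        for second in tokens[i + 1:end]:
            # The pair keeps the order in which the words were seen.
            pairs.add((first, second))
    return pairs


tokens = "turn on the light".split()
adjacent_only = windowed_cooccurrences(tokens, window_size=1)
unrestricted = windowed_cooccurrences(tokens)
```

A small window keeps only local transitions such as `("turn", "on")`, while an unrestricted window also captures long-range pairs such as `("turn", "light")`, at the cost of more candidate features to rank.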

2. Miscellaneous

Important note: this PR introduces breaking changes

Checklist:

codecov-io commented 5 years ago

Codecov Report

Merging #737 into develop will increase coverage by 0.44%. The diff coverage is 93.49%.

@@             Coverage Diff             @@
##           develop     #737      +/-   ##
===========================================
+ Coverage    88.17%   88.61%   +0.44%     
===========================================
  Files           75       75              
  Lines         4304     4604     +300     
  Branches       832      895      +63     
===========================================
+ Hits          3795     4080     +285     
  Misses         387      387              
- Partials       122      137      +15