Closed ClemDoum closed 5 years ago
Merging #737 into develop will increase coverage by 0.44%. The diff coverage is 93.49%.
@@             Coverage Diff             @@
##           develop     #737      +/-   ##
===========================================
+ Coverage    88.17%   88.61%   +0.44%
===========================================
  Files           75       75
  Lines         4304     4604     +300
  Branches       832      895      +63
===========================================
+ Hits          3795     4080     +285
  Misses         387      387
- Partials       122      137      +15
Goal
During experiments to improve intent classification, adding word cooccurrences as features alongside tf-idf was found to improve classification performance in cases where word transitions and/or word order are meaningful.
For instance, a model using bag-of-unigrams tf-idf features struggles to differentiate the following utterances: "turn on the light" (`TurnLightsOn`) and "is the light turned on" (`CheckLightsStatus`).
Adding ordered cooccurrence features can help in such cases.
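To make the motivation concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper name `ordered_cooccurrences` and the whitespace tokenization are assumptions made for illustration only. It shows that the two utterances share most of their unigrams while their ordered word pairs differ.

```python
def ordered_cooccurrences(utterance):
    """Return the set of ordered pairs (w1, w2) where w1 appears before w2.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which is an assumption.
    """
    tokens = utterance.lower().split()
    pairs = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:]:
            pairs.add((left, right))
    return pairs


turn_on = "turn on the light"           # TurnLightsOn
check = "is the light turned on"        # CheckLightsStatus

# Bag-of-unigrams view: most of the words are shared.
shared_words = set(turn_on.split()) & set(check.split())

# Ordered cooccurrence view: the word order separates the two utterances.
pairs_turn_on = ordered_cooccurrences(turn_on)
pairs_check = ordered_cooccurrences(check)
shared_pairs = pairs_turn_on & pairs_check

print(sorted(shared_words))
print(sorted(shared_pairs))
```

In this sketch, the pair `("on", "light")` only appears for the first utterance and `("light", "on")` only for the second, which is the kind of signal a bag of unigrams cannot carry.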
Work done
1. Add optional cooccurrence features to the intent classification
Transformed the `Featurizer` object into a `ProcessingUnit`. The featurizer now relies on a `TfidfVectorizer` `ProcessingUnit` (wrapping the `sklearn` `TfidfVectorizer`) and optionally on a `CooccurrenceVectorizer`.
The feature extraction works as follows:
If the `added_cooccurrence_feature_ratio` parameter is > 0, the cooccurrence features will be added. The `CooccurrenceVectorizer` is fitted on all non-null utterances, and the cooccurrence features are then ranked using the same feature selection as for the tf-idf features. The top-k cooccurrence features will be used. k is computed using the `added_cooccurrence_feature_ratio` config parameter: if we have n tf-idf features, then at most `int(n * added_cooccurrence_feature_ratio)` cooccurrence features will be added to the feature matrix.
Note: the order of the words is kept when computing word cooccurrence. Also note that the user can restrict the size of the window in which we'll look for cooccurrences.
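The budget computation and the window restriction described above can be sketched as follows. This is a hedged illustration, not the implementation from this PR: the function names, the `window_size` parameter name, and the scores are all hypothetical, and the scores simply stand in for whatever the feature selection step produces.

```python
def window_cooccurrences(tokens, window_size):
    """Ordered pairs (w1, w2) with w2 at most `window_size` tokens after w1.

    `window_size` is a hypothetical name for the window restriction.
    """
    pairs = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window_size]:
            pairs.add((left, right))
    return pairs


def select_cooccurrence_features(scores, num_tfidf_features,
                                 added_cooccurrence_feature_ratio):
    """Keep at most int(n * ratio) cooccurrence features, best score first.

    `scores` maps each word pair to a placeholder relevance score.
    """
    max_added = int(num_tfidf_features * added_cooccurrence_feature_ratio)
    # Sort by decreasing score; break ties on the pair itself for determinism.
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
    return ranked[:max_added]


tokens = "is the light turned on".split()
narrow_pairs = window_cooccurrences(tokens, window_size=1)

# Placeholder scores, made up for the example.
scores = {("turn", "on"): 0.9, ("light", "on"): 0.7, ("the", "light"): 0.1}
selected = select_cooccurrence_features(
    scores, num_tfidf_features=10, added_cooccurrence_feature_ratio=0.25)
disabled = select_cooccurrence_features(
    scores, num_tfidf_features=10, added_cooccurrence_feature_ratio=0.0)

print(sorted(narrow_pairs))
print(selected)
print(disabled)
```

With 10 tf-idf features and a ratio of 0.25, the budget is `int(2.5)`, so two pairs are kept in this sketch; a ratio of 0 keeps none, which matches the features being optional.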
2. Miscellaneous
Changes to the `EntityParser` class to make it more generic and to ease the implementation of other entity parsers.
Important note: this PR is breaking.
Checklist: