stefanobellelli opened this issue 6 years ago
Now it is possible to exclude certain n-grams at will.
However, POS n-grams can currently only be discarded in bulk, not by the words they stand for. For example, one can discard all VBZ_DT n-grams, but cannot discard only those POS n-grams that stand for word n-grams such as "is_a".
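To illustrate the requested behavior, here is a minimal sketch (not the project's actual code; `BLOCKED_WORD_BIGRAMS` and `pos_bigrams` are hypothetical names) of discarding a POS bigram only when the word bigram it stands for is blocklisted:

```python
# Hypothetical sketch: keep POS bigrams in general, but skip the
# instances that stand for a blocklisted word bigram such as "is_a".
BLOCKED_WORD_BIGRAMS = {"is_a"}  # assumed blocklist, lowercased joined words

def pos_bigrams(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs, e.g. from a POS tagger."""
    out = []
    for (w1, p1), (w2, p2) in zip(tagged_tokens, tagged_tokens[1:]):
        if f"{w1}_{w2}".lower() in BLOCKED_WORD_BIGRAMS:
            continue  # drop this VBZ_DT instance, but only for "is_a"
        out.append(f"{p1}_{p2}")
    return out

print(pos_bigrams([("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("test", "NN")]))
# → ['DT_VBZ', 'DT_NN']  (the "is_a" instance of VBZ_DT is skipped)
```

Note that other VBZ_DT instances (e.g. "has_the") would still be counted, which is exactly what bulk exclusion cannot do.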
Ah... okay. I thought spacy had specific tags for auxiliaries, but I see that is not the case. Then that's actually good news, because the pattern is unlikely to affect the POS tag experiments (since it presumably stands for many more combinations).
I would then run on the word bigrams with the is-a patterns removed and see what happens there, just to be sure.
So... are we actually interested in developing a more precise feature to block the POS n-grams that stand for is_a word n-grams? Because it would take a day or so to implement.
No, no. I simply mean that when you run the SVM on word bigrams rather than POS, you can exclude is-a then. Does that make sense?
Yeah, I got that; it's already possible (just add the n-grams you'd like to discard to Conf.exclude).
It should be possible to discard (i.e. not count when building vectors) arbitrarily-chosen n-grams.