stefanobellelli / nonce3vec

Fast-mapping machines making informed decisions.
MIT License

Add option to exclude certain n-grams #6

Open stefanobellelli opened 6 years ago

stefanobellelli commented 6 years ago

It should be possible to discard (i.e. not count when building vectors) arbitrarily-chosen n-grams.

stefanobellelli commented 6 years ago

Now it is possible to exclude certain n-grams at will.

However, POS n-grams can currently only be discarded in bulk, not by the words they stand for. For example, one can discard all VBZ_DT n-grams, but cannot discard only those POS n-grams that stand for word n-grams such as "is_a".
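
To make the difference concrete, here is a minimal sketch, not the project's actual code. The tagged tokens, the `pos_bigrams` function and its `exclude_pos` / `exclude_words` parameters are all hypothetical names chosen for illustration. It contrasts bulk exclusion of a POS bigram with exclusion keyed on the underlying words.

```python
from collections import Counter

# Hypothetical pre-tagged tokens as (word, POS) pairs; illustration only.
tagged = [("this", "DT"), ("is", "VBZ"), ("a", "DT"), ("dog", "NN"),
          ("that", "WDT"), ("has", "VBZ"), ("the", "DT"), ("ball", "NN")]

def pos_bigrams(tokens, exclude_pos=frozenset(), exclude_words=frozenset()):
    """Count POS bigrams, skipping excluded POS keys or excluded word keys."""
    counts = Counter()
    for (w1, p1), (w2, p2) in zip(tokens, tokens[1:]):
        pos_key = f"{p1}_{p2}"
        word_key = f"{w1.lower()}_{w2.lower()}"
        if pos_key in exclude_pos or word_key in exclude_words:
            continue
        counts[pos_key] += 1
    return counts

# Bulk exclusion: every VBZ_DT is dropped, including the one for "has_the".
bulk = pos_bigrams(tagged, exclude_pos={"VBZ_DT"})

# Word-keyed exclusion: only the VBZ_DT that stands for "is_a" is dropped.
fine = pos_bigrams(tagged, exclude_words={"is_a"})
```

With this sample, `bulk` contains no VBZ_DT entry at all, while `fine` still counts the VBZ_DT that comes from "has_the".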

minimalparts commented 6 years ago

Ah... okay. I thought spaCy had specific tags for auxiliaries but I see it is not the case. Then that's actually good news because the pattern is unlikely to affect the POS tag experiments (since it presumably stands for many more combinations).

I would then run on the word bigrams removing is-a patterns and see what happens there. Just to be sure.

stefanobellelli commented 6 years ago

So... are we actually interested in developing a more precise feature to block the POS n-grams that stand for is_a word n-grams? Because it would take a day or so to implement.

minimalparts commented 6 years ago

No, no. I simply mean that when you run the SVM on word bigrams rather than POS, you can exclude is-a then. Does that make sense?

stefanobellelli commented 6 years ago

Yeah, I got that. It's already possible: just add the n-grams you'd like to discard to Conf.exclude.
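
As a rough sketch of what that looks like for word bigrams: the `Conf` class below is a hypothetical stand-in (the real `Conf` in the repository may be structured differently), and `word_bigram_counts` is an assumed helper name, not a function from the project.

```python
from collections import Counter

class Conf:
    # Hypothetical stand-in for the project's configuration object.
    # The "word_word" key format is an assumption based on "is_a" above.
    exclude = {"is_a", "is_an"}

def word_bigram_counts(sentence, exclude=Conf.exclude):
    """Count word bigrams in a sentence, skipping any listed in `exclude`."""
    words = sentence.lower().split()
    bigrams = (f"{a}_{b}" for a, b in zip(words, words[1:]))
    return Counter(bg for bg in bigrams if bg not in exclude)

counts = word_bigram_counts("A wampimuk is a small furry animal")
```

Here the "is_a" bigram never reaches the counts, so it would not contribute to the vectors fed to the SVM.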