Original issue reported on code.google.com by
torsten....@gmail.com
on 2 Feb 2014 at 12:40
One current behavior that seems odd to me is that stopword tokens are simply
removed from the list of tokens before n-grams are computed. This produces
skip-n-gram-like n-grams as a side effect.
Sentence: "Birds chase cats."
[run FE]
// the trigram is extracted, the skip bigram is not:
assertTrue(features.contains(new Feature("view2NG_birds_chase_cats", 1)));
assertTrue(features.contains(new Feature("view2NG_birds_cats", 0)));
extractor.stopwords = new HashSet<String>();
extractor.stopwords.add("chase");
[run FE again]
// removing "chase" now splices the remaining tokens into a skip bigram:
assertTrue(features.contains(new Feature("view2NG_birds_cats", 1)));
Do we want this?
Original comment by EmilyKJa...@gmail.com
on 2 Feb 2014 at 12:48
No, producing skip n-grams this way is definitely not the intended behaviour.
Original comment by torsten....@gmail.com
on 2 Feb 2014 at 12:53
In consequence, n-grams need to be glued together *before* the stopword filter
is applied (for n > 1), and we need an additional parameter to allow for
"strict" or "soft" filtering, as sketched below.
Original comment by daxenber...@gmail.com
on 3 Feb 2014 at 2:15
This issue was updated by revision r568.
I cleaned up filterNgram: the previous variable names were misleading. I added
the suggested code to switch between filtering n-grams that contain any
stopword and filtering only those that consist entirely of stopwords. If we
want to keep both modes, the choice needs to be passed in as a parameter. I
also fixed the bug that created "skip ngrams" through stopword removal; the
fixed order of operations is sketched below.
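A minimal sketch of the fixed order of operations, reusing the hypothetical
keepNgram helper from above (the actual code in r568 may differ):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class NGramExtractor {

    // N-grams are built from the ORIGINAL token sequence and filtered
    // afterwards, so stopword removal can only drop whole n-grams and
    // can never splice distant tokens into skip n-grams.
    public static List<String> extractNgrams(List<String> tokens, int n,
            Set<String> stopwords, boolean strict) {
        List<String> ngrams = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            List<String> window = tokens.subList(i, i + n);
            if (NGramStopwordFilter.keepNgram(window, stopwords, strict)) {
                ngrams.add(String.join("_", window));
            }
        }
        return ngrams;
    }
}

For "Birds chase cats." with stopword "chase" and strict filtering, the
bigrams birds_chase and chase_cats are dropped, and birds_cats is never
formed in the first place.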
Original comment by EmilyKJa...@gmail.com
on 3 Feb 2014 at 2:56
This issue was updated by revision r569.
I fixed a couple of small bugs introduced in the previous revision.
Original comment by EmilyKJa...@gmail.com
on 3 Feb 2014 at 5:16
Fixed with r571
Original comment by torsten....@gmail.com
on 4 Feb 2014 at 3:03
Original comment by daxenber...@gmail.com
on 6 Feb 2014 at 9:56