google-code-export / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

Add parameter to make behaviour of ngram stopword filtering configurable #87

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
We should be able to switch between:
a) filter ngram if it contains any stopword, and
b) filter ngram if it entirely consists of stopwords.

a) should be the default according to web research by Oliver.
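
A minimal sketch of the two modes, to make the difference concrete. The class name `NgramStopwordFilter`, the method `shouldFilter`, and the flag `filterPartialMatches` are hypothetical and not taken from the DKPro TC code base; the sketch also assumes the stopword set holds lowercase entries.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not the DKPro TC implementation.
final class NgramStopwordFilter {

    /**
     * Decides whether an ngram should be dropped.
     *
     * filterPartialMatches == true  : mode (a), drop if ANY token is a stopword.
     * filterPartialMatches == false : mode (b), drop only if ALL tokens are stopwords.
     *
     * Assumption: stopwords are stored in lowercase, so tokens are lowercased
     * before the lookup.
     */
    static boolean shouldFilter(List<String> ngram, Set<String> stopwords,
                                boolean filterPartialMatches) {
        int hits = 0;
        for (String token : ngram) {
            if (stopwords.contains(token.toLowerCase())) {
                hits++;
            }
        }
        if (filterPartialMatches) {
            return hits > 0;
        }
        // An empty ngram contains no stopwords, so it is never filtered here.
        return !ngram.isEmpty() && hits == ngram.size();
    }
}
```

With the stopwords "of" and "the", the bigram ("of", "cats") is dropped in mode (a) but kept in mode (b), while ("of", "the") is dropped in both.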

Original issue reported on code.google.com by torsten....@gmail.com on 2 Feb 2014 at 12:40

GoogleCodeExporter commented 9 years ago
One current behavior that seems odd to me is that stopword tokens are simply
removed from the token list before ngrams are computed. This produces
skip-ngram-type ngrams.

Sentence: "Birds chase cats."
[run FE]
assertTrue(features.contains(new Feature("view2NG_birds_chase_cats", 1)));
assertTrue(features.contains(new Feature("view2NG_birds_cats", 0)));

extractor.stopwords = new HashSet<String>();
extractor.stopwords.add("chase");
[run FE again]

assertTrue(features.contains(new Feature("view2NG_birds_cats", 1)));

Do we want this?

Original comment by EmilyKJa...@gmail.com on 2 Feb 2014 at 12:48

GoogleCodeExporter commented 9 years ago
No, producing skip ngrams this way is definitely not the intended behaviour.

Original comment by torsten....@gmail.com on 2 Feb 2014 at 12:53

GoogleCodeExporter commented 9 years ago
Consequently, n-grams need to be glued together *before* the stopword filter is
applied (in case n > 1), and we need another parameter to choose between
"strict" and "soft" filtering.
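
A rough sketch of that ordering: build every ngram from the unmodified token list first, then decide per ngram whether to drop it. The class `NgramExtractionSketch`, the method `extract`, and the `strict` flag are hypothetical names, not the DKPro TC API. The sketch assumes "strict" means dropping an ngram that contains any stopword and "soft" means dropping it only if every token is a stopword, and it assumes `minN >= 1`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch for illustration only; not the DKPro TC implementation.
final class NgramExtractionSketch {

    /**
     * Builds ngrams of length minN..maxN over the ORIGINAL token sequence and
     * filters each complete ngram afterwards, so removing a stopword can never
     * join two tokens that were not adjacent in the text.
     *
     * strict == true  : drop an ngram containing any stopword (assumed meaning).
     * strict == false : drop an ngram only if all its tokens are stopwords.
     */
    static List<String> extract(List<String> tokens, int minN, int maxN,
                                Set<String> stopwords, boolean strict) {
        List<String> result = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int start = 0; start + n <= tokens.size(); start++) {
                List<String> window = tokens.subList(start, start + n);
                int hits = 0;
                for (String token : window) {
                    if (stopwords.contains(token)) {
                        hits++;
                    }
                }
                boolean drop = strict ? hits > 0 : hits == n;
                if (!drop) {
                    result.add(String.join("_", window));
                }
            }
        }
        return result;
    }
}
```

For the tokens "birds", "chase", "cats" with "chase" as the only stopword and n from 1 to 2, strict filtering leaves "birds" and "cats", and soft filtering leaves "birds", "cats", "birds_chase" and "chase_cats". Neither mode yields "birds_cats".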

Original comment by daxenber...@gmail.com on 3 Feb 2014 at 2:15

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r568.

I cleaned up filterNgram: the previous variable names were misleading. I added
the suggested code to switch between filtering for any stopwords and filtering
for all stopwords. If we want to keep this, the choice needs to be passed in as
a parameter. I also fixed the bug that created "skip ngrams" through stopword
removal.

Original comment by EmilyKJa...@gmail.com on 3 Feb 2014 at 2:56

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r569.

I fixed a couple of small bugs introduced in the last revision.

Original comment by EmilyKJa...@gmail.com on 3 Feb 2014 at 5:16

GoogleCodeExporter commented 9 years ago
Fixed with r571

Original comment by torsten....@gmail.com on 4 Feb 2014 at 3:03

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 6 Feb 2014 at 9:56