nltk / nltk_data

NLTK Data

Wrong German stopwords in stopwords corpora #19

Open juh2 opened 9 years ago

juh2 commented 9 years ago

nltk_data/packages/corpora/stopwords.zip contains four wrong German stopwords:

unse
unsem
unsen
unses

alvations commented 7 years ago

The "non-words" raised by @juh2 should have been resolved in #49:

>>> from nltk.corpus import stopwords
>>> deu_stops = stopwords.words('german')
>>> 'unse' in deu_stops
False
>>> 'unsem' in deu_stops
False
>>> 'unsen' in deu_stops
False
>>> 'unses' in deu_stops
False
>>> 'unsere' in deu_stops  # a valid stopword
True
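
For anyone still working from an older copy of stopwords.zip, a local workaround might look like the sketch below. It is only an illustration: `KNOWN_BAD` and `drop_known_bad` are hypothetical names, not part of NLTK, and the short sample list stands in for the result of `stopwords.words('german')`.

```python
# Hypothetical helper, not part of NLTK: filters the four non-words
# reported in this issue out of a stopword list.
KNOWN_BAD = {"unse", "unsem", "unsen", "unses"}


def drop_known_bad(words):
    """Return the words in their original order, minus the reported non-words."""
    return [w for w in words if w not in KNOWN_BAD]


# Stand-in sample; in practice this would be stopwords.words('german').
sample = ["unser", "unse", "unsem", "unsere", "unsen", "unses"]
print(drop_known_bad(sample))
```

With an up-to-date corpus the filter should be a no-op, since the checks above already return False for all four words.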

But more German stopwords are still missing, to list a few:

>>> 'unserige' in deu_stops
False
>>> 'unserins' in deu_stops
False
>>> 'unseriner' in deu_stops
False

hebecked commented 3 years ago

"unserins" and "unseriner" are not German words. Do you mean "unsereins" and "unsereiner"?

stevenbird commented 2 years ago

Please propose a definitive list of German stopwords and I will update our list.
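
One way to prepare such a proposal is to diff a candidate list against the list currently shipped, so that reviewers see exactly which words would be added or dropped. The following is a minimal sketch under stated assumptions: `diff_stopwords` is a hypothetical helper, and both sample lists are illustrative stand-ins rather than the real corpus contents or a vetted proposal.

```python
def diff_stopwords(current, proposed):
    """Compare two stopword lists.

    Returns (missing, extra): words only in `proposed`, and words only
    in `current`, each sorted alphabetically. Entries are lowercased
    and stripped; blank entries are ignored.
    """
    current_set = {w.strip().lower() for w in current if w.strip()}
    proposed_set = {w.strip().lower() for w in proposed if w.strip()}
    missing = sorted(proposed_set - current_set)
    extra = sorted(current_set - proposed_set)
    return missing, extra


# Illustrative stand-ins; in practice `current` could come from
# stopwords.words('german') and `proposed` from a reviewed word list.
current = ["unser", "unsere", "unse", "unsem"]
proposed = ["unser", "unsere", "unsereiner", "unsereins"]

missing, extra = diff_stopwords(current, proposed)
print("to add:", missing)
print("to remove:", extra)
```

Whether any given word belongs on the list is a linguistic judgment the diff cannot make; it only makes the proposed changes explicit.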