Closed by jswatsch 2 months ago
The NLTK stopword lists are available here: https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip. I usually have to augment these lists for my purposes, for instance by adding archaic words like "thee", "thou", etc., but at least NLTK is a known source and the lists are publicly available. All the languages you've identified (and more) are included.
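For anyone who wants to try this, here is a minimal sketch of loading an NLTK list and augmenting it. The helper names (`load_nltk_stopwords`, `augment_stopwords`) and the `ARCHAIC_EN` word list are made up for illustration and are not part of NLTK; only `nltk.download` and `stopwords.words` are NLTK calls.

```python
from typing import Iterable, Set


def load_nltk_stopwords(language: str = "english") -> Set[str]:
    """Return NLTK's stop word list for `language` as a set."""
    # Imported lazily so the rest of this sketch works without NLTK installed.
    import nltk
    from nltk.corpus import stopwords

    # Fetches the stopwords corpus if it is not already present locally.
    nltk.download("stopwords", quiet=True)
    return set(stopwords.words(language))


def augment_stopwords(base: Iterable[str], extra: Iterable[str]) -> Set[str]:
    """Combine a base list with project-specific additions, lowercased."""
    return {w.lower() for w in base} | {w.lower() for w in extra}


# Hypothetical archaic additions (an assumption); extend to suit the corpus.
ARCHAIC_EN = ["thee", "thou", "thy", "thine", "ye"]
```

With NLTK installed, something like `augment_stopwords(load_nltk_stopwords("english"), ARCHAIC_EN)` should give the combined set.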
Alternatively, spaCy has an English stop words list that includes all contractions and a bunch of other words. It's here, though not as a plain list; it is defined as a Python set named STOP_WORDS: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py . The words are exposed, so they can be imported directly. I usually use either NLTK, as John recommended, or spaCy's list.
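A rough sketch of pulling spaCy's English list and using it to filter tokens might look like the following. `STOP_WORDS` is the set spaCy exposes in `spacy.lang.en.stop_words`; the wrapper `load_spacy_stopwords` and the `remove_stopwords` helper are hypothetical names used only for this example.

```python
from typing import Iterable, List, Set


def load_spacy_stopwords() -> Set[str]:
    """Return a copy of spaCy's English stop word set."""
    # Imported lazily so remove_stopwords can be used without spaCy installed.
    from spacy.lang.en.stop_words import STOP_WORDS

    return set(STOP_WORDS)


def remove_stopwords(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop tokens whose lowercase form appears in `stop_words`."""
    return [t for t in tokens if t.lower() not in stop_words]
```

Because both sources end up as plain Python sets, the NLTK and spaCy lists can be merged with a set union if a broader list is wanted.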
I'm not aware of any non-English stop word lists off the top of my head, but will look for some.
Currently we have placeholders for four languages: English, Spanish, French, and German. We could add more if people think they are necessary.
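If the NLTK lists end up filling those placeholders, one possible way to wire up the four languages is a small lookup table. This is only a sketch: the ISO 639-1 keys, the `PLACEHOLDER_LANGUAGES` name, and the `nltk_language_name` helper are assumptions, not anything the project has defined. The values are the language names used in NLTK's stopwords corpus.

```python
# Hypothetical mapping from ISO 639-1 codes to NLTK stopwords corpus names.
PLACEHOLDER_LANGUAGES = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
}


def nltk_language_name(code: str) -> str:
    """Look up the NLTK corpus name for a language code (case-insensitive)."""
    try:
        return PLACEHOLDER_LANGUAGES[code.lower()]
    except KeyError:
        raise ValueError(
            f"No stop word list configured for language code {code!r}"
        ) from None
```

Adding another language would then be a one-line change to the dictionary.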