htrc / torchlite-frontend

Torchlite web interface
https://torchlite-htrc.vercel.app

Identify exemplary stop word lists to use as default lists #116

Closed by jswatsch 1 week ago

jswatsch commented 1 month ago

Currently we have placeholders for four languages: English, Spanish, French, and German. We could add more if people think they are necessary.

jawalsh commented 1 month ago

The NLTK stop word lists are available here: https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip. I usually have to augment these lists for my purposes, for instance by adding archaic words like "thee" and "thou", but at least NLTK is a known source and its lists are publicly available. All the languages you've identified (and more) are included.
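As a rough illustration of the augmentation described above, the following Python sketch merges a base list with extra archaic words. The helper names `load_nltk_stop_words` and `augment_stop_words` and the `ARCHAIC_WORDS` tuple are hypothetical, and the loader assumes `nltk` is installed and the `stopwords` corpus has already been downloaded (for example with `nltk.download("stopwords")`).

```python
# Hypothetical extras; extend these to suit the corpus at hand.
ARCHAIC_WORDS = ("thee", "thou", "thy", "thine", "hath", "doth")


def load_nltk_stop_words(language="english"):
    """Return NLTK's stop word list for one language.

    Assumes nltk is installed and the 'stopwords' corpus is downloaded.
    """
    # Imported lazily so the rest of the module works without nltk.
    from nltk.corpus import stopwords

    return stopwords.words(language)


def augment_stop_words(base, extra=ARCHAIC_WORDS):
    """Combine a base list with extra words, lowercased and de-duplicated."""
    return frozenset(word.lower() for word in base) | frozenset(
        word.lower() for word in extra
    )
```

With NLTK available, a default English list could then be built with `augment_stop_words(load_nltk_stop_words("english"))`, and the same call pattern would apply to `"spanish"`, `"french"`, and `"german"`.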

rdubnic2 commented 1 month ago

Alternatively, spaCy has an English stop word list that includes all contractions and a bunch of other words. It's here, defined not as a list but as (I think) a Python set: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py . Either way, the words are exposed. I usually use either the NLTK lists that John recommended or spaCy's list.
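For reference, the words in that module can be imported as `STOP_WORDS`. The sketch below assumes `spacy` is installed; `load_spacy_english_stop_words` and `remove_stop_words` are hypothetical helper names, not part of spaCy.

```python
def load_spacy_english_stop_words():
    """Return spaCy's English stop words as an immutable set.

    Assumes spacy is installed; no language model download is needed
    just to read the stop word module.
    """
    # Imported lazily so remove_stop_words stays usable without spacy.
    from spacy.lang.en.stop_words import STOP_WORDS

    return frozenset(STOP_WORDS)


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form appears in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]
```

A call such as `remove_stop_words(tokens, load_spacy_english_stop_words())` would filter a token list against spaCy's defaults; any other collection of lowercase words can be passed in the same way.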

I'm not aware of any non-English stop word lists off the top of my head, but will look for some.