quanteda / stopwords

Multilingual Stopword Lists in R
http://stopwords.quanteda.io

organise data more consistently by source type #2

Closed kbenoit closed 6 years ago

kbenoit commented 6 years ago

Standard language IDs are not the same as standard word lists

For instance, Snowball maintains a relatively consistent set of stopwords across languages. This list is already pretty substantial, includes most of the ones already used in quanteda, and is (I think) the source of the list in NLTK.

The hierarchy of http://snowball.tartarus.org/dist/snowball_all.tgz contains all of the language lists in quanteda::data_char_stopwords, except the relatively recent additions of Chinese, Catalan, and Arabic. (There are also Romanian and Irish lists inside the Snowball tarball, if you know where to look.) See also here.

The word lists at various "ISO" repositories purport to be some sort of "standard" list, but the only thing standard about them is their use of the ISO-639 language identifiers. The actual word lists vary widely in quality, length, and scope - see the sources at https://github.com/stopwords-iso/stopwords-iso/blob/master/CREDITS.md.

However, given the prominence of the https://github.com/stopwords-iso collection, it seems to make sense to keep it as a "source".
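
To make the source-versus-language distinction concrete, here is a minimal sketch of a lookup keyed first by source and then by ISO-639 language code. It is written in Python purely for illustration and is not the package's R implementation. The names `STOPWORDS_BY_SOURCE`, `get_stopwords`, and `sources_for` are hypothetical, and the tiny word lists are placeholders rather than the real contents of either source.

```python
# Hypothetical registry: source name -> ISO-639 language code -> word list.
# The word lists are illustrative placeholders, NOT the real source contents.
STOPWORDS_BY_SOURCE = {
    "snowball": {
        "en": ["the", "and", "of"],
        "de": ["der", "und"],
    },
    "stopwords-iso": {
        "en": ["the", "and", "of", "a"],
    },
}


def get_stopwords(language, source="snowball"):
    """Return a copy of the word list for `language` as provided by `source`."""
    if source not in STOPWORDS_BY_SOURCE:
        raise ValueError(f"unknown source: {source!r}")
    by_language = STOPWORDS_BY_SOURCE[source]
    if language not in by_language:
        raise ValueError(f"source {source!r} has no list for {language!r}")
    return list(by_language[language])


def sources_for(language):
    """List the sources that provide a word list for `language`, sorted by name."""
    return sorted(
        source
        for source, by_language in STOPWORDS_BY_SOURCE.items()
        if language in by_language
    )
```

With this layout the same language ID can map to different lists depending on the source, and asking for a language that a source does not cover fails explicitly instead of silently falling back to another list.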

kbenoit commented 6 years ago

Done in dev-quanteda now.