rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Add StopWordFilter #78

Open rth opened 4 years ago

rth commented 4 years ago

Add the StopWordFilter struct to filter stop words, as an example of a TokenProcessor trait implementation that takes in an iterator and returns an iterator of strings (following the discussion in https://github.com/rth/vtext/issues/21).
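For illustration only, here is a rough sketch of what an iterator-in, iterator-out filter could look like. The trait shape, the `process` method name, the `new` constructor and the use of a `HashSet` are all assumptions made for this example; the trait that ends up in vtext may well look different.

```rust
use std::collections::HashSet;

/// Hypothetical trait shape, assumed for this sketch only.
pub trait TokenProcessor {
    /// Takes an iterator of tokens and returns an iterator of tokens.
    fn process<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = &'a str> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a;
}

/// Drops every token that appears in a user-provided stop word set.
pub struct StopWordFilter {
    stop_words: HashSet<String>,
}

impl StopWordFilter {
    /// Builds a filter from an explicit list of stop words (hypothetical constructor).
    pub fn new(words: &[&str]) -> Self {
        StopWordFilter {
            stop_words: words.iter().map(|w| w.to_string()).collect(),
        }
    }
}

impl TokenProcessor for StopWordFilter {
    fn process<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = &'a str> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a,
    {
        // Lazy filtering: nothing is evaluated until the caller consumes the iterator.
        // Matching is exact and case-sensitive in this sketch.
        Box::new(tokens.filter(move |tok| !self.stop_words.contains(*tok)))
    }
}
```

With this sketch, `StopWordFilter::new(&["the", "a"]).process("the cat sat on a mat".split(' '))` would yield the tokens `cat`, `sat`, `on`, `mat`. Returning a boxed iterator keeps the trait simple at the cost of one allocation per call; a generic return type would avoid that, but makes the trait harder to express.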

TODO

joshlk commented 4 years ago

> decide what should be the default stop word list: either take an English stop word list from somewhere (e.g. spacy), or ask users to explicitly provide one.

I think it's useful to include sensible defaults such as a stop word list. It allows people to experiment with the package quicker. But it's also important not to be English/European language centric.

How about having separate defaults for different languages? For example:

StopWordFilter::default("en")

Additionally, I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).
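To make the idea a bit more concrete, a per-language lookup keyed by two-letter code might be sketched as below. The function name `default_stop_words` and the word lists are made up for this example (the lists are tiny placeholders, not real stop word lists), and returning `None` for an unknown code is just one possible design.

```rust
// Placeholder lists for illustration only; real lists would be much longer
// and would need to come from a vetted source.
const EN: &[&str] = &["a", "an", "and", "of", "the"];
const FR: &[&str] = &["de", "et", "la", "le", "les"];

/// Hypothetical helper: returns the default stop words for a two-letter
/// language code, or `None` if the language is not supported.
pub fn default_stop_words(lang: &str) -> Option<&'static [&'static str]> {
    match lang {
        "en" => Some(EN),
        "fr" => Some(FR),
        _ => None,
    }
}
```

One thing to keep in mind: the standard `Default::default()` trait method takes no arguments, so `StopWordFilter::default("en")` could not come from the `Default` trait. It would have to be an inherent associated function, possibly under a different name to avoid confusion.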

rth commented 4 years ago

> I think it's useful to include sensible defaults such as a stop word list. It allows people to experiment with the package quicker. But it's also important not to be English/European language centric.

Absolutely. It's just that there is no clear consensus on what a stop word list should include or exclude, and when one is provided people tend to use it without thinking too much (see e.g. this paper). I agree we can include stop word lists for a few common world languages.

> Additionally, I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).

+1

joshlk commented 4 years ago

Interesting paper. Might be worth including a standard stop word list from spacy but add a note in the documentation that refers to the paper.