Open rth opened 4 years ago
Decide what the default stop word list should be: either take an English stop word list from somewhere (e.g. spaCy), or ask users to explicitly provide one.
I think it's useful to include sensible defaults such as a stop word list. It allows people to experiment with the package quicker. But it's also important not to be English/European language centric.
How about having separate defaults for different languages? For example:
StopWordFilter::default("en")
Additionally, I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).
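To make the proposal concrete, here is a minimal sketch of per-language defaults keyed by ISO two-letter codes. Everything in it is an assumption for illustration, not the actual vtext API: the `for_language` and `is_stop_word` names are hypothetical, and the tiny word lists are placeholders rather than curated lists. The constructor is named `for_language` instead of `default` because the standard `Default::default` takes no arguments, and it returns an `Option` so that an unsupported language is an explicit case rather than a silent fallback to English.

```rust
use std::collections::HashSet;

/// Hypothetical stop word filter; a sketch, not the vtext implementation.
pub struct StopWordFilter {
    words: HashSet<String>,
}

impl StopWordFilter {
    /// Build a filter from an explicit, user-provided word list.
    pub fn new<I, S>(words: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        StopWordFilter {
            words: words.into_iter().map(Into::into).collect(),
        }
    }

    /// Look up a built-in list by ISO two-letter language code.
    /// Returns `None` for languages without a built-in list.
    /// The lists below are illustrative placeholders only.
    pub fn for_language(lang: &str) -> Option<Self> {
        let words: &[&str] = match lang {
            "en" => &["a", "an", "and", "the", "of", "to"],
            "fr" => &["le", "la", "les", "et", "de", "un", "une"],
            "de" => &["der", "die", "das", "und", "ein", "eine"],
            _ => return None,
        };
        Some(Self::new(words.iter().copied()))
    }

    /// Exact, case-sensitive membership test.
    pub fn is_stop_word(&self, token: &str) -> bool {
        self.words.contains(token)
    }
}
```

With this shape, `StopWordFilter::for_language("en")` gives a quick start for experimentation, while `StopWordFilter::new(...)` keeps the explicit, user-provided path available.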
> I think it's useful to include sensible defaults such as a stop word list. It allows people to experiment with the package quicker. But it's also important not to be English/European language centric.
Absolutely. It's just that there is no clear consensus on what a stop word list should include or exclude, and when one is provided, people tend to use it without thinking too much (see e.g. this paper). I agree we can include stop word lists for a few common world languages.
> Additionally, I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).
+1
Interesting paper. It might be worth including a standard stop word list from spaCy, with a note in the documentation that refers to the paper.
Add the StopWordFilter struct to filter stop words, as an example of a TokenProcessor trait implementation that takes in an iterator and returns an iterator of strings (following the discussion in https://github.com/rth/vtext/issues/21).
TODO
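For illustration, the iterator-in, iterator-out design described above could look roughly like the sketch below. The trait signature, the `process` method name, and the boxed return type are assumptions made for this example and may differ from what the PR implements. Matching here is exact and case-sensitive.

```rust
use std::collections::HashSet;

/// Hypothetical trait: takes an iterator of tokens and returns an
/// iterator of owned strings. The signature is an assumption.
pub trait TokenProcessor {
    fn process<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = String> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a;
}

/// Hypothetical stop word filter implementing the trait above.
pub struct StopWordFilter {
    stop_words: HashSet<String>,
}

impl StopWordFilter {
    /// Build a filter from an explicit list of stop words.
    pub fn new(words: &[&str]) -> Self {
        StopWordFilter {
            stop_words: words.iter().map(|w| w.to_string()).collect(),
        }
    }
}

impl TokenProcessor for StopWordFilter {
    fn process<'a, I>(&'a self, tokens: I) -> Box<dyn Iterator<Item = String> + 'a>
    where
        I: Iterator<Item = &'a str> + 'a,
    {
        // Lazily drop tokens found in the stop word set, then convert
        // the surviving tokens to owned strings.
        Box::new(
            tokens
                .filter(move |tok| !self.stop_words.contains(*tok))
                .map(|tok| tok.to_string()),
        )
    }
}
```

Because the output is itself an iterator, such a filter could be chained after a tokenizer, for example `filter.process(text.split_whitespace())`, without building intermediate vectors.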