quanteda / stopwords

Multilingual Stopword Lists in R

http://stopwords.quanteda.io

Add editing #33

Closed kbenoit closed 4 years ago

kbenoit commented 4 years ago

Adds:

stopwords_edit() and char_edit() for interactive editing of character vectors or stopword lists (as R lists)
char_remove() with the full quanteda pattern, valuetype, case_insensitive, etc. functionality. (@koheiw this is not stringi-based, just to keep the package lean, and since we are unlikely to be dealing with massive objects here that would need peak efficiency.)
re-export for magrittr::`%>%`

Updates the README pretty substantially.

Solves #14

Still needs tests, but I will wait for feedback first in case we change the functions. Once this is done, I think we should release the update as v2.0.

koheiw commented 4 years ago

I was not aware of #14 until now, but removing elements is very easy

setdiff(stopwords("en"), c("i", "me"))

Nobody needs to do

stopwords("en")[!stopwords("en") %in% c("i", "me")]

Is there any other reason to add stopwords_edit()?
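
As a minimal sketch of the two removal idioms compared above, the following uses a small hypothetical vector (`sw`) in place of `stopwords("en")`, so it runs without the package installed. It only illustrates that both forms give the same result when the same vector appears on both sides of the index.

```r
# Hypothetical stand-in for stopwords("en"); no package required.
sw <- c("i", "me", "my", "we", "the", "and")
drop <- c("i", "me")

# Set difference: drops the listed words (and would also drop duplicates).
by_setdiff <- setdiff(sw, drop)

# Logical indexing: the same vector must be used inside and outside `[`.
by_index <- sw[!sw %in% drop]

# Both approaches keep the same words in the same order.
identical(by_setdiff, by_index)
```
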

kbenoit commented 4 years ago

That's still possible of course, but char_remove() adds removal via pattern matching.

stopwords_edit() is an interactive solution. @stefan-mueller and I found in our workshops that this was a high-demand operation but also one that students found difficult (those new to R anyway).

koheiw commented 4 years ago

We have the stopwords package to prevent people from using random/arbitrary stopword lists. Don't you think these functions are against the goal of the package?

I think they are. char_remove() should be in quanteda if you really want it. char_remove/select() is only one line:

char <- stopwords::stopwords("en")

# remove
setdiff(char, unlist(quanteda::pattern2fixed("t*", char, valuetype = "glob", case_insensitive = TRUE)))

# select
unlist(quanteda::pattern2fixed("t*", char, valuetype = "glob", case_insensitive = TRUE))

char_select() would be useful for inspecting outputs from textstat_* and textmodel_*.
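
For readers without quanteda installed, the same glob-based select/remove pair can be approximated in base R. This is only a sketch under stated assumptions: `char` is a small hypothetical vector rather than the real stopword list, and `utils::glob2rx()` plus `grepl()` stand in for `pattern2fixed()`; it is not how `char_remove()` or `char_select()` are implemented.

```r
# Hypothetical stand-in for stopwords::stopwords("en").
char <- c("the", "This", "to", "a", "and", "it")

# Convert the glob "t*" into an anchored regular expression.
rx <- utils::glob2rx("t*")

# Case-insensitive match against every element.
hit <- grepl(rx, char, ignore.case = TRUE)

selected <- char[hit]   # elements matching the glob
removed  <- char[!hit]  # the vector with those elements removed
```

Keeping the logical vector `hit` around makes it easy to inspect which words a pattern would affect before actually dropping them.
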

kbenoit commented 4 years ago

I'm happy with moving the char_*() functions into quanteda; it would be consistent with our naming and our functionality there.

The idea behind the editing function is that many people do modify stopword lists, for instance by removing gendered pronouns, but find the sort of R code that we would use (such as what you wrote) to be beyond their ability. stopwords_edit() is a quick and easy way to let them modify any of our sources. Yes, we have approved versions (Snowball and now NLTK), but we also have totally random lists in the "iso" and "misc" sources.

A variant of this could be a nice addition to dictionaries by the way... dictionary_edit() for instance added to quanteda.

stefan-mueller commented 4 years ago

I agree with @kbenoit. Removing stopwords is one of the most frequent questions from beginners, and a custom function will make it straightforward to adjust a stopword list based on the texts or research question. This might also help and encourage users to check more closely which words they are keeping/removing. I really like the idea of adding these functions to quanteda, along with dictionary_edit() (another very useful function!).

koheiw commented 4 years ago

Then why don't you issue a PR to add char_remove/select() to quanteda? Let's discuss dictionary_edit() separately.

kbenoit commented 4 years ago

Fixed instead in