scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

More detailed instructions needed for making (non-English) stop word lists compatible #17292

Open MonikaBarget opened 4 years ago

MonikaBarget commented 4 years ago

This issue relates to #10735 re: the improvement of stop word lists.

As stated in the previous discussion, more detailed documentation on making custom stop-word lists compatible would be more helpful than an updated built-in stop-word list, since use cases can be very specific even within English.

One of my corpora is a collection of Hiberno-English letters that uses many outdated word forms as well as words derived from Irish Gaelic. The .txt files also contain occasional UTF-8 errors and orphaned XML tags inherited from the original documents.

While I acknowledge that removing "lb" for unresolved line-break tags may best be done in pre-processing, I am also having issues with word forms such as "we'll", "won't" or "'tis" which use apostrophes. Abbreviated words such as "oct" for "October" also seem to be problematic.
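A minimal sketch of that kind of pre-processing (the clean-up rules and patterns below are illustrative, not taken from the actual corpus) can be passed to the vectorizer through its preprocessor argument, so that stray tags and entities are stripped before tokenization:

import re
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(doc):
    # hypothetical clean-up rules; adjust the patterns to the actual corpus
    doc = re.sub(r"</?\w+\s*/?>", " ", doc)   # orphaned XML tags such as <lb/>
    doc = re.sub(r"&\w+;?", " ", doc)         # unresolved entities such as &amp
    return doc.lower()                        # a custom preprocessor replaces the default lowercasing

vectorizer = CountVectorizer(preprocessor=clean_text, stop_words=["tis", "oct"])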

I have changed my stop-word list multiple times to include word forms both with and without apostrophes, but although both "October" and "Oct" are on my list, "oct" is still being ignored.

Here is a sample script that might help to update the documentation:

import numpy as np

from sklearn import decomposition
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Tirconaill tír st sráid Oct 30th 22 A Chait A chara Dhílis 20th I you your me mine faith faithful faithfully ye get got we'll I'll 'tis le length legacy tá tú bhí sé sí &amp amp am"]
my_stopwords = ["'tis", "tis", "a", "amp", "30th", "Oct", "22", "ye", "I", "you", "20th", "me", "get", "st", "sráid", "we'll", "ll", "le", "tú", "bhí", "faithfully", "tír"]

# quick check of the document-term counts with the custom stop-word list
print(CountVectorizer(stop_words=my_stopwords).fit_transform(docs).A)

# build the document-term matrix
vectorizer = CountVectorizer(input='content', stop_words=my_stopwords)
dtm = vectorizer.fit_transform(docs).toarray()
vocab = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

print(dtm.shape)
print(vocab)
print(len(vocab))

# topic modelling with NMF
num_topics = 2
num_top_words = 15

clf = decomposition.NMF(n_components=num_topics, random_state=1)
doctopic = clf.fit_transform(dtm)

# collect the highest-weighted words for each topic
topic_words = []
for topic in clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

print(topic_words)

The output I got was this:

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
(1, 17)
['am' 'chait' 'chara' 'dhílis' 'faith' 'faithful' 'got' 'legacy' 'length' 'mine' 'oct' 'sé' 'sí' 'tirconaill' 'tá' 'we' 'your']
17
[['chait', 'mine', 'sé', 'tá', 'got', 'sí', 'chara', 'length', 'we', 'tirconaill', 'your', 'am', 'legacy', 'faithful', 'oct'], ['faith', 'dhílis', 'oct', 'faithful', 'legacy', 'am', 'your', 'tirconaill', 'we', 'length', 'chara', 'sí', 'got', 'tá', 'sé']]

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['oct', 'we'] not in stop_words.
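One way to see why that warning appears (a quick check, not part of the original script) is to run the vectorizer's own analyzer over the stop-word entries: "Oct" is lowercased to "oct" and "we'll" is split at the apostrophe, so the spellings on the list never match the tokens the vectorizer actually produces:

from sklearn.feature_extraction.text import CountVectorizer

# default settings: lowercase=True and a token_pattern that drops single-character pieces
analyzer = CountVectorizer().build_analyzer()
for word in ["Oct", "we'll", "'tis", "won't"]:
    print(word, "->", analyzer(word))
# Oct -> ['oct'], we'll -> ['we', 'll'], 'tis -> ['tis'], won't -> ['won']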

MonikaBarget commented 4 years ago

PS: the most difficult word to remove is perhaps "won't", as it comes out as "won" when "won't" is on the stop word list. To remove "won't", my stop word list would need to include "won", which I would like to keep as the past tense of "win".

INPUT:

import numpy as np

from sklearn import decomposition
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Oct october lb gt st this is a sample script to test if it all works just fine won't"]
my_stopwords = ["Oct", "'tis", "October", "lb", "gt", "st", "won't"]

# quick check of the document-term counts with the custom stop-word list
print(CountVectorizer(stop_words=my_stopwords).fit_transform(docs).A)

# build the document-term matrix
vectorizer = CountVectorizer(input='content', stop_words=my_stopwords)
dtm = vectorizer.fit_transform(docs).toarray()
vocab = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

print(dtm.shape)
print(vocab)
print(len(vocab))

# topic modelling with NMF
num_topics = 2
num_top_words = 15

clf = decomposition.NMF(n_components=num_topics, random_state=1)
doctopic = clf.fit_transform(dtm)

# collect the highest-weighted words for each topic
topic_words = []
for topic in clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

print(topic_words)

OUTPUT:

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
(1, 15)
['all' 'fine' 'if' 'is' 'it' 'just' 'oct' 'october' 'sample' 'script' 'test' 'this' 'to' 'won' 'works']
15
[['is', 'this', 'fine', 'won', 'sample', 'works', 'it', 'test', 'if', 'script', 'october', 'to', 'all', 'just', 'oct'], ['oct', 'just', 'all', 'to', 'october', 'script', 'if', 'test', 'it', 'works', 'sample', 'won', 'fine', 'this', 'is']]

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['oct', 'october', 'tis', 'won'] not in stop_words.

amueller commented 4 years ago

@MonikaBarget you can change the tokenizer to include the apostrophe by changing the token_pattern. I thought we added that to the docstrings but I can't find it right now. r"\b\w[\w']+\b" or something like that should work.
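A sketch of what that could look like (the pattern and word lists below only illustrate the suggestion; they are not from the thread): with apostrophes allowed inside tokens, "we'll" and "won't" survive as single tokens and can be matched by their lowercase spellings on the stop list:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Oct we'll won't win 'tis fine"]
# token_pattern that keeps apostrophes inside a token; a leading apostrophe is still
# stripped, so "'tis" has to be listed as "tis"
vectorizer = CountVectorizer(token_pattern=r"\b\w[\w']+\b",
                             stop_words=["oct", "we'll", "won't", "tis"])
vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names())   # ['fine', 'win']  (get_feature_names_out() in >= 1.0)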

Apparently we decided against including it in the docs: #7008

amueller commented 4 years ago

I think the main issue you're running up against is that stop words are matched after normalization, so they need to be lowercase. That probably needs to be pointed out more explicitly in the docs, as lowercase=True by default.
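A minimal illustration of that point (this snippet is not from the thread): with the default lowercase=True, only the lowercase spelling of a stop word ever matches:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Oct October oct win"]

# capitalised entries never match, because every token is lowercased before the stop list is applied
print(CountVectorizer(stop_words=["Oct", "October"]).fit_transform(docs).A)   # [[2 1 1]] plus a UserWarning
# the lowercase forms do match, and the tokens are removed
print(CountVectorizer(stop_words=["oct", "october"]).fit_transform(docs).A)   # [[1]] -- only "win" is left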

rth commented 4 years ago

There are generally two issues here,

MonikaBarget commented 4 years ago

Thanks so much, that is very helpful. @amueller , yes, pointing out the case-sensitivity would be great. I think MALLET also requires lower-case stopwords by default, but I didn't think of it. And I will check out the tokenizer recommendations in the comments. Shall we close the issue?

jnothman commented 4 years ago

We know this is a tricky space to get right. We got stuck on it for a while, too, and don't feel it's entirely resolved. (See this paper by some of us.) I really appreciate having your challenges documented in public so that others can learn from them.

Would you like to offer small improvements to our documentation regarding the interaction between stop_words and lowercase (and tokenizer) before we close?

MonikaBarget commented 4 years ago

@jnothman : yes, I would very much like to make a suggestion.

MonikaBarget commented 4 years ago

Brief update: I have experimented with the default stopword lists in NLTK and built my own multilingual stopword lists based on theirs. I have to say that those worked best for my historical sources, and I found ingesting my own stopwords via NLTK really easy. What I did was search for the nltk_data folder on my computer, which has "corpora" and "stopwords" subfolders. In "stopwords", I simply added my own lists and could call them by name from the standard NLTK stopword function.
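A sketch of that workflow, assuming a custom list has been saved into the nltk_data/corpora/stopwords folder under an invented file name such as hiberno_english (the name is purely illustrative):

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# nltk_data/corpora/stopwords/hiberno_english is a hypothetical custom file,
# one lowercase word per line, loaded the same way as the bundled languages
my_stopwords = stopwords.words("hiberno_english")

vectorizer = CountVectorizer(stop_words=my_stopwords)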

And I have also experimented with extended text pre-processing prior to tokenization ... and with no tokenization at all. Early modern German in particular, which has very different spelling and obscure verb forms, was a pain to handle with almost any tokenizer I tried.

It will still take me some time to come up with suggestions for your documentation, but I am working on it.