guo-yong-zhi / WordCloud.jl

word cloud generator in julia
MIT License

Remove stop words? #7

Closed: findmyway closed this issue 3 years ago

findmyway commented 3 years ago

I just gave it a try here: https://discourse.julialang.org/t/seven-lines-of-julia-examples-sought/50416/117?u=findmyway

It seems that some stop words, like "will", appear in the picture. Can we remove those by default? I think the same goes for Chinese words.

guo-yong-zhi commented 3 years ago

There are default stop word lists, WordCloud.stopwords_en, WordCloud.stopwords_cn and WordCloud.stopwords, yet "will" is not included. I suppose that's because "will" can also be a notional word (a noun or a verb with its own meaning). You can customize the list:

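using WordCloud
using HTTP
url = "https://pretalx.com/juliacon2021/featured/"
# Download the page and convert its HTML to plain text
text = url |> HTTP.get |> HTTP.body |> String |> html2text
# Add "will" to the default English stop words before building the word cloud
wc = wordcloud(processtext(text, stopwords = WordCloud.stopwords_en ∪ ["will"]))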

findmyway commented 3 years ago

I see. Thanks!

By the way, what is the source of these stop words? I believe "will" is usually removed by libraries like NLTK.

guo-yong-zhi commented 3 years ago

I "borrowed" the list from Python's wordcloud package 😄. It's a very short list, and maybe not the best one. As for Chinese, things get more complicated. Do you know of a better list?

findmyway commented 3 years ago

Yeah, for Chinese, the result depends on having a good word breaker (word segmenter). The Chinese stop words look fine after a quick glance.

guo-yong-zhi commented 3 years ago

And what about the English stop words (which is what I really meant to ask)? It looks like you know of a better list.

findmyway commented 3 years ago

You can take a look at the stop words used in Elasticsearch:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html#analysis-stop-tokenfilter-stop-words-by-lang

At least "will" is included in it.
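
For readers who want to see what extending a stop word list does without installing WordCloud.jl, here is a minimal sketch in plain Julia. It is only an illustration under stated assumptions: `default_stopwords` is a made-up stand-in (not the real contents of WordCloud.stopwords_en), and `countwords` is a hypothetical helper, not part of any package.

```julia
# Hypothetical stand-in for a default stop word list; this is NOT the
# actual content of WordCloud.stopwords_en.
default_stopwords = Set(["the", "a", "of", "and", "to", "in"])

# Extend the defaults with extra words, mirroring
# `WordCloud.stopwords_en ∪ ["will"]` from the comment above.
stopwords = default_stopwords ∪ ["will"]

# Hypothetical helper: count the words in `text`, ignoring case
# and skipping anything found in `stopwords`.
function countwords(text::AbstractString, stopwords)
    counts = Dict{String,Int}()
    for m in eachmatch(r"\w+", lowercase(text))
        word = String(m.match)
        word in stopwords && continue  # drop stop words
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

counts = countwords("The talk will cover Julia, and Julia will be fun", stopwords)
```

With this input, "will" no longer appears among the counted words, while "julia" is counted twice. A longer list, such as the Elasticsearch one linked above, could presumably be merged in the same way with `∪`.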