Closed findmyway closed 3 years ago
There are default stopword lists, WordCloud.stopwords_en, WordCloud.stopwords_cn, and WordCloud.stopwords, yet "will" is not included in them. I suppose that's because "will" can also be a notional word.
You can customize it:
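using WordCloud
using HTTP
# Fetch the page and convert the HTML to plain text
url = "https://pretalx.com/juliacon2021/featured/"
text = url |> HTTP.get |> HTTP.body |> String |> html2text
# Extend the default English stopwords with "will" before building the cloud
wc = wordcloud(processtext(text, stopwords = WordCloud.stopwords_en ∪ ["will"]))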
I see. Thanks!
By the way, what is the source of these stopwords? I believe "will" is usually removed in libraries like NLTK.
I "borrowed" it from Python's wordcloud package 😄. It's a very short list and maybe not the best one. As for Chinese, things get more complicated. Do you know of a better list?
Yeah, for Chinese, the result depends on a good word segmenter. The Chinese stop words look fine at a quick glance.
And what about English stop words (what I really meant to ask)? It looks like you know something better.
You can take a look at the stopwords used in ElasticSearch:
At least "will" is contained in it.
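To make the suggestion concrete, here is a minimal sketch of filtering tokens with a larger English list. It uses only Base Julia. The constant `ES_ENGLISH_STOPWORDS` and the helper `remove_stopwords` are hypothetical names introduced for illustration, and the word list is copied by hand from what I understand to be the default `_english_` stop words in Elasticsearch, so please check it against the upstream list before relying on it.

```julia
# Assumption: hand-copied from the Elasticsearch/Lucene default `_english_` list;
# verify against the upstream source before use.
const ES_ENGLISH_STOPWORDS = Set([
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
])

# Hypothetical helper: keep only the tokens that are not stop words,
# comparing case-insensitively.
remove_stopwords(tokens, stopwords) =
    [t for t in tokens if lowercase(t) ∉ stopwords]

tokens = split("JuliaCon will be held online and it is free")
kept = remove_stopwords(tokens, ES_ENGLISH_STOPWORDS)
println(kept)
```

With WordCloud.jl loaded, the same set could presumably be merged into the defaults in the way shown earlier in this thread, e.g. `stopwords = WordCloud.stopwords_en ∪ ES_ENGLISH_STOPWORDS`.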
I just gave it a try here: https://discourse.julialang.org/t/seven-lines-of-julia-examples-sought/50416/117?u=findmyway
It seems some stop words like "will" appear in the picture. Can we remove those by default? I think the same goes for Chinese words.