undertheseanlp / underthesea

Underthesea - Vietnamese NLP Toolkit
http://undertheseanlp.com
GNU General Public License v3.0
1.41k stars 273 forks

Stopwords or no ? #193

Closed. hiimdoublej closed this issue 6 years ago

hiimdoublej commented 6 years ago

Dear developers, thank you for developing this project. It works really well, but I am unsure about one thing: do I need to apply a stopwords filter before tokenizing Vietnamese sentences with underthesea? I have found a Vietnamese stopwords list here, but I don't know whether I should use it. So my questions are:

  1. Does using a stopwords list like this improve my corpus? I don't understand Vietnamese, so I can't really tell.
  2. Do you filter out stopwords when pre-processing raw Vietnamese data?

Any help will be appreciated. Thanks.

rain1024 commented 6 years ago

About the first question: what is your corpus for? Is it for word segmentation, POS tagging, named entity recognition, or classification?

At the moment, I think all corpora for these problems use raw text (without pre-processing).

As for the second question, I don't filter stopwords when pre-processing. Instead, I use a feature extraction technique (such as TF-IDF) to extract the important words in classification problems.
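To illustrate why TF-IDF can stand in for a stopwords filter, here is a minimal pure-Python sketch. It is not underthesea's own code: the function name `tfidf_scores` and the toy documents are made up for illustration, and the input is assumed to be already segmented (for example with underthesea's `word_tokenize`). It uses the plain `tf * log(N / df)` weighting, which is only one of several TF-IDF variants.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Compute TF-IDF for a list of token lists (hypothetical helper).

    Returns one {token: score} dict per document, using
    tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: the number of documents each token appears in.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            tok: (count / total) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return scores


# Toy, already-segmented documents (assumed example data).
docs = [
    ["tôi", "là", "sinh_viên"],
    ["tôi", "là", "giáo_viên"],
    ["tôi", "thích", "bóng_đá"],
]
scores = tfidf_scores(docs)

# Print the tokens of the first document from most to least informative.
for token, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{token}\t{score:.3f}")
```

A token that occurs in every document (here "tôi") gets a weight of zero, so very frequent function words are down-weighted automatically without a separate stopwords list.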

hiimdoublej commented 6 years ago

My corpus is used for word segmentation only; I'm trying to collect trending search keywords on a certain Vietnamese website. So are you suggesting that I do not need to filter out the stop words before segmenting my corpus? Thanks again.

rain1024 commented 6 years ago

If you are constructing a Vietnamese word segmentation corpus in the online newspaper domain, there are some high-quality corpora already available. Please check them in the word_tokenize repository of underthesea.

If you are building a corpus from free text (such as posts in forums or social networks), I think you need to filter out some junk characters (such as emoticons, punctuation marks...), but you shouldn't remove stop words.
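A rough sketch of that kind of cleanup, using only the Python standard library, could look like the following. The helper name `strip_junk` and the sample strings are hypothetical, and the rule used here (drop every character in a Unicode punctuation, symbol, or control category) is just one possible definition of "junk"; it also removes things you might want to keep, such as decimal points or hyphens.

```python
import unicodedata


def strip_junk(text):
    """Replace punctuation, symbols (incl. most emoji) and control characters
    with spaces, then collapse whitespace. Stop words are left untouched.
    Hypothetical helper, not part of underthesea."""
    # NFC makes Vietnamese letters with diacritics single code points,
    # so they are classified as letters and kept.
    text = unicodedata.normalize("NFC", text)

    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        is_variation_selector = 0xFE00 <= ord(ch) <= 0xFE0F
        # P* = punctuation, S* = symbols (most emoji are "So"),
        # C* = control/format characters (e.g. zero-width joiner).
        if category[0] in "PSC" or is_variation_selector:
            kept.append(" ")
        else:
            kept.append(ch)

    # split() + join collapses the runs of spaces left behind.
    return " ".join("".join(kept).split())


cleaned = strip_junk("Hôm nay trời đẹp quá!!! 😍😍 #hanoi")
print(cleaned)

# The cleaned text can then be segmented, assuming underthesea is installed:
# from underthesea import word_tokenize
# tokens = word_tokenize(cleaned)
```

Note that common Vietnamese function words such as "và" or "là" pass through unchanged; only the non-word characters are removed before segmentation.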

hiimdoublej commented 6 years ago

@rain1024 Thank you for the detailed reply! I will try your suggestions, thank you!