Closed: hiimdoublej closed this issue 6 years ago
Regarding the first question: what is your corpus? Is it for a word segmentation, POS tagging, named entity recognition, or classification problem?
At the moment, I think all corpora for these problems use raw text (without pre-processing).
As for the second question: I don't filter stopwords during pre-processing, but I do use feature extraction techniques (such as tf-idf) to extract important words for classification problems.
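To illustrate the idea, here is a minimal tf-idf sketch in pure Python (a real project would more likely use a library such as scikit-learn's `TfidfVectorizer`; the function and variable names below are my own, not underthesea's API). Terms that occur in every document get a weight of zero, which is exactly why stopword removal is often unnecessary when tf-idf is used:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of pre-tokenized documents.

    A minimal sketch for illustration only, not a production
    implementation (no smoothing, no normalization).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # tf-idf = term frequency * inverse document frequency
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    ["hoc", "sinh", "hoc", "sinh", "hoc"],
    ["hoc", "tieng", "viet"],
]
w = tfidf(docs)
# "hoc" appears in every document, so its idf (and tf-idf) is 0;
# document-specific terms like "tieng" get a positive weight.
```

Because ubiquitous terms score zero, frequent stopwords are automatically down-weighted without an explicit stopword list.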
My corpus is used for word segmentation only; I'm trying to collect trending search keywords from a certain Vietnamese website. So, are you suggesting that I would not need to filter out the stop words before segmenting my corpus? Thanks again.
If you are constructing a Vietnamese word segmentation corpus in the online newspaper domain, there are some high-quality corpora already available. Please check them in the word_tokenize repository of underthesea.
If you are building a corpus from free text (such as posts on forums or social networks), I think you need to filter out some junk characters (such as emoticons, punctuation marks, etc.), but you shouldn't remove stop words.
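A rough cleaning pass like this can be sketched with the standard `re` module. The character classes below are an assumption on my part: they keep word characters (including Vietnamese diacritics, since Python's `\w` is Unicode-aware) and whitespace, while dropping emoji, punctuation, and other symbols — and they deliberately leave stop words untouched:

```python
import re

# Anything that is not a word character or whitespace (emoji,
# punctuation marks, symbols), plus the underscore, counts as junk.
JUNK = re.compile(r"[^\w\s]|_")

def clean(text):
    # Replace junk with spaces, then collapse repeated whitespace.
    return re.sub(r"\s+", " ", JUNK.sub(" ", text)).strip()

print(clean("Tiếng Việt rất hay!!! 😀 :)"))
# → "Tiếng Việt rất hay"  (diacritics and the stop word "rất" survive)
```

The cleaned text can then be passed on to the segmenter; whether to keep digits or certain symbols depends on the downstream task.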
@rain1024 Thank you for the detailed reply! I will try your suggestions, thank you!
Dear developers, thanks for developing this project; it's working really well, except for one thing: I don't know whether I need to apply a stopword filter before tokenizing Vietnamese sentences with underthesea. I have found a Vietnamese stopwords list here but just don't know if I should use it or not. So my questions are