memect / hao

好东西传送门

Stopwords: this discussion is very worthwhile. Tomorrow 小门 will help compile a roundup; experts, please keep going #242

Closed haoawesome closed 9 years ago

haoawesome commented 10 years ago

http://www.weibo.com/5220650532/Bp5joiZta

haoawesome commented 10 years ago

Concept

http://en.wikipedia.org/wiki/Stop_words In computing, stop words are words which are filtered out before or after processing of natural language data (text). There is not one definite list of stop words which all tools use and such a filter is not always used. Some tools specifically avoid removing them to support phrase search.

haoawesome commented 10 years ago

@AixinSG: Compared with regular web pages or news, I think stop words matter more in user-generated content, and I now lean toward keeping every word in the index. Stop stopping stop words: a look at Common Terms Query http://t.cn/Rh8DFRh (Sep 28, 08:24)

http://www.weibo.com/1025887594/Bp2RkCBrH

http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/ Stop stopping stop words: a look at Common Terms Query

haoawesome commented 10 years ago

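昊奋: Stop words are a relative concept, meant to denote words that carry little or no real meaning, so removing them has little effect on understanding or processing. Personally, I think that rather than worrying about removing stop words, it is better to work on keyword identification. (Sep 28, 09:23)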

http://www.weibo.com/2045933955/Bp3f3mn6K

haoawesome commented 10 years ago

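章成志: Yes, it depends on the setting. The concept of "stop words" actually comes from tasks such as information retrieval and text classification, where many of the words with low discriminative power (low idf) are stop words. For tasks such as sentiment classification, some of these words not only should not be stopped but are actually very important. //@昊奋: Stop words are a relative concept, meant to denote words that carry little or no real meaning; removing these words, for understanding or processing… (Sep 28, 09:32) http://www.weibo.com/1810879314/Bp3iSB7JF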
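The "low idf" point can be made concrete with a small sketch. It assumes the plain definition idf(t) = log(N / df(t)) with no smoothing; the function names, the toy documents, and the threshold are all illustrative and do not come from the thread.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def low_idf_terms(documents, threshold):
    """Terms whose idf is at or below the (hypothetical) threshold, sorted."""
    idf = inverse_document_frequency(documents)
    return sorted(term for term, value in idf.items() if value <= threshold)


# Toy review snippets, made up for illustration.
docs = [
    "the movie is not good".split(),
    "the plot is good".split(),
    "the acting is not bad".split(),
]
stop_candidates = low_idf_terms(docs, threshold=0.5)
```

On this toy corpus the low-idf candidates include "not" and "good" alongside "the" and "is", which is exactly the kind of word a sentiment classifier would want to keep.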

haoawesome commented 10 years ago

AixinSG: Lucene's built-in English stop word list is fairly conservative, only thirty-odd words: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. Some stop word lists contain more than 500 words http://t.cn/RhRNxkk

http://www.weibo.com/1025887594/Bp5N3tItB

http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
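As a rough illustration of what filtering with such a short list does, here is a minimal sketch using the 33 words quoted above. It is plain Python rather than Lucene's own analyzer, and the constant and function names are made up for this example.

```python
# The 33 English stop words quoted in the comment above.
LUCENE_ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or "
    "such that the their then there these they this to was will with".split()
)


def remove_stop_words(tokens, stop_words=LUCENE_ENGLISH_STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


filtered = remove_stop_words("This is not a good movie".split())
```

Even this conservative list removes "not", so "not a good movie" and "a good movie" reduce to the same tokens, which fits the earlier remark that some stop words are important for tasks like sentiment classification.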

haoawesome commented 10 years ago

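西瓜大丸子汤: "Big data" is now a stop word for me, but "super big data" is not [giggle]. I have also found that for machine learning, "computer", "artificial intelligence", and "data mining" all need to go on the stop word list. //@AixinSG: Lucene's built-in English stop word list is fairly conservative (Sep 28, 16:36)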

http://www.weibo.com/1932835417/Bp65003pA

haoawesome commented 10 years ago

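数盟社区: [nose-picking] You sure have a lot of stop words //@西瓜大丸子汤: "Big data" is now a stop word for me, but "super big data" is not [giggle]. I have also found that for machine learning, "computer", "artificial intelligence", and "data mining" all need to go on the stop word list. //@AixinSG: Lucene's built-in English stop word list is fairly conservative http://www.weibo.com/3847741679/Bp691F1A5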

haoawesome commented 10 years ago

http://members.unine.ch/jacques.savoy/Papers/SavoyStopList.pdf When Stopword Lists Make the Difference: “For the English language, a short stopword list (9 words) usually results in performance levels similar to a longer one (571 words)”

haoawesome commented 10 years ago

http://google.com/patents/US8352469 Automatic generation of stop word lists for information retrieval and analysis

US 8352469 B2 Abstract Methods and systems for automatically generating lists of stop words for information retrieval and analysis. Generation of the stop words can include providing a corpus of documents and a plurality of keywords. From the corpus of documents, a term list of all terms is constructed and both a keyword adjacency frequency and a keyword frequency are determined. If a ratio of the keyword adjacency frequency to the keyword frequency for a particular term on the term list is less than a predetermined value, then that term is excluded from the term list. The resulting term list is truncated based on predetermined criteria to form a stop word list.
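The abstract's steps can be sketched roughly as follows. This is only one reading of the abstract, not the patented method: it assumes "keyword adjacency frequency" counts how often a term sits immediately next to a keyword occurrence, "keyword frequency" counts how often it occurs inside one, and the final truncation keeps the most frequent terms. The function name, `min_ratio`, and `max_size` are hypothetical.

```python
from collections import Counter


def generate_stop_words(documents, keywords, min_ratio=1.0, max_size=10):
    """documents: lists of tokens; keywords: tuples of tokens (assumed shapes)."""
    keyword_set = {tuple(keyword) for keyword in keywords}
    lengths = sorted({len(keyword) for keyword in keyword_set})
    term_freq, adjacency_freq, keyword_freq = Counter(), Counter(), Counter()

    for tokens in documents:
        term_freq.update(tokens)
        for start in range(len(tokens)):
            for length in lengths:
                end = start + length
                if end > len(tokens):
                    continue
                if tuple(tokens[start:end]) not in keyword_set:
                    continue
                # Terms inside a keyword occurrence.
                keyword_freq.update(tokens[start:end])
                # Terms immediately left and right of the occurrence.
                if start > 0:
                    adjacency_freq[tokens[start - 1]] += 1
                if end < len(tokens):
                    adjacency_freq[tokens[end]] += 1

    # Exclude terms whose adjacency/keyword ratio is below min_ratio.
    # Assumption: terms never adjacent to a keyword are excluded as well.
    candidates = [
        term for term in term_freq
        if adjacency_freq[term] > 0
        and adjacency_freq[term] >= min_ratio * keyword_freq[term]
    ]
    # Assumed truncation criterion: keep the max_size most frequent terms.
    candidates.sort(key=lambda term: (-term_freq[term], term))
    return candidates[:max_size]


# Toy corpus and keywords, made up for illustration.
docs = [
    "the quality of stop word lists affects the retrieval of documents".split(),
    "the analysis of text uses stop word lists".split(),
]
keywords = [("stop", "word", "lists"), ("retrieval",), ("analysis",),
            ("text",), ("documents",)]
stop_words = generate_stop_words(docs, keywords, max_size=3)
```

The intuition in this reading is that words which mostly appear next to keywords, rather than inside them, behave like connective filler and make good stop word candidates.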

haoawesome commented 10 years ago

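Discussion 242, partial roundup post https://github.com/memect/hao/issues/242 Added one paper, When Stopword Lists Make the Difference. A fun finding: in English, a 9-word stopword list performs about as well as a list of 500-plus words, and French is similar. As for Chinese ... we hope the experts will say more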

http://www.weibo.com/5220650532/Bpe3p9Ien?mod=weibotime