memect / hao

好东西传送门 (Good Stuff Portal)

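@V井颠V: For Chinese-language medium-to-long articles (roughly 300 to 5,000 characters), what are good models or methods for automatically classifying them by domain, at about the granularity of a news portal's channels? Has anyone in China reached relatively high accuracy in practice? #146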

Closed haoawesome closed 10 years ago

haoawesome commented 10 years ago

http://web.mit.edu/6.863/www/fall2012/projects/writeups/newspaper-article-classifier.pdf Document Classification for Newspaper Articles (MIT course project)

haoawesome commented 10 years ago

http://dl.acm.org/citation.cfm?id=1487024 Emotion Classification of Online News Articles from the Reader's Perspective (2008) WIC

haoawesome commented 10 years ago

http://en.wikipedia.org/wiki/Topic_model Statistical topic models. ChengXiang Zhai at UIUC has deep expertise in this area, and both Stanford and UMass provide software packages.

haoawesome commented 10 years ago

Software / Libraries
Mallet (software project): http://mallet.cs.umass.edu/
Stanford Topic Modeling Toolkit: http://nlp.stanford.edu/software/tmt/tmt-0.4/
Gensim - Topic Modeling for Humans: http://radimrehurek.com/gensim/
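To make the idea of a statistical topic model concrete, here is a minimal sketch of LDA trained with collapsed Gibbs sampling, written in plain Python. It is not how Mallet, the Stanford toolkit, or Gensim implement it; the function name `fit_lda`, the hyperparameter defaults, and the toy English documents are all assumptions for illustration. Real Chinese articles would first need word segmentation, which is not shown here.

```python
import random

def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (theta, phi, vocab): per-document topic mixtures,
    per-topic word distributions, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]   # times word v is assigned to topic k
    topic_total = [0] * K                      # total tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]
    return theta, phi, vocab

# Toy corpus (hypothetical): two finance-like and two sports-like documents.
docs = [
    "stock market shares bank market".split(),
    "bank shares stock profit bank".split(),
    "match goal team league goal".split(),
    "team league match coach team".split(),
]
theta, phi, vocab = fit_lda(docs, num_topics=2)
for d, mixture in enumerate(theta):
    print(d, [round(p, 2) for p in mixture])
```

For channel-level classification, one common option is to feed each article's topic mixture (`theta` here) into an ordinary supervised classifier trained on labeled articles, since the topics themselves are unsupervised and will not line up with portal channels on their own.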

haoawesome commented 10 years ago

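小平与六便士: The paper leaves many points unclear, for example how to choose the thresholds for dc and gamma. Unless you truly understand what the data means, it is probably hard to get good classification results, and the method does not deliver fully automatic classification the way the paper claims. Still, the idea of considering density and distance together is a good one. //@我爱机器学习: Skepticism everywhere; it feels like the author is struggling to fend it off.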

http://weibo.com/2304615331/BclGr5bpL
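The comment does not name the paper, but the symbols dc and gamma match density-peak style clustering, where each point gets a local density (neighbors within a cutoff distance dc), a distance to the nearest denser point, and a score gamma equal to their product. The sketch below assumes that reading; the function name `density_peak_scores`, the toy points, and the chosen `dc` are hypothetical. It also shows the commenter's concern: both `dc` and the number of centers are picked by hand.

```python
import math

def density_peak_scores(points, dc):
    """Return (rho, delta, gamma) for each point.

    rho:   number of other points closer than the cutoff distance dc
    delta: distance to the nearest point with strictly higher rho
           (for the densest points, the distance to their farthest point)
    gamma: rho * delta; large values suggest cluster centers
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < dc)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma

# Two small square-shaped groups, each with one point in the middle.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
]
rho, delta, gamma = density_peak_scores(points, dc=1.2)

# Manually chosen: take the two highest-gamma points as cluster centers.
centers = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2]
print(centers)
```

On this toy data the two middle points stand out clearly, but on real text features the gamma values rarely separate so cleanly, which is why the threshold choice matters.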

haoawesome commented 10 years ago

@王威廉: A KDD 2014 paper from Twitter: Large-Scale High-Precision Topic Modeling on Twitter http://t.cn/RPgclet. It claims 93% precision. That said, it does look like a very practical, industry-grade topic modeling solution.

http://weibo.com/1657470871/Bk2H32WrR

http://www.eeshyang.com/papers/KDD14Jubjub.pdf
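As a reminder of what a precision figure measures, the sketch below computes precision over only the predictions a classifier chooses to emit, together with coverage (the share of items that received a label). This is a generic illustration, not the paper's evaluation protocol; the function name, labels, confidences, and threshold are made up.

```python
def precision_and_coverage(predictions, gold, threshold):
    """predictions: list of (label, confidence); gold: list of true labels.

    Only predictions with confidence >= threshold are emitted.
    Returns (precision over emitted predictions, fraction of items labeled).
    """
    emitted = [
        (label, truth)
        for (label, conf), truth in zip(predictions, gold)
        if conf >= threshold
    ]
    if not emitted:
        return None, 0.0
    correct = sum(1 for label, truth in emitted if label == truth)
    return correct / len(emitted), len(emitted) / len(gold)

# Hypothetical classifier outputs and true channels.
predictions = [
    ("sports", 0.95), ("finance", 0.90), ("sports", 0.60),
    ("tech", 0.55), ("finance", 0.85),
]
gold = ["sports", "finance", "tech", "tech", "sports"]

strict = precision_and_coverage(predictions, gold, threshold=0.8)
loose = precision_and_coverage(predictions, gold, threshold=0.0)
print(strict, loose)
```

Changing the threshold changes both numbers at once, so a precision figure is easier to interpret when the coverage at that operating point is reported alongside it.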

haoawesome commented 10 years ago

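Q: @V井颠V: For Chinese medium-to-long articles (300 to 5,000 characters), are there good models or methods for automatic classification at roughly the granularity of a news portal's channels? A: Collected resources: http://memect.co/T4iScgq. Consider statistical topic models; the short tutorial by ChengXiang Zhai (UIUC) is recommended: http://weibo.com/5220650532/BhWo26Y93. Software packages: Gensim, Mallet, Stanford. KDD 2014 has a good paper on topic classification at Twitter. Additions are welcome.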

http://www.weibo.com/5220650532/BmNjFtkeg?ref=