baojie commented 10 years ago

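@好东西传送门 Are there any synthetic two-dimensional toy datasets for imbalanced data classification? I have some preliminary ideas about undersampling and would like to check them on toy data of this kind. Two dimensions mainly because the data can be visualized and is easy to inspect.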

http://www.weibo.com/1855519363/Bgph3cHlX?mod=weibotime
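As a rough illustration of the kind of toy data asked about, here is a minimal sketch that builds a 2-D imbalanced dataset from two Gaussian blobs and applies plain random undersampling. The function names, class sizes, and blob parameters are all assumptions chosen for illustration; nothing here comes from the thread.

```python
import random

def make_toy_imbalanced(n_major=1000, n_minor=50, seed=0):
    """Return (points, labels) for two 2-D Gaussian blobs.

    Label 0 is the majority class, label 1 the minority class.
    All sizes and blob parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n_major):
        points.append((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
        labels.append(0)
    for _ in range(n_minor):
        points.append((rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)))
        labels.append(1)
    return points, labels

def random_undersample(points, labels, seed=0):
    """Randomly drop majority-class points until both classes have equal size.

    Assumes label 1 is the smaller class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [points[i] for i in kept], [labels[i] for i in kept]

points, labels = make_toy_imbalanced()
sub_points, sub_labels = random_undersample(points, labels)
print(len(labels), sum(labels), len(sub_labels), sum(sub_labels))
```

Because every point is an `(x, y)` pair, the original and undersampled sets can be passed directly to any scatter-plot tool for visual comparison.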

haoawesome commented 10 years ago

Discussion

AixinSG: In my understanding, undersampling has limited effect overall.

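刘知远THU: For imbalanced data classification, how should one handle the case where there are very many labeled positive examples, almost no labeled negative examples, and a large amount of unlabeled data? This situation is very common in relation extraction. For now, the only option is to sample randomly from the unlabeled data and treat the samples as negatives.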
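The random-sampling baseline described above can be sketched as follows. This is only an illustration under assumed names (`sample_pseudo_negatives`, the toy item IDs); it is not code from any participant.

```python
import random

def sample_pseudo_negatives(unlabeled, n_negatives, seed=0):
    """Draw items from the unlabeled pool and treat them as negatives.

    Caveat: some sampled items may be unlabeled positives, so the
    resulting negative set is noisy by construction.
    """
    rng = random.Random(seed)
    pool = list(unlabeled)
    return rng.sample(pool, min(n_negatives, len(pool)))

# Hypothetical toy data: three labeled positives, 100 unlabeled items.
positives = ["p1", "p2", "p3"]
unlabeled = [f"u{i}" for i in range(100)]

negatives = sample_pseudo_negatives(unlabeled, n_negatives=len(positives))
training_set = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
print(training_set)
```

The ratio of sampled negatives to positives is a free choice here; the sketch uses 1:1 only to keep the example small.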

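xierqi: I surveyed this area for a while. About 90% of the work is sampling, and the biggest problem is that the evaluation methods do not fit real-world scenarios. I personally recommend Domingos's MetaCost; it is very practical, and setting the costs from experience is enough. http://t.cn/RPiexE9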
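To make "just set the costs" concrete, the sketch below shows the minimum-expected-cost decision rule that cost-sensitive methods such as MetaCost build on. It covers only this one step: MetaCost itself also estimates the class probabilities with a bagged ensemble, relabels the training set with this rule, and retrains a classifier. The cost values and the function name are illustrative assumptions.

```python
def min_expected_cost_class(probs, cost):
    """Pick the class with the lowest expected cost.

    probs[j]   : estimated P(class j | x)
    cost[i][j] : cost of predicting class i when the true class is j
    """
    n = len(probs)
    expected = [sum(probs[j] * cost[i][j] for j in range(n)) for i in range(n)]
    return min(range(n), key=lambda i: expected[i])

# Assumed costs: class 0 = majority, class 1 = minority.
# Missing a minority example (predict 0, truth 1) costs 10;
# a false alarm (predict 1, truth 0) costs 1.
cost = [[0.0, 10.0],
        [1.0, 0.0]]

print(min_expected_cost_class([0.8, 0.2], cost))    # expected costs: 2.0 vs 0.8
print(min_expected_cost_class([0.95, 0.05], cost))  # expected costs: 0.5 vs 0.95
```

With these costs, an example is assigned to the minority class even when its estimated probability is only 0.2, which is how the cost matrix compensates for imbalance without resampling.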

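eacl_newsmth: In relation extraction, are there really that many positive examples and no negatives? I would have thought that in many cases positives are limited while negatives are plentiful (of course, you could argue that negatives are actually hard to define).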

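刘知远THU: Reply to @eacl_newsmth: For example, a knowledge graph can supply many positive examples, but negatives have to be generated by randomly replacing an entity in a positive example, and this easily ends up treating examples that are also correct as negatives.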
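A minimal sketch of this entity-replacement scheme is shown below, with a filter that rejects candidates already known to be true. The function name, the retry limit, and the toy triples are assumptions for illustration. The filter only removes known positives; a corrupted triple that is true but missing from the graph still slips through, which is exactly the problem raised above.

```python
import random

def corrupt_triple(triple, entities, known_triples, rng, max_tries=100):
    """Build a negative by replacing the head or the tail with a random entity.

    Candidates that appear in known_triples are rejected. Returns None
    if no acceptable candidate is found within max_tries attempts.
    """
    head, relation, tail = triple
    for _ in range(max_tries):
        replacement = rng.choice(entities)
        if rng.random() < 0.5:
            candidate = (replacement, relation, tail)
        else:
            candidate = (head, relation, replacement)
        if candidate not in known_triples:
            return candidate
    return None

# Hypothetical toy knowledge graph.
entities = ["Paris", "France", "Berlin", "Germany"]
known = {("Paris", "capital_of", "France"),
         ("Berlin", "capital_of", "Germany")}

rng = random.Random(0)
negative = corrupt_triple(("Paris", "capital_of", "France"), entities, known, rng)
print(negative)
```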

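eacl_newsmth: Reply to @刘知远THU: Right, I guessed you would bring up that example, which is why I added that it depends on how you define negatives, haha. I struggled with this for a long time too, and eventually concluded that positives are still the scarce ones. Besides, in many cases can you really guarantee that the positives are correct?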

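刘知远THU: Reply to @eacl_newsmth: The positives are basically correct, for example those from Freebase, but the negatives have a large impact on the results. :) A paper from MSRA at this year's AAAI, on the TransH model, proposes a trick for selecting negatives that works remarkably well.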
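Assuming the trick referred to is the Bernoulli sampling scheme described in the TransH paper, its core quantity can be sketched as follows: for each relation, compute the average number of tails per head (tph) and heads per tail (hpt), then corrupt the head with probability tph / (tph + hpt) and the tail otherwise. The function name and the toy triples are illustrative assumptions.

```python
from collections import defaultdict

def head_replace_probability(triples):
    """Return {relation: probability of corrupting the head}.

    tph = average number of distinct tails per head for the relation
    hpt = average number of distinct heads per tail for the relation
    probability = tph / (tph + hpt)
    """
    tails_of = defaultdict(set)  # (relation, head) -> set of tails
    heads_of = defaultdict(set)  # (relation, tail) -> set of heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)

    tph_counts = defaultdict(list)
    hpt_counts = defaultdict(list)
    for (r, _head), tails in tails_of.items():
        tph_counts[r].append(len(tails))
    for (r, _tail), heads in heads_of.items():
        hpt_counts[r].append(len(heads))

    probs = {}
    for r in tph_counts:
        tph = sum(tph_counts[r]) / len(tph_counts[r])
        hpt = sum(hpt_counts[r]) / len(hpt_counts[r])
        probs[r] = tph / (tph + hpt)
    return probs

# Hypothetical one-to-many relation: one head, three tails.
triples = [("USA", "has_city", "NYC"),
           ("USA", "has_city", "LA"),
           ("USA", "has_city", "Boston")]
probs = head_replace_probability(triples)
print(probs)
```

For a one-to-many relation like this one, replacing the tail is more likely to produce another true fact than replacing the head, so the scheme shifts probability toward corrupting the head and reduces false negatives.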

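eacl_newsmth: Reply to @刘知远THU: Right, the instances in the KB are indeed correct, but the samples found in massive document collections on the basis of those instances are not necessarily correct. Judging from current work, many approaches that focus on the negatives do improve results somewhat. Last year a student at the language institute (语言所) exploited properties of the "relation" to select better training samples, and that also improved performance. But for this particular problem, the reliability of the positives cannot be sidestepped.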

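刘知远THU: Reply to @eacl_newsmth: Could you tell me the title of the paper you mentioned? What I am focusing on now is not extracting relations from text but knowledge graph completion, which is somewhat like link prediction on a graph, except that the links to predict are relations of different types.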

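eacl_newsmth: Reply to @刘知远THU: http://t.cn/RPX75A3 Right. I watched a talk by a young colleague of yours, and it seemed closely related to Sebastian's earlier work; or was that just how he presented it? When are you back in Beijing? We could have a proper discussion.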

haoawesome commented 10 years ago

search keywords

Positive only Imbalanced data

readings

http://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf (recommended by @xierqi) Domingos, MetaCost: A General Method for Making Classifiers Cost-Sensitive

http://www.aclweb.org/anthology/P/P13/P13-2141.pdf (recommended by @eacl_newsmth) Towards Accurate Distant Supervision for Relational Facts Extraction

http://cseweb.ucsd.edu/~elkan/posonly.pdf Learning Classifiers from Only Positive and Unlabeled Data

http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf He, H., & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.

http://www.computer.org/csdl/proceedings/icnc/2008/3304/04/3304d192-abs.html Guo, X., Yin, Y., Dong, C., Yang, G., & Zhou, G. (2008). On the Class Imbalance Problem. 2008 Fourth International Conference on Natural Computation (pp. 192-201).

tools

http://www.nltk.org/_modules/nltk/classify/positivenaivebayes.html nltk

http://weka.wikispaces.com/MetaCost Weka

haoawesome commented 10 years ago

datasets

http://pages.cs.wisc.edu/~dpage/kddcup2001/ Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin

https://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table UCI dataset repo, classification category

haoawesome commented 10 years ago

I have put together a draft on imbalanced data classification; please check whether anything else should be added: https://github.com/memect/hao/blob/master/awesome/imbalanced-data-classification.md

Related discussion record: https://github.com/memect/hao/issues/47

haoawesome commented 10 years ago

[Resource roundup] Imbalanced data classification: http://memect.co/hIYTr7R Classic papers: MetaCost (Domingos, 1999), SMOTE (Chawla, 2002), and the 2004 survey by Yanjun Qi of CMU (now a professor at UVA); tools and datasets (WEKA, NLTK); SMOTE implementations on GitHub. Thanks to @AixinSG @刘知远THU @xierqi @eacl_newsmth

http://www.weibo.com/5220650532/BiZQEloKK?ref=#_rnd1408426979569

haoawesome commented 10 years ago

好东西传送门: Reply to @朱小强_Bigeye_THU: http://t.cn/RP8jyzY "The most interesting compromise in terms of model complexity and AUC is MetaCost using PART as the base classification algorithm. AdaBoost yields higher AUC values but high complexity models."

memect / hao

@eastone01 Are there any synthetic two-dimensional toy datasets for imbalanced data classification? #47
