memect / hao

好东西传送门 (Good Stuff Portal)

Daniel_NEURO2NLP: Hello! Is there any research or application on mining topic hierarchies (similar to a domain knowledge graph, but a DAG with only the hypernym-hyponym relation)? Thanks! #199

Closed. haoawesome closed this issue 10 years ago.

haoawesome commented 10 years ago

Direct message.

haoawesome commented 10 years ago

Concept: topic hierarchy

http://en.wikipedia.org/wiki/Hyponymy_and_hypernymy In linguistics, a hyponym is a word or phrase whose semantic field is included within that of another word, its hypernym (sometimes spelled hyperonym outside of the natural language processing community). Computer science often terms this relationship an "is-a" relationship. For example, the phrase "Red is-a colour" can be used to describe the hyponymic relationship between red and colour.
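
The is-a relation can be queried directly in WordNet. Below is a minimal sketch using NLTK's WordNet interface (assuming nltk is installed and its wordnet corpus has been downloaded; the chosen synset is just for illustration):

```python
# Minimal sketch: querying hypernyms/hyponyms in WordNet via NLTK.
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

red = wn.synset('red.n.01')              # the colour sense of "red"
print(red.hypernyms())                   # its direct hypernym(s)
print(red.hypernyms()[0].hyponyms())     # sibling colours under that hypernym

# Hypernym paths up to the root expose the DAG structure from the question:
# a synset can sit on more than one path.
for path in red.hypernym_paths():
    print(' -> '.join(s.name() for s in path))
```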

Related concepts: ontology learning, statistical relational learning

http://en.wikipedia.org/wiki/Ontology_learning Ontology learning (ontology extraction, ontology generation, or ontology acquisition) is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between those concepts from a corpus of natural language text, and encoding them with an ontology language for easy retrieval.

http://en.wikipedia.org/wiki/Statistical_relational_learning Statistical relational learning (SRL) is a subdiscipline of artificial intelligence and machine learning that is concerned with models of domains that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure.

http://en.wikipedia.org/wiki/Hierarchical_clustering In data mining, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters.
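
As a minimal sketch of the technique (SciPy's agglomerative clustering on toy vectors; the data and parameters are made up for illustration):

```python
# Minimal sketch: bottom-up (agglomerative) hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))             # toy data: 20 items, 5 features

Z = linkage(X, method='average')         # merge tree (the dendrogram)
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)
```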

haoawesome commented 10 years ago

http://digitalcommons.fiu.edu/etd/1517/ Towards Next Generation Vertical Search Engines, Li Zheng, PhD thesis. A fairly comprehensive survey of the field.

haoawesome commented 10 years ago

https://www.cs.princeton.edu/~blei/papers/BleiGriffithsJordan2009a.pdf David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57, 2, Article 7 (February 2010).
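
The path-drawing step at the heart of the nested CRP is easy to simulate: each document walks down from the root, at every node reusing an existing child with probability proportional to how many earlier documents chose it, or opening a new child with probability proportional to a parameter gamma. A toy sketch of just this generative step (illustrative names and structure, not the paper's implementation; topic inference is omitted):

```python
# Toy sketch of nCRP path sampling: documents share a tree, popular
# branches get reused, and new branches open with probability ~ gamma.
import random
from collections import defaultdict

def ncrp_path(tree, node, depth, gamma=1.0):
    """Sample one root-to-leaf path of length `depth`, updating child counts."""
    path = [node]
    for _ in range(depth):
        children = tree[node]                    # dict: child -> visit count
        total = sum(children.values()) + gamma
        r = random.uniform(0, total)
        for child, count in children.items():
            r -= count
            if r <= 0:
                node = child                     # reuse a popular child
                break
        else:
            node = (node, len(children))         # open a brand-new child
        children[node] = children.get(node, 0) + 1
        path.append(node)
    return path

tree = defaultdict(dict)                         # node -> {child: count}
for _ in range(5):
    print(ncrp_path(tree, 'root', depth=3))
```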


haoawesome commented 10 years ago

http://dl.acm.org/citation.cfm?id=1858805 Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2010. Global learning of focused entailment graphs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10).

haoawesome commented 10 years ago

http://dl.acm.org/citation.cfm?id=2002866 Yves Petinot, Kathleen McKeown, and Kapil Thadani. 2011. A hierarchical model of web summaries. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2 (HLT '11)

Quote from the paper:

2 Related Work. While several efforts have focused on the DMOZ corpus, often as a reference for Web summarization tasks (Berger and Mittal, 2000; Delort et al., 2003) or Web clustering tasks (Ramage et al., 2009b), very little research has attempted to make use of its hierarchy as is. The work by Sun et al. (2005), where the DMOZ hierarchy is used as a basis for a hierarchical lexicon, is closest to ours although their contribution is not a full-fledged content model, but a selection of highly salient vocabulary for every category of the hierarchy. The problem considered in this paper is connected to the area of Topic Modeling (Blei and Lafferty, 2009) where the goal is to reduce the surface complexity of text documents by modeling them as mixtures over a finite set of topics. While the inferred models are usually flat, in that no explicit relationship exists among topics, more complex, non-parametric, representations have been proposed to elicit the hierarchical structure of various datasets (Hofmann, 1999; Blei et al., 2010; Li et al., 2007). Our purpose here is more specialized and similar to that of Labeled LDA (Ramage et al., 2009a) or Fixed hLDA (Reisinger and Paşca, 2009) where the set of topics associated with a document is known a priori. In both cases, document labels are mapped to constraints on the set of topics on which the (otherwise unaltered) topic inference algorithm is to be applied. Lastly, while most recent developments have been based on unsupervised data, it is also worth mentioning earlier approaches like Topic Signatures (Lin and Hovy, 2000) where words (or phrases) characteristic of a topic are identified using a statistical test of dependence. Our first model extends this approach to the hierarchical setting, building actual topic models based on the selected vocabulary.

haoawesome commented 10 years ago

http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2003_AA03.pdf Hierarchical topic models and the nested Chinese restaurant process (NIPS 2003), D. Blei, T. Griffiths, M. Jordan, J. Tenenbaum.

haoawesome commented 10 years ago

http://cs.brown.edu/~th/papers/Hofmann-IJCAI99.pdf The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data (IJCAI '99), T. Hofmann.


haoawesome commented 10 years ago

http://research.microsoft.com/en-us/um/people/shliu/brt.pdf Automatic Taxonomy Construction from Keywords


haoawesome commented 10 years ago

Microsoft ProBase

http://research.microsoft.com/en-us/projects/probase/statistics.aspx Microsoft's ProBase

Table 1: Scale comparison of several open domain taxonomies

name           # of concepts   # of isA pairs
Freebase       1,450           24,483,434
WordNet        25,229          283,070
WikiTaxonomy   111,654         105,418
YAGO           352,297         8,277,227
DBPedia        259             1,900,000
ResearchCyc    ≈ 120,000       < 5,000,000
KnowItAll      N/A             < 54,753
TextRunner     N/A             < 11,000,000
OMCS           173,398         1,030,619
NELL           123             < 242,453
Probase        2,653,872       20,757,545
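
Whatever their scale, each of these resources boils down to a set of isA pairs, which is exactly the hypernym-hyponym DAG from the original question. A minimal sketch of loading and querying such pairs (networkx assumed; the example pairs are made up):

```python
# Minimal sketch: an is-a taxonomy as a DAG, with transitive queries.
import networkx as nx

G = nx.DiGraph()
isa_pairs = [('red', 'colour'), ('crimson', 'red'),
             ('colour', 'visual property'), ('dog', 'animal')]
G.add_edges_from(isa_pairs)              # each edge is (hyponym, hypernym)

print(nx.descendants(G, 'crimson'))      # all transitive hypernyms of crimson
print(nx.ancestors(G, 'colour'))         # all transitive hyponyms of colour
print(nx.is_directed_acyclic_graph(G))   # sanity check: taxonomy stays a DAG
```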

haoawesome commented 10 years ago

How to Grow a Mind: Statistics, Structure, and Abstraction (Science 2011)

http://www.sciencemag.org/content/331/6022/1279.short

PDF: http://web.mit.edu/cocosci/Papers/tkgg-science11-reprint.pdf

In coming to understand the world —in learning concepts, acquiring language, and grasping causal relations — our minds make inferences that appear to go far beyond the data available. How do we do it? This review describes recent approaches to reverse-engineering human learning and cognitive development and, in parallel, engineering more humanlike machine learning systems. Computational models that perform probabilistic inference over hierarchies of flexibly structured representations can address some of the deepest questions about the nature and origins of human thought: How does abstract knowledge guide learning and reasoning from sparse data? What forms does our knowledge take, across different domains and tasks? And how is that abstract knowledge itself acquired?


Image source: http://research.microsoft.com/en-us/projects/probase/probase.apr.2013.pdf (talk, 2013)

haoawesome commented 10 years ago

http://www.eeshyang.com/papers/KDD14Jubjub.pdf Large-Scale High-Precision Topic Modeling on Twitter (KDD '14)


haoawesome commented 10 years ago

Q: Research and applications on mining topic hierarchies? A: http://memect.co/oSkqJ-V Early on there was the CAM model (IJCAI '99); more recently, Blei's work based on "Bayesian nonparametric inference", Berant's "entailment graphs", and Microsoft's ProBase. Twitter uses one for classification (KDD '14). For a cognitive-science view, see "How to Grow a Mind" (Science '11). Corrections welcome.

http://www.weibo.com/5220650532/BnvY6x7Oq?ref=

darcher005 commented 10 years ago

http://arxiv.org/pdf/1210.6738.pdf Nested hierarchical Dirichlet processes (2014). Following the line of "Hierarchical topic models and the nested Chinese restaurant process", here is a newer paper with fewer model restrictions. Abstract: We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP generalizes the nested Chinese restaurant process (nCRP) to allow each word to follow its own path to a topic node according to a document-specific distribution on a shared tree. This alleviates the rigid, single-path formulation assumed by the nCRP, allowing documents to easily express complex thematic borrowings. We derive a stochastic variational inference algorithm for the model, which enables efficient inference for massive collections of text documents. We demonstrate our algorithm on 1.8 million documents from The New York Times and 2.7 million documents from Wikipedia.


haoawesome commented 10 years ago

@darcher005 Thanks a lot, this one is great! How about posting it on Weibo for everyone? We will help pass it along. Let us know by direct message.