
PLDA: Parallel Latent Dirichlet Allocation in C++
http://openbigdatagroup.github.io/plda
Apache License 2.0

Introduction

Welcome to PLDA.

PLDA is a parallel C++ implementation of Latent Dirichlet Allocation (LDA) [3,4], described in [2]. It provides a highly optimized parallel implementation of the Gibbs sampling algorithm [4] for the training and inference of LDA. The carefully designed architecture supports extensions of this algorithm.
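For orientation, the update performed in each sampling step can be sketched in a few dozen lines of C++. The following is a minimal single-threaded illustration of collapsed Gibbs sampling as in [4], not PLDA's actual source; all type and variable names here are hypothetical:

```cpp
// Minimal sketch of one collapsed Gibbs sampling sweep for LDA,
// following Griffiths & Steyvers [4]. Illustrative only; not PLDA code.
#include <random>
#include <vector>

struct Corpus {
  std::vector<std::vector<int>> docs;  // docs[d]: word ids of document d
  int num_words;                       // vocabulary size V
};

struct Sampler {
  int K;         // number of topics
  double alpha;  // document-topic smoothing
  double beta;   // topic-word smoothing
  std::vector<std::vector<int>> topic_of;    // topic_of[d][i]: topic of token i in doc d
  std::vector<std::vector<int>> doc_topic;   // doc_topic[d][k]: n_dk
  std::vector<std::vector<int>> word_topic;  // word_topic[w][k]: n_wk
  std::vector<int> topic_count;              // topic_count[k]: n_k
  std::mt19937 rng{42};

  void SweepOnce(const Corpus& c) {
    const int V = c.num_words;
    std::vector<double> p(K);
    for (size_t d = 0; d < c.docs.size(); ++d) {
      for (size_t i = 0; i < c.docs[d].size(); ++i) {
        const int w = c.docs[d][i];
        const int old_k = topic_of[d][i];
        // Remove this token's current assignment from all counts.
        --doc_topic[d][old_k]; --word_topic[w][old_k]; --topic_count[old_k];
        // Full conditional: p(k) proportional to
        // (n_wk + beta) / (n_k + V*beta) * (n_dk + alpha).
        double sum = 0;
        for (int k = 0; k < K; ++k) {
          p[k] = (word_topic[w][k] + beta) / (topic_count[k] + V * beta) *
                 (doc_topic[d][k] + alpha);
          sum += p[k];
        }
        // Draw the new topic proportionally to p.
        std::uniform_real_distribution<double> u(0, sum);
        double r = u(rng);
        int new_k = 0;
        while (new_k < K - 1 && (r -= p[new_k]) > 0) ++new_k;
        // Add the token back under its new topic.
        topic_of[d][i] = new_k;
        ++doc_topic[d][new_k]; ++word_topic[w][new_k]; ++topic_count[new_k];
      }
    }
  }
};
```

PLDA distributes documents across MPI workers and merges the word-topic counts after each sweep (the AD-LDA scheme of [6], as described in [2]).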

We will also release an enhanced parallel implementation of LDA, named PLDA+ [1], which improves the scalability of LDA by significantly reducing the unparallelizable communication bottleneck and achieving good load balancing.

PLDA is free software; please see COPYING for details.

Requirements

PLDA must be run in a Linux environment with the g++ compiler and MPICH installed.

Quick Start

Installation

Install MPICH
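MPICH can be installed from your distribution's package manager or built from the upstream source release. A minimal sketch, assuming a Debian-family system (package names differ on other distributions):

```sh
# Debian/Ubuntu package install (assumption: the package is named mpich):
sudo apt-get install mpich

# Sanity-check that the MPI C++ wrapper and the launcher are on the PATH:
mpicxx --version
mpiexec --version
```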

Install PLDA
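A typical build from a checkout of this repository. The Makefile targets below (lda, mpi_lda, infer) follow the upstream documentation; verify them against the Makefile in your checkout:

```sh
git clone https://github.com/obdg/plda.git
cd plda
make    # builds the lda, mpi_lda, and infer binaries
```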

Data Format
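Per the upstream documentation, each line of the training file represents one document as alternating word/count pairs separated by spaces: <word1> <word1_count> <word2> <word2_count> and so on. A two-document file might look like this:

```
apple 2 orange 1 banana 3
computer 4 software 2 web 1
```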

Usage

Train
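A typical single-machine training run, followed by the equivalent parallel run under MPI. The flag names follow the upstream documentation; paths and hyperparameter values here are illustrative:

```sh
# Single machine:
./lda --num_topics 2 --alpha 0.1 --beta 0.01 \
    --training_data_file testdata/test_data.txt \
    --model_file /tmp/lda_model.txt \
    --burn_in_iterations 100 --total_iterations 150

# Parallel training with 5 MPI processes:
mpiexec -n 5 ./mpi_lda --num_topics 2 --alpha 0.1 --beta 0.01 \
    --training_data_file testdata/test_data.txt \
    --model_file /tmp/lda_model.txt --total_iterations 150
```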

Infer
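Given a trained model, infer estimates topic distributions for unseen documents. Again, the flag names follow the upstream documentation; the paths are illustrative:

```sh
./infer --alpha 0.1 --beta 0.01 \
    --model_file /tmp/lda_model.txt \
    --inference_data_file testdata/test_data.txt \
    --inference_result_file /tmp/inference_result.txt \
    --burn_in_iterations 10 --total_iterations 15
```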

Example

Here we provide a simple example that uses PLDA to train a topic model on New York Times news articles; the table below shows the most probable words for each topic.
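A run along the following lines produces such a model (nyt.txt is a hypothetical stand-in for the preprocessed articles in the data format above; the topic count and hyperparameters are illustrative):

```sh
# Hypothetical invocation: nyt.txt is a placeholder corpus name.
mpiexec -n 5 ./mpi_lda --num_topics 10 --alpha 0.1 --beta 0.01 \
    --training_data_file nyt.txt \
    --model_file nyt_model.txt \
    --total_iterations 150
```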

| Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| law | president | food | government | patient | com | company | team | show | school |
| case | election | cup | official | drug | web | million | game | film | student |
| court | political | water | military | doctor | computer | market | season | movie | children |
| lawyer | campaign | restaurant | attack | cell | site | stock | player | music | family |
| police | vote | oil | war | disease | www | business | coach | actor | home |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Citation

If you wish to publish any work based on PLDA, please cite our paper as:

Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, Maosong Sun, PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning. 2011. Software available at http://code.google.com/p/plda.

The BibTeX entry is:

@article{plda,
  author = {Zhiyuan Liu and Yuzhou Zhang and Edward Y. Chang and Maosong Sun},
  title = {PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing},
  year = {2011},
  journal = {ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning},
  note = {Software available at \url{https://github.com/obdg/plda}}
}

If you have any questions, please visit https://github.com/obdg/plda

References

[1] PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, Maosong Sun. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning. 2011.

http://dl.acm.org/citation.cfm?id=1961198

[2] PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. AAIM 2009.

http://dl.acm.org/citation.cfm?id=1574062

[3] Latent Dirichlet Allocation. Blei et al. JMLR (3), 2003.

http://ai.stanford.edu/~ang/papers/jair03-lda.pdf

[4] Finding Scientific Topics. Griffiths and Steyvers. PNAS (101), 2004.

http://www.pnas.org/content/101/suppl.1/5228.full.pdf

[5] Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation. Porteous et al. KDD 2008.

http://portal.acm.org/citation.cfm?id=1401960

[6] Distributed Inference for Latent Dirichlet Allocation. Newman et al. NIPS 2007.

http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf

Papers using PLDA code:

[7] Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior. Wen-Yen Chen et al. WWW 2009.

http://dl.acm.org/citation.cfm?id=1526801