gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0

Corex with large data #29

Closed: toth12 closed this issue 4 years ago

toth12 commented 4 years ago

I would like to use CorEx to extract topics from a 60-million-word corpus divided into chunks of 100 words. How can I scale CorEx so that it can cope with a document-term matrix with 600,000 rows?

gregversteeg commented 4 years ago

CorEx takes a SciPy sparse matrix as input (CSR format, I think; it's in the docstring). In sparse format, the entire dataset should fit in memory easily.
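To illustrate why the sparse format keeps this manageable, here is a minimal sketch that splits a token stream into fixed-size chunks and builds a binary document-term matrix in CSR format. The helper names (`chunk_tokens`, `build_binary_doc_term`, `csr_megabytes`) and the toy sentence are hypothetical and only for illustration. The commented-out CorEx call at the end assumes the `corextopic` package and the usage shown in its README; it is not run here. For the corpus in question, a 100-word chunk has at most 100 distinct words, so 600,000 rows give at most 60 million stored entries, which should come to a few hundred megabytes assuming 32-bit indices.

```python
import numpy as np
from scipy import sparse


def chunk_tokens(tokens, chunk_size=100):
    """Split a flat token list into consecutive chunks of chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def build_binary_doc_term(chunks):
    """Return (CSR matrix, vocabulary list) with 1 where a word occurs in a chunk."""
    vocab = {}
    rows, cols = [], []
    for row, chunk in enumerate(chunks):
        for word in set(chunk):  # set(): presence/absence only, no counts
            col = vocab.setdefault(word, len(vocab))
            rows.append(row)
            cols.append(col)
    data = np.ones(len(rows), dtype=np.uint8)
    X = sparse.csr_matrix((data, (rows, cols)), shape=(len(chunks), len(vocab)))
    words = sorted(vocab, key=vocab.get)  # column index -> word
    return X, words


def csr_megabytes(X):
    """Approximate memory held by the three CSR arrays, in MB."""
    return (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6


# Toy stand-in for the real corpus (hypothetical data).
tokens = "the cat sat on the mat while the dog sat on the rug".split()
chunks = chunk_tokens(tokens, chunk_size=4)
X, words = build_binary_doc_term(chunks)
print(X.shape, X.nnz, f"{csr_megabytes(X):.6f} MB")

# Assumed usage, following the corextopic README (not run here):
# from corextopic import corextopic as ct
# model = ct.Corex(n_hidden=50)
# model.fit(X, words=words)
```

For a real corpus, `chunk_size=100` would match the chunking described in the question, and `csr_megabytes(X)` gives a quick check of the footprint before fitting.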

Unfortunately, it's not easily adapted to batches. One person implemented a version of linear CorEx (as opposed to this one, which assumes everything is binary) in TensorFlow: https://github.com/hrayrhar/T-CorEx

toth12 commented 4 years ago

@gregversteeg Thanks for the quick reply. Please close the issue.