Closed toth12 closed 4 years ago
I would like to use CorEx to extract topics from a 60-million-word corpus divided into chunks of 100 words. I am wondering how to scale CorEx so that it can cope with a document-term matrix with 600,000 rows.

It takes as input the SciPy sparse matrix format (CSR format, I think; it's in the docstring). In sparse matrix format, the entire dataset should load into memory easily.

Unfortunately, it's not easily adapted to batches. One person implemented a version of linear CorEx (as opposed to this one, which assumes everything is binary) in TensorFlow: https://github.com/hrayrhar/T-CorEx

@gregversteeg thanks for the quick reply, please close the issue.
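A quick back-of-the-envelope sketch of why a CSR document-term matrix of that size should fit in memory. The dimensions below are assumptions for illustration (vocabulary size and distinct terms per chunk are not given in the thread): a CSR matrix stores only the nonzero entries, so the footprint scales with the number of nonzeros, not with rows × columns.

```python
import numpy as np
from scipy import sparse

# Tiny example of building a binary CSR document-term matrix
# from (row, col) coordinates — 3 documents, 5-term vocabulary.
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 2, 0])
vals = np.ones(4, dtype=np.int8)  # binary presence/absence
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 5))

# Rough memory estimate for the use case described above.
# All three sizes are hypothetical assumptions:
n_docs = 600_000    # 60M words / 100-word chunks
avg_terms = 80      # assumed distinct terms per 100-word chunk
nnz = n_docs * avg_terms

# CSR stores: data (int8 here) + column indices (int32) + row pointers (int32)
approx_bytes = nnz * (1 + 4) + (n_docs + 1) * 4
print(f"~{approx_bytes / 1e9:.2f} GB")  # a few hundred MB, well within RAM
```

So the full 600,000-row matrix is a few hundred megabytes in CSR form, which is why batching is usually unnecessary for this dataset size.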