roamanalytics / mittens

A fast implementation of GloVe, with optional retrofitting
Apache License 2.0

how to use sparse matrix with mittens #2

Closed huashan closed 5 years ago

huashan commented 6 years ago

I stored the co-occurrence matrix in MatrixMarket format and read it into Python with mmread(). Do I have to convert it to a dense matrix (which is impossible because of memory limits)? Or can Mittens handle this format directly?

ndingwall commented 6 years ago

Apologies for the slow reply - for some reason I haven't been receiving notifications.

At the moment, yes, you'd have to convert it to a dense format. If you can live with a reduced vocabulary (e.g. by discarding the rows and columns corresponding to the least frequent tokens), then that's great. If not, then you're in trouble: a key step is computing the difference between the log co-occurrence matrix (which could be sparse) and the product of W and the transpose of C (the learned word and context vectors). (See https://github.com/roamanalytics/mittens/blob/master/mittens/np_mittens.py#L109) Both W and C have to be dense, so that product, which has the same shape as the co-occurrence matrix, will be dense too. That means that if the dense co-occurrence matrix doesn't fit in memory, you'll hit that memory limit regardless of how you represent the input.
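As a rough illustration of the vocabulary-reduction route, here is a minimal sketch using SciPy. It is not part of Mittens: the helper name `reduce_vocabulary` is hypothetical, and it assumes that row sums of the co-occurrence matrix are an acceptable stand-in for token frequency. The toy matrix stands in for whatever `scipy.io.mmread()` returns.

```python
import numpy as np
from scipy import sparse


def reduce_vocabulary(cooccurrence, keep):
    """Keep the `keep` tokens with the largest total co-occurrence counts.

    Hypothetical helper. Row sums are used as a proxy for token frequency
    (an assumption); use real corpus counts if you have them.
    """
    csr = sparse.csr_matrix(cooccurrence)
    totals = np.asarray(csr.sum(axis=1)).ravel()
    # Indices of the `keep` largest totals, re-sorted to preserve token order.
    kept = np.sort(np.argsort(-totals, kind="stable")[:keep])
    # Slice rows and columns while still sparse, and only then densify.
    dense = csr[kept][:, kept].toarray()
    return dense, kept


# Toy 4-token matrix standing in for the result of scipy.io.mmread(...).
toy = sparse.coo_matrix(np.array([
    [0, 5, 1, 0],
    [5, 0, 2, 1],
    [1, 2, 0, 0],
    [0, 1, 0, 0],
], dtype=float))

reduced, kept_ids = reduce_vocabulary(toy, keep=3)
```

The returned `kept_ids` map rows of the reduced matrix back to the original vocabulary, so the learned vectors can be matched to their tokens afterwards.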

Since we ultimately only need gradients with the same shape as W and C (V × d), it might be possible to rewrite the code to compute those gradients in batches, but it would be slow(er).
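To make the batching idea concrete, here is a sketch of what row-batched gradients could look like. This is not how Mittens is implemented. It assumes a simplified objective, 0.5 * sum_ij f(X_ij) * (W_i . C_j - log X_ij)^2 with the usual GloVe weighting f, and it leaves out the bias terms. The function name and the `batch_size`, `xmax` and `alpha` parameters are illustrative.

```python
import numpy as np
from scipy import sparse


def batched_gradients(X, W, C, batch_size=2, xmax=100.0, alpha=0.75):
    """Gradients of the simplified weighted squared-error objective.

    Only a (batch_size, V) slice of the V x V difference matrix is held
    in memory at any time. Bias terms are omitted (an assumption made to
    keep the sketch short).
    """
    X = sparse.csr_matrix(X)
    grad_W = np.zeros_like(W)
    grad_C = np.zeros_like(C)
    for start in range(0, X.shape[0], batch_size):
        rows = slice(start, start + batch_size)
        block = X[rows].toarray().astype(float)   # dense (batch, V) counts
        nonzero = block > 0
        # GloVe-style weighting; zero counts automatically get zero weight.
        weights = np.minimum(block / xmax, 1.0) ** alpha
        # log of the counts, with zeros left at 0 instead of -inf.
        log_block = np.zeros_like(block)
        np.log(block, out=log_block, where=nonzero)
        diff = weights * (W[rows] @ C.T - log_block)
        grad_W[rows] = diff @ C          # each row of W is touched once
        grad_C += diff.T @ W[rows]       # C accumulates over all batches
    return grad_W, grad_C


# Small random example: V = 5 tokens, d = 3 dimensions.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 4, size=(5, 5)).astype(float)
W_toy = rng.normal(size=(5, 3))
C_toy = rng.normal(size=(5, 3))
grad_W, grad_C = batched_gradients(sparse.coo_matrix(X_toy), W_toy, C_toy)
```

Peak memory for the difference matrix drops from V × V to batch_size × V, at the cost of a Python-level loop over batches, which is where the slowdown comes from.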