maciejkula / glove-python

Toy Python implementation of http://www-nlp.stanford.edu/projects/glove/
Apache License 2.0

Out of core computation #27

Closed · maciejkula closed this 8 years ago

maciejkula commented 9 years ago

This PR allows out-of-core computation of both the co-occurrence matrix and word vectors by using memory-mapped numpy arrays.

This should result in memory usage that stays constant in the size of the training corpus, at the cost of a performance penalty that depends on the amount of memory available and the speed of the persistent storage used.
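For context, a minimal sketch of the kind of disk-backed array this relies on; the file name and shapes are made up for illustration, not taken from the PR:

```python
import numpy as np

# A disk-backed array: the data lives in the file, and the kernel pages in
# only what is actually touched, so RAM usage stays bounded.
n_words, dim = 100_000, 100
vectors = np.memmap("word_vectors.dat", dtype=np.float64,
                    mode="w+", shape=(n_words, dim))

# Reads and writes look like ordinary ndarray operations.
vectors[42] += 0.01 * np.random.rand(dim)

vectors.flush()  # push dirty pages back out to the file
```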

@piskvorky

piskvorky commented 9 years ago

Cool! So you went with memmap rather than streaming?

I'll check this on wiki again :)

maciejkula commented 9 years ago

I guess there are two ways of looking at it.

From the 'training GloVe on a prepared matrix' side, streaming seems like a natural thing to do, as we only need to see co-occurrence pairs one at a time. But then memmap is streaming from disk, just with a convenient interface and the kernel doing all the hard work. I think the alternative of having the user pass in a pair iterator would be quite slow, as we would incur Python-interfacing costs at every iteration. Parallelism would be tricky as well.

From the 'preparing the co-occurrence matrix' side, true streaming is ruled out by the global nature of the algorithm, as every sentence processed can modify any point in the co-occurrence matrix. So I keep going over the entire matrix, and memmap is just a convenient interface for doing so.
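To make the 'memmap as streaming from disk' point concrete, here is a hypothetical chunked pass over co-occurrence triples stored as memory-mapped parallel arrays (the file names, dtypes and chunk size are invented for the example):

```python
import numpy as np

# Hypothetical on-disk COO layout: parallel arrays of
# (row index, column index, co-occurrence count).
rows   = np.memmap("cooc_rows.dat",   dtype=np.int32,   mode="r")
cols   = np.memmap("cooc_cols.dat",   dtype=np.int32,   mode="r")
counts = np.memmap("cooc_counts.dat", dtype=np.float64, mode="r")

chunk = 1_000_000
total = 0.0
for start in range(0, len(counts), chunk):
    stop = start + chunk
    # Each slice is read from disk on demand; the full matrix is never
    # materialised in RAM. Real training code would feed these slices
    # to the SGD routine instead of just summing them.
    total += counts[start:stop].sum()
```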

piskvorky commented 9 years ago

I meant streaming over vectors (~co-occurrence matrix rows), as the GloVe algorithm seemed intrinsically stream-like in nature: processing one input vector at a time. Restricting the stream format to NumPy arrays only is fine of course, just less flexible.

I think building the co-occurrence matrix should be easy to stream too, via map/reduce, but then the question is, what framework to use? Extra dependencies &c...

Or maybe just simulate M/R via disk memory and file sorting (sort & uniq), just like Levy & Goldberg did in their implementation? Point is not to keep the entire matrix in RAM at any one point.
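For illustration, a rough sketch of that sort-and-count idea (in the spirit of Levy & Goldberg); the file names, the five-word window and the call out to the Unix sort command are assumptions for the example, not part of this PR:

```python
import subprocess
from itertools import groupby

corpus = [["the", "quick", "brown", "fox"],
          ["the", "lazy", "dog"]]          # toy stand-in for a corpus stream

# Map: emit one "context target" line per co-occurrence event while
# scanning the corpus; nothing is accumulated in RAM.
# (Real GloVe also weights pairs by distance; omitted here.)
with open("pairs.txt", "w") as out:
    for sentence in corpus:
        for i, word in enumerate(sentence):
            for context in sentence[max(0, i - 5):i]:
                out.write(f"{context} {word}\n")

# Shuffle: an external sort keeps memory bounded regardless of corpus size.
subprocess.run(["sort", "-o", "pairs_sorted.txt", "pairs.txt"], check=True)

# Reduce: identical pairs are now adjacent, so counting is one streaming
# pass -- the `uniq -c` step.
with open("pairs_sorted.txt") as f, open("cooccurrences.txt", "w") as out:
    for pair, group in groupby(line.rstrip("\n") for line in f):
        out.write(f"{pair}\t{sum(1 for _ in group)}\n")
```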

Anyway, just random ideas.

The implementation is starting to look less and less like a "toy"; maybe it's time to reconsider the GitHub tagline :)

maciejkula commented 9 years ago

Hmm, I always thought of it as working one entry at a time. One problem with going a vector at a time is that you'd be operating on the row word all the time: a problem for lock-free parallel SGD, since its vector would keep getting overwritten.

Building the co-occurrence matrix is stupidly parallelizable: you just do it separately over parts of the corpus and add the matrices at the end (assuming you use a consistent dictionary). But how to do this cleanly is not entirely clear to me; I am tempted to add a note about this parallelism to a docstring and leave the implementation to the user.
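A rough sketch of that scheme, just to illustrate it (this is not how the PR or glove-python does it; multiprocessing, SciPy sparse matrices, the toy vocabulary and the five-word window are all assumptions):

```python
from multiprocessing import Pool
import scipy.sparse as sp

def count_chunk(args):
    """Build a partial co-occurrence matrix for one slice of the corpus."""
    sentences, vocab = args
    n = len(vocab)
    m = sp.dok_matrix((n, n))
    for sentence in sentences:
        ids = [vocab[w] for w in sentence if w in vocab]
        for i, target in enumerate(ids):
            for context in ids[max(0, i - 5):i]:
                m[context, target] += 1.0
    return m.tocoo()

if __name__ == "__main__":
    # The dictionary must be fixed up front and shared by all workers.
    vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3, "lazy": 4, "dog": 5}
    chunks = [[["the", "quick", "brown", "fox"]],
              [["the", "lazy", "dog"]]]
    with Pool() as pool:
        partials = pool.map(count_chunk, [(c, vocab) for c in chunks])
    # With a consistent dictionary, merging is just matrix addition.
    total = sum(partials[1:], partials[0])
    print(total.toarray())
```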

This PR does not keep the matrix in memory any more, that's the whole point! :)

piskvorky commented 9 years ago

OK OK I didn't check the new COO matrix in detail. I thought it still used some in-memory counts. Now I see you put extra cleverness + in-memory caching in there.

Looking at the code -- won't there be dangling descriptors after each mmap resize? I'm not sure how NumPy handles munmap during GC, and I don't see explicit releases anywhere.

maciejkula commented 9 years ago

I think numpy closes the file immediately after the memory map object is created: https://github.com/numpy/numpy/blob/master/numpy/core/memmap.py#L281
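If it helps, my mental model (a sketch, not a claim about every numpy version): the file descriptor is closed right after construction, as in the linked source, while the mapping itself is released only when the array and all views on it are garbage collected, so dropping the references is the closest thing to an explicit munmap:

```python
import numpy as np

m = np.memmap("scratch.dat", dtype=np.float64, mode="w+", shape=(1000,))
m[:] = 1.0
m.flush()   # write dirty pages back to the file

# The mapping is torn down when `m` (and any views of it) are collected;
# there is no public munmap call on the memmap object.
del m
```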

piskvorky commented 9 years ago

Hmm, so the file is closed, but the memory is never munmapped? Or is that hidden somewhere else and done during GC?

Anyway, I tried this branch on that Wiki benchmark: 10h39m, 3.9GB peak, 0.4% accuracy.

Looks like something went wrong.

maciejkula commented 9 years ago

Looks like it. I'll try to run your training code locally and investigate.

piskvorky commented 9 years ago

For reference, the peak memory is recorded using the memusg script.

maciejkula commented 8 years ago

This is now un-mergeable. I may come back to this in some other way.