Closed: maciejkula closed this pull request 8 years ago
Cool! So you went with memmap rather than streaming?
I'll check this on wiki again :)
I guess there are two sides of looking at it.
From the 'training Glove on a prepared matrix' side, streaming seems like a natural thing to do, as we only need to see co-occurrence pairs one at a time. But then memmap is streaming from disk, just with a convenient interface and the kernel doing all the hard work. I think the alternative of a user passing in a pair iterator would be quite slow, as we would incur Python-interfacing costs at every iteration. Parallelism would be tricky as well.
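For concreteness, here is a minimal sketch of what "streaming from disk via memmap" looks like in practice; this is not the PR's actual code, and the record layout, file path and chunk size are made up for illustration:

```python
import numpy as np

# Hypothetical on-disk layout: one structured record per non-zero co-occurrence entry.
cooc_dtype = np.dtype([('row', np.int32), ('col', np.int32), ('count', np.float64)])

def iter_cooccurrence_chunks(path, chunk_size=1000000):
    """Yield (rows, cols, counts) chunks; the kernel pages data in and out as needed."""
    entries = np.memmap(path, dtype=cooc_dtype, mode='r')
    for start in range(0, len(entries), chunk_size):
        chunk = entries[start:start + chunk_size]
        yield chunk['row'], chunk['col'], chunk['count']
```

Chunked access like this amortises the Python-level overhead over many entries, which is exactly the cost a per-pair Python iterator would pay on every single co-occurrence pair.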
From the 'preparing the co-occurrence matrix' side, true streaming is ruled out by the global nature of the algorithm, as every sentence processed can modify any entry in the co-occurrence matrix. So I keep going over the entire matrix, and memmap is just a convenient interface for doing so.
I meant streaming over vectors (~co-occurrence matrix rows), as the GloVe algorithm seemed intrinsically streaming in nature, i.e. processing one input vector at a time. Restricting the stream format to NumPy-arrays-only is fine of course, just less flexible.
I think building the co-occurrence matrix should be easy to stream too, via map/reduce, but then the question is, what framework to use? Extra dependencies &c...
Or maybe just simulate M/R via disk memory and file sorting (`sort` & `uniq`), just like Levy & Goldberg did in their implementation? Point is not to keep the entire matrix in RAM at any one point.
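For what it's worth, the sort & uniq idea could be sketched in plain Python roughly like this (not their actual code; file paths, the tiny context window and the pair format are made up, and the external sort is delegated to the Unix `sort` utility):

```python
import subprocess
from itertools import groupby

def write_raw_pairs(corpus, pairs_path):
    """Spill raw (word, context) id pairs to disk; window of 2, left context only, just for illustration."""
    with open(pairs_path, 'w') as out:
        for sentence in corpus:  # corpus: iterable of token-id lists
            for i, word in enumerate(sentence):
                for context in sentence[max(0, i - 2):i]:
                    out.write('%d\t%d\n' % (word, context))

def aggregate_sorted_pairs(pairs_path, sorted_path):
    """External sort (disk-backed), then a single streaming pass to count equal pairs."""
    subprocess.check_call(['sort', '-o', sorted_path, pairs_path])
    with open(sorted_path) as lines:
        for key, group in groupby(lines):
            row, col = key.split()
            yield int(row), int(col), sum(1 for _ in group)
```

At no point does the full matrix live in RAM; only the current run of identical pairs does.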
Anyway, just random ideas.
The implementation is starting to look less and less like a "toy", maybe time to reconsider the Github tagline :)
Hmm, I always thought of it as working one entry at a time. One problem with going a vector at a time is that you'd be operating on the row word all the time, which is a problem for lock-free parallel SGD: its vector would keep getting overwritten by concurrent updates.
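To illustrate what I mean by one entry at a time, here is a simplified sketch of a per-entry SGD step (a single embedding matrix instead of GloVe's separate word and context matrices, plain SGD instead of AdaGrad; the default constants follow the GloVe paper, the variable names are mine):

```python
import numpy as np

def sgd_step(W, b, row, col, count, lr=0.05, x_max=100.0, alpha=0.75):
    """One SGD update for a single (row, col, count) co-occurrence entry."""
    # Weighting function from the GloVe paper: f(x) = min(1, (x / x_max) ** alpha).
    weight = min(1.0, (count / x_max) ** alpha)
    # Prediction error for this single entry.
    error = W[row].dot(W[col]) + b[row] + b[col] - np.log(count)
    grad = lr * weight * error
    # Copy before updating so the two vector updates don't feed into each other.
    w_row, w_col = W[row].copy(), W[col].copy()
    W[row] -= grad * w_col
    W[col] -= grad * w_row
    b[row] -= grad
    b[col] -= grad
```

Each step touches only the two vectors and two biases involved, which is what makes Hogwild-style lock-free parallelism over a shuffled stream of entries plausible.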
Building the co-occurrence matrix is stupidly parallelizable: you just do it separately over parts of the corpus and add the matrices at the end (assuming you use a consistent dictionary). But how to do this cleanly is not entirely clear to me; I am tempted to add a note about this parallelism to a docstring and leave the implementation to the user.
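Something along these lines, assuming the dictionary is built up front; `count_cooccurrences` here is a stand-in for whatever routine actually builds the per-shard matrix, and the window handling is simplified:

```python
import multiprocessing
import scipy.sparse as sp

def count_cooccurrences(args):
    """Build a (vocab x vocab) co-occurrence count matrix for one corpus shard."""
    shard, vocab_size, window = args  # shard: list of token-id lists
    rows, cols, vals = [], [], []
    for sentence in shard:
        for i, word in enumerate(sentence):
            for context in sentence[max(0, i - window):i]:
                rows.append(word)
                cols.append(context)
                vals.append(1.0)
    # Converting COO -> CSR sums up duplicate (row, col) pairs.
    return sp.coo_matrix((vals, (rows, cols)), shape=(vocab_size, vocab_size)).tocsr()

def parallel_cooccurrences(shards, vocab_size, window=10, workers=4):
    """Count each shard in a separate process, then add the partial matrices."""
    pool = multiprocessing.Pool(workers)
    partials = pool.map(count_cooccurrences, [(s, vocab_size, window) for s in shards])
    pool.close()
    pool.join()
    return sum(partials[1:], partials[0])
```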
This PR does not keep the matrix in memory any more, that's the whole point! :)
OK OK I didn't check the new COO matrix in detail. I thought it still used some in-memory counts. Now I see you put extra cleverness + in-memory caching in there.
Looking at the code -- won't there be dangling descriptors after each mmap resize? I'm not sure how NumPy handles munmap during GC, and I don't see explicit releases anywhere.
I think numpy closes the file immediately after the memory map object is created: https://github.com/numpy/numpy/blob/master/numpy/core/memmap.py#L281
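As far as I understand it (an assumption about CPython/numpy behaviour, not something verified against this PR), the descriptor is closed right after the mapping is created, and the mapping itself is released once the memmap array object is garbage collected, so an explicit cleanup pattern would look roughly like this (file name is made up):

```python
import numpy as np

# Create a writable memory-mapped array backed by a file on disk.
arr = np.memmap('cooccurrence.bin', dtype=np.float64, mode='w+', shape=(1000,))
arr[:10] = 1.0
arr.flush()  # push dirty pages to disk
del arr      # drop the last reference so the underlying mmap can be released
```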
Hmm, so the file is closed, but the memory is never `munmap`ed? Or is that hidden somewhere else and done during GC?
Anyway, I tried this branch on that Wiki benchmark: 10h39m, 3.9GB peak, 0.4% accuracy.
Looks like something went wrong.
Looks like it. I'll try to run your training code locally and investigate.
For reference, the peak memory is recorded using the memusg script.
This is now un-mergeable. I may come back to this in some other way.
This PR allows out-of-core computation of both the co-occurrence matrix and word vectors by using memory-mapped numpy arrays.
This should result in memory usage that is constant in the size of the training corpus, at the cost of a performance penalty that depends on the amount of memory available and the speed of the persistent storage used.
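For reference, growing the on-disk arrays as co-occurrence data accumulates would look roughly like the following sketch, since an mmap has a fixed size once created: flush and drop the old memmap, extend the backing file, and re-map it with a larger shape. This is a hedged illustration of the general idea, not necessarily the exact mechanism used in this PR; the function name and dtype are made up.

```python
import numpy as np

def grow_memmap(path, new_shape, dtype=np.float64):
    """Extend the backing file and re-map it with a larger shape."""
    itemsize = np.dtype(dtype).itemsize
    with open(path, 'r+b') as f:
        # Extending via truncate produces a sparse file on most filesystems.
        f.truncate(int(np.prod(new_shape)) * itemsize)
    # The previous mapping is released once its array object is garbage collected.
    return np.memmap(path, dtype=dtype, mode='r+', shape=new_shape)
```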
@piskvorky