vecto-ai / vecto

Doing things with embeddings
http://vecto.space/
Mozilla Public License 2.0

use cached matrix of pairwise dot products where possible #95

Open senderle opened 4 years ago

senderle commented 4 years ago

Currently the 3CosAdd and 3CosMul methods run very slowly. It should be possible to speed up both operations with the strategy used by hyperwords: precalculate all pairwise cosine similarity scores once, and then simply add, subtract, multiply, or divide the resulting scalars for each test.

This is possible because cos(b*, a* − a + b) can be distributed out to cos(b*, a*) − cos(b*, a) + cos(b*, b) (assuming unit-normalized embeddings; the leftover 1/‖a* − a + b‖ factor is the same for every candidate b*, so it doesn't change the argmax). The second form looked worse to me, three dot products instead of one, until I realized that it lets you precompute all possible pairwise dot products once and never calculate any more, whereas every new combination of a*, a, b in the first form requires executing a new dot product.

Given the current architecture this might not be a trivial change, but at least for smaller vocabularies it should definitely be possible, and it could yield a ~100x speedup by leveraging fast linear algebra routines.
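
To make the idea concrete, here's a rough numpy sketch (the names and toy sizes are mine, not vecto's API). With unit-normalized rows, dot products are cosines, so the offset query can be rearranged into a sum of precomputed columns:

```python
import numpy as np

# Toy setup: V unit-normalized word vectors of dimension d (not vecto code).
rng = np.random.default_rng(0)
V, d = 1000, 50
emb = rng.normal(size=(V, d))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Precompute all pairwise cosine similarities once (V x V).
sims = emb @ emb.T

a, a_, b = 10, 20, 30   # arbitrary indices standing in for a, a*, b

# Direct 3CosAdd: fresh dot products against the offset vector.
scores_direct = emb @ (emb[a_] - emb[a] + emb[b])

# Cached 3CosAdd: just add/subtract columns of the precomputed matrix.
scores_cached = sims[:, a_] - sims[:, a] + sims[:, b]

# Same scores (up to float error), so the same answer, with no new dot products.
assert np.allclose(scores_direct, scores_cached)
print(scores_cached.argmax())
```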

This is something I'd be interested in working on, though I may not be able to devote time to it in the near term. If that would be of interest, though, please let me know.

undertherain commented 4 years ago

Hi! Yeah, that's interesting, and it doesn't look like it will take a lot of code to implement. One reservation I have, as you pointed out, is that with a larger-ish vocabulary the cached matrix of dot products will take a huge amount of memory. For instance, I typically see vocabularies on the order of 100K tokens, meaning the matrix would be about 40 GB. So ideally we'd check the available RAM before doing this and give appropriate feedback to the user.
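
For reference, the arithmetic behind that estimate (assuming 32-bit floats):

```python
vocab = 100_000
bytes_per_float32 = 4
# 100_000 * 100_000 * 4 bytes = 4e10 bytes, i.e. about 40 GB for the full VxV cache.
print(vocab * vocab * bytes_per_float32 / 1e9)
```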

senderle commented 4 years ago

Yeah, I was going back over this in my head and remembered that what makes it practical is that the full VxV matrix doesn't need to be stored. It should be possible to store a VxT matrix of cosine values instead, where T is the number of words in the analogy test set.

Then the solution for a given analogy is (VT[:, a_] - VT[:, a] + VT[:, b]).argmax().

Maybe you still couldn't do it with all words for all 40 tests at once. But I bet it would still give a good speedup if you did it just once for each test; that would be, what, 2500 words or so? For 32-bit floats it seems like that's about 2 GB, well within the capacity of most modern laptops.
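
Roughly, and again just as a sketch with made-up names and toy sizes rather than vecto's actual API, the VxT version would look something like this:

```python
import numpy as np

# Toy sizes; at real scale V might be ~100K and T a few thousand.
rng = np.random.default_rng(0)
V, d, T = 10_000, 100, 500
emb = rng.normal(size=(V, d)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

test_words = rng.choice(V, size=T, replace=False)    # indices of test-set words
col = {int(w): i for i, w in enumerate(test_words)}  # vocab index -> column in VT

# V x T cosine matrix: one column per test word, one row per vocabulary word.
VT = emb @ emb[test_words].T

def cos_add(a, a_, b):
    # The argmax still runs over the full vocabulary (the V axis);
    # only the query words a, a*, b need to come from the test set.
    return (VT[:, col[a_]] - VT[:, col[a]] + VT[:, col[b]]).argmax()

a, a_, b = (int(w) for w in test_words[:3])
print(cos_add(a, a_, b))
```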

senderle commented 4 years ago

Wait, no! It's way smaller than that, because the number of distinct words per test is far smaller than the number of analogy questions, since the questions pair every word pair against every other pair. There must be only ~2k distinct terms in the entire BATS test set (is that right?). In that case you could definitely build the full VxT matrix, though the approach of splitting it up by test might still be more practical from an implementation standpoint.

undertherain commented 4 years ago

You can not restrict the space of possible answers to words from the test set. Basically, vocabulary size and content affect the measured performance of different embeddings tremendously. For a fair comparison between two sets of embeddings we first make their vocabularies identical. If only words from the test were allowed, the answer would most often be correct.

senderle commented 4 years ago

This won't restrict the space of possible answers, though: we still score all possible outputs (the V dimension), and only the inputs are limited to a subset (the T dimension).

senderle commented 4 years ago

I wrote a quick example to show what I mean. I believe it works as it should, but let me know if you see something wrong.

(This used to be a gist, but gist wasn't playing nicely with the sample vector file so I just created a standard repo.)

https://github.com/senderle/3cosadd-demo