hyperdimensional-computing / torchhd

Torchhd is a Python library for Hyperdimensional Computing and Vector Symbolic Architectures
https://torchhd.readthedocs.io
MIT License

Experiment with different implementations and their speeds #26

Closed · mikeheddes closed this issue 1 year ago

mikeheddes commented 2 years ago

For example, the functional.ngrams function can be implemented in a number of ways; we should look into which implementation is the fastest on both CPU and GPU. The same goes for other functions, classes, and the loading of datasets.
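
A minimal sketch of how such CPU/GPU timing comparisons could be run (the bench helper, the tensor sizes, and the use of timeit here are illustrative assumptions, not part of the library):

import timeit

import torch
from torchhd import functional

def bench(fn, *args, number=100):
    # Hypothetical micro-benchmark helper: time fn(*args) and, on CUDA,
    # synchronize so the kernel time is actually included in the measurement.
    def run():
        fn(*args)
        if args[0].is_cuda:
            torch.cuda.synchronize()
    run()  # warm-up
    return timeit.timeit(run, number=number) / number

x = functional.random_hv(100, 10000)
print("ngrams on CPU:", bench(functional.ngrams, x))
if torch.cuda.is_available():
    print("ngrams on GPU:", bench(functional.ngrams, x.cuda()))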

mikeheddes commented 2 years ago

If you only pass it one hypervector there are no bigrams. If you had a tensor T of shape 3 by d, i.e. 3 hypervectors of d dimensions, then it would be T[0] * P(T[1]) + T[1] * P(T[2]), where P is permutation and * is binding. Hope this helps.
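
As a sketch of that formula (assuming the bind, bundle, and permute helpers in torchhd.functional act as binding, bundling, and permutation for these hypervectors):

import torch
from torchhd import functional

T = functional.random_hv(3, 10000)  # 3 hypervectors of 10,000 dimensions

# T[0] * P(T[1]) + T[1] * P(T[2]): bind each element with the permuted next
# element, then bundle the two bigrams together.
bigrams = functional.bundle(
    functional.bind(T[0], functional.permute(T[1])),
    functional.bind(T[1], functional.permute(T[2])),
)

The library's own functional.ngrams(T, n=2) computes bigrams over the same tensor, though its permutation convention may differ from the sketch above.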

> On May 13, 2022, at 4:26 PM, linxinyuan wrote:
>
> What's the bigram for torch.tensor([[1,2,3,4,5,6]])?


aglinxinyuan commented 2 years ago

Are distinct_sequence() and ngrams() doing the same thing? For the tensor torch.randint(-1, 2, (100, 3, 10000)), the unbind approach is slower than the for-loop approach on both CPU and GPU.

CPU:
ngrams: 8.08 ms ± 62.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
distinct_sequence: 9.65 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

GPU:
ngrams: 324 µs ± 5.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
distinct_sequence: 357 µs ± 6.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

mikeheddes commented 2 years ago

No, a distinct sequence implements a binding-based sequence where, for the same 3-element tensor T, the result should be: P(P(T[0])) * P(T[1]) * T[2]. Here the elements are permuted by a decreasing amount, such that the last element is not permuted, and all the permuted hypervectors are bound together instead of bundled.

You can make a version of distinct_sequence that uses a for loop instead of unbind and compare their speeds. You can then use that same implementation to see if you can speed up sequence.
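
A possible for-loop sketch of that binding-based sequence (assuming bind and permute from torchhd.functional; this is an illustration, not the library's implementation):

import torch
from torchhd import functional

def distinct_sequence_for_loop(T: torch.Tensor) -> torch.Tensor:
    # Bind all m elements together, permuting earlier elements more: the i-th
    # element is permuted m - 1 - i times, so the last element is unpermuted,
    # e.g. P(P(T[0])) * P(T[1]) * T[2] for m = 3.
    m = T.size(-2)
    result = functional.permute(T[..., 0, :], shifts=m - 1)
    for i in range(1, m):
        result = functional.bind(result, functional.permute(T[..., i, :], shifts=m - 1 - i))
    return result

T = functional.random_hv(3, 10000)
# Expected to agree with the built-in version under these assumptions.
print(torch.equal(distinct_sequence_for_loop(T), functional.distinct_sequence(T)))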

aglinxinyuan commented 2 years ago

Can you give me an example where distinct_sequence() and ngrams() return different results? I understand that the formulas of the two functions are different, but they always return the same results when I test them with tensors of different shapes.

mikeheddes commented 2 years ago

Here's an example:

>>> from torchhd import functional
>>> x = functional.random_hv(10, 1000)
>>> ngrams = functional.ngrams(x)
>>> distinct_sequence = functional.distinct_sequence(x)
>>> ngrams == distinct_sequence
tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        ...])

aglinxinyuan commented 2 years ago

They still return the same result if I change n to 10: ngrams = functional.ngrams(x, n=10). Are they supposed to return the same result on the same tensor when n equals the number of hypervectors?

mikeheddes commented 2 years ago

Yes, that is correct. The n-grams are like an interpolation between a binding-based sequence (the distinct sequence) and a multiset. If you set n to the size of the input sequence there is no difference from the distinct sequence, because there is only one n-gram of that size. And when you set n to 1 it is equivalent to a multiset.
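
A small check of the first extreme, which matches what was observed above (the n = 1 case is stated conceptually rather than executed):

import torch
from torchhd import functional

x = functional.random_hv(10, 1000)

# n equal to the sequence length: only one n-gram exists, which is exactly
# the binding-based (distinct) sequence.
print(torch.equal(functional.ngrams(x, n=10), functional.distinct_sequence(x)))

# The other extreme, n = 1, bundles every element with no permutation or
# binding applied, i.e. a multiset.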

aglinxinyuan commented 2 years ago

I tried more than ten approaches: unbind, for-loop, select, index_select, strided, in-place, out-of-place, clone, empty, and gather. The combination of in-place mul and for-loop is significantly faster than the others. The same approach (#33) can be applied to distinct_sequence() and sequence() after verification.
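
As a rough illustration of that idea (not the exact implementation in #33), an n-gram loop that binds permuted slices with an in-place multiply and then bundles by summation might look like this:

import torch
from torchhd import functional

def ngrams_for_loop(T: torch.Tensor, n: int = 3) -> torch.Tensor:
    # For bipolar MAP hypervectors, binding is elementwise multiplication, so
    # each permuted slice can be multiplied into the accumulator in place.
    m = T.size(-2)
    grams = functional.permute(T[..., : m - n + 1, :], shifts=n - 1)
    for i in range(1, n):
        grams.mul_(functional.permute(T[..., i : m - n + 1 + i, :], shifts=n - 1 - i))
    # Bundle all m - n + 1 n-grams by summing over the sequence dimension.
    return grams.sum(dim=-2)

The in-place mul_ avoids allocating a new intermediate tensor for every binding step, which is presumably where the speedup comes from.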

aglinxinyuan commented 2 years ago

Update on the PyTorch C++ extension: it requires the user to install a C++ compiler, which is easy for Linux users (just one command) but hard for Windows users (they need to download a large installer package and configure it manually). So we will stay away from C++ to keep the library easy to use for Windows users.

mikeheddes commented 2 years ago

Thank you for the update. I agree, I don't think we have the expertise yet to pull off such a project.

mikeheddes commented 2 years ago

I experimented with different ways of generating random hypervectors. Here's what I found:

import torch

def random1(n, d):
    # index_select from the {1, -1} options using random indices
    selection = torch.randint(0, 2, size=(n * d,))
    options = torch.tensor([1, -1])
    hv = torch.index_select(options, 0, selection)
    return hv.view(n, d)

def random2(n, d):
    # map random {0, 1} integers to {-1, 1}
    return torch.randint(0, 2, size=(n, d)) * 2 - 1

def random3(n, d):
    # sample Bernoulli(0.5) floats and map {0, 1} to {-1, 1}
    p = torch.full((n, d), 0.5)
    x = torch.bernoulli(p)
    return x.mul(2.0).sub(1.0)

def random4(n, d):
    # sign of a uniform sample in [-1, 1]
    return torch.empty((n, d)).uniform_(-1, 1).sign_()

def random5(n, d):
    # sign of a standard normal sample
    return torch.randn(n, d).sign_()

def random6(n, d):
    # sign of a uniform sample shifted to [-0.5, 0.5]
    return torch.rand((n, d)).sub_(0.5).sign_()

def random7(n, d):
    # boolean Bernoulli mask mapped to {-1, +1} with torch.where
    res = torch.empty((n, d), dtype=torch.bool).bernoulli_(0.5)
    return torch.where(res, +1, -1)

I got the following results on my 2017 MacBook Pro with a 2.5 GHz Dual-Core Intel Core i7:

$ python -m timeit -s "import torch, torchhd" "torchhd.functional.random1(10, 10000)" 
500 loops, best of 5: 550 usec per loop
$ python -m timeit -s "import torch, torchhd" "torchhd.functional.random2(10, 10000)"
500 loops, best of 5: 737 usec per loop
$ python -m timeit -s "import torch, torchhd" "torchhd.functional.random3(10, 10000)"
500 loops, best of 5: 725 usec per loop
$ python -m timeit -s "import torch, torchhd" "torchhd.functional.random4(10, 10000)"
500 loops, best of 5: 454 usec per loop
$ python -m timeit -s "import torch, torchhd" "torchhd.functional.random5(10, 10000)"
500 loops, best of 5: 560 usec per loop
$ python -m timeit -s "import torch, torchhd" "torchhd.functional.random6(10, 10000)"
500 loops, best of 5: 501 usec per loop
$ python -m timeit -s "import torch, torchhd" "torchhd.functional.random7(10, 10000)"
2000 loops, best of 5: 115 usec per loop

I will make a pull request for the 7th version because it allows passing in a generator, which might be helpful to people. We can easily add a sparsity parameter by altering the 0.5 value.
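
A hypothetical sketch of how the 7th version could expose both a generator and a sparsity parameter (the function name, signature, and the meaning of sparsity here are assumptions, not the API added in the pull request):

import torch

def random_hv_sketch(n, d, *, sparsity=0.5, generator=None, dtype=torch.long, device=None):
    # sparsity is taken here as the probability of a -1 entry; 0.5 reproduces
    # the unbiased random7 above.
    select = torch.empty((n, d), dtype=torch.bool, device=device)
    select.bernoulli_(1.0 - sparsity, generator=generator)
    return torch.where(
        select,
        torch.tensor(+1, dtype=dtype, device=device),
        torch.tensor(-1, dtype=dtype, device=device),
    )

g = torch.Generator().manual_seed(42)
hv = random_hv_sketch(10, 10000, generator=g)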

mikeheddes commented 2 years ago

Created pull request #47.