idiap / fast-transformers

Pytorch library for fast transformer implementations

Implementation of random Fourier features #38

Closed: angeloskath closed this issue 4 years ago

angeloskath commented 4 years ago

We should implement some of the RFF approaches of https://arxiv.org/abs/2009.14794 .

They can be used directly as a feature map with the LinearAttention implementation.

bratao commented 4 years ago

There is an implementation (in JAX) here: https://github.com/google-research/google-research/tree/master/performer/fast_self_attention

angeloskath commented 4 years ago

Clean implementations of random Fourier features for the RBF kernel, as well as the positive random features for the softmax kernel, are now available in the feature-maps branch.

Note that this comes with a few backwards-incompatible changes in the way feature maps are implemented in the library. I am planning to write some docs before merging into master, and then I should probably collect all the changes into a new release.

In case someone wants to try it, something like the following should work:

from functools import partial
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.feature_maps import Favor

t = TransformerEncoderBuilder.from_kwargs(
    n_layers=6,
    n_heads=8,
    query_dimensions=64,
    attention_type="linear",
    feature_map=partial(Favor, n_dims=256)
).get()
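
For completeness, the resulting encoder can then be run on a dummy batch. The short check below is a sketch and assumes the usual input shape of (batch, sequence length, n_heads * query_dimensions):

import torch

# Hypothetical smoke test: feed a random batch through the encoder built above.
# The model dimension is n_heads * query_dimensions = 8 * 64 = 512.
x = torch.rand(2, 1000, 8 * 64)
y = t(x)  # output has the same shape as x: (2, 1000, 512)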

Cheers, Angelos

bratao commented 4 years ago

@angeloskath I'm way too excited for this 🥳

I just added a feature_map to my pipeline as in your last comment and unfortunately it is consistently giving NaN loss after a couple of epochs. Without the feature_map the task converges OK. I'm doing sequence labeling for very long texts.

Should I pay attention to something else?

angeloskath commented 4 years ago

Hi @bratao ,

Thanks for finding this out quickly. To be honest, I wrote it but did not train a transformer with it :-). I was content with the tests passing. But indeed the implementation was numerically quite unstable: basically, I was multiplying two exponentials, which is never a good idea when one can go to 0 and the other to infinity.

So, I have now rewritten it in a numerically more stable way, and you can also pass the argument stabilize=True, which ensures that you will never get infinities from the feature map. I am training on a toy example without stabilize and it works fine.
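
For reference, a minimal sketch of enabling the flag through the builder (assuming stabilize is simply forwarded to Favor like n_dims in the earlier example):

from functools import partial
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.feature_maps import Favor

# Same builder as in the earlier example, but with the stabilized Favor feature map.
t = TransformerEncoderBuilder.from_kwargs(
    n_layers=6,
    n_heads=8,
    query_dimensions=64,
    attention_type="linear",
    feature_map=partial(Favor, n_dims=256, stabilize=True)
).get()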

Let me know if you are still experiencing problems.

Cheers, Angelos

angeloskath commented 4 years ago

So, I just pushed to master with completed (although not extensive) docs and, in addition to Favor, I have added GeneralizedRandomFeatures, which by default implements the ReLU transformer from the Performer paper.
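
For example, a sketch of selecting it through the builder (assuming it is constructed via partial like Favor, with n_dims controlling the feature dimensionality):

from functools import partial
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.feature_maps import GeneralizedRandomFeatures

# Linear attention with ReLU random features; unlike Favor, this feature map
# does not try to approximate the softmax kernel.
t = TransformerEncoderBuilder.from_kwargs(
    n_layers=6,
    n_heads=8,
    query_dimensions=64,
    attention_type="linear",
    feature_map=partial(GeneralizedRandomFeatures, n_dims=256)
).get()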

I believe we now have a good interface to develop any number of feature maps, symmetric or asymmetric. Feel free to report any problems, but I think that after the tests pass I can safely close this issue.

Cheers, Angelos

hadaev8 commented 4 years ago

@angeloskath How much slower should it be compared to the elu feature map? I ran a small experiment: linear attention with elu was on par with softmax, while Favor was ~10x slower.

angeloskath commented 4 years ago

If the sequence length is small, then the feature size becomes larger than the sequence length, so softmax will be faster.

You could try for various sequence lengths to get an idea. It is faster when the sequence length becomes large.
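
A rough timing sketch along those lines (hypothetical benchmark code, not part of the library; it assumes the builder interface and input shape used above):

import time
from functools import partial

import torch
from fast_transformers.builders import TransformerEncoderBuilder
from fast_transformers.feature_maps import Favor

def build(attention_type, feature_map=None):
    # Build a small encoder; feature_map is only relevant for linear attention.
    kwargs = dict(n_layers=4, n_heads=4, query_dimensions=64,
                  attention_type=attention_type)
    if feature_map is not None:
        kwargs["feature_map"] = feature_map
    return TransformerEncoderBuilder.from_kwargs(**kwargs).get().eval()

models = {
    "softmax": build("full"),
    "linear (elu)": build("linear"),
    "linear (favor)": build("linear", partial(Favor, n_dims=128)),
}

for length in [128, 512, 2048, 8192]:
    x = torch.rand(1, length, 4 * 64)
    for name, model in models.items():
        with torch.no_grad():
            start = time.time()
            model(x)
        print(name, length, time.time() - start)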

How large is the n_dims that you are using?

hadaev8 commented 4 years ago

@angeloskath Well, that was a bit unexpected, because elu at small lengths is on par with softmax and becomes better at around 200+. I tried a transformer with 4x64 and Favor with n_dims=128 (just scaled down from the example above). So I need a length greater than 128? Is there a rule for the target length and the n_dims value? If I have a length of 256, should I use an n_dims of 128, 64, or smaller?

andompesta commented 3 years ago

Hi, first of all thanks for the beautiful library.

But I am a bit confused: why do we have to "sample" a new set of orthogonal weights for every forward call? Shouldn't this be done only once? Otherwise the output is not deterministic.

self.feature_map.new_feature_map()

angeloskath commented 3 years ago

Hi,

Thanks for the good words!

So, for ω ~ N(0, I), exp(ω x - |x|^2) is an unbiased feature map (I have omitted some constants) of the exponential of the dot product. This means that E_ω[<exp(ω x - |x|^2), exp(ω y - |y|^2)>] = exp(<x, y>). But that does not mean that the error will be small for all ω. That is why we have to increase the dimensionality, so that we are averaging over many ω.
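
As a concrete sanity check, the identity can be verified numerically with a quick Monte Carlo estimate (a sketch with the omitted constants restored, i.e. using exp(ω·x - |x|^2/2)):

import torch

torch.manual_seed(0)
d, n_samples = 16, 1_000_000

# Two small vectors; keeping them small keeps the estimator's variance low.
x = 0.3 * torch.randn(d)
y = 0.3 * torch.randn(d)

# Sample many ω ~ N(0, I) and average the product of the two feature values.
omega = torch.randn(n_samples, d)
phi_x = torch.exp(omega @ x - x.dot(x) / 2)
phi_y = torch.exp(omega @ y - y.dot(y) / 2)

print(torch.mean(phi_x * phi_y).item())  # Monte Carlo estimate of exp(<x, y>)
print(torch.exp(x.dot(y)).item())        # exact value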

However, even if we average over many ω, keeping them fixed throughout learning would result in some bias because we are not sampling any more. For instance it would be like training with SGD using a fixed subsample of your dataset. I mean it will work, but why not use all of it.

Having said that, keeping them fixed is not necessarily a bad idea. In linear attention there is a tradeoff between expressivity and speed. Using Fourier features is a really elegant way to increase the expressivity by increasing the feature dimensionality. It is not necessary that the feature map is an approximation of softmax. (see also GeneralizedRandomFeatures).

Regarding implementing this, you could easily make a FeatureMap that wraps any RandomFourierFeatures and only calls new_feature_map() if omega is all zeros.
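
A minimal sketch of that idea (a hypothetical wrapper, not part of the library; it checks a flag instead of inspecting omega, and assumes the feature map exposes new_feature_map() and is called directly on the input tensor):

import torch

class FixedRandomFeatures(torch.nn.Module):
    # Wrap an existing feature map and draw its random features only once,
    # keeping them fixed for all subsequent forward calls.
    def __init__(self, inner_feature_map):
        super().__init__()
        self.inner = inner_feature_map
        self._drawn = False

    def new_feature_map(self, *args, **kwargs):
        # Sample ω only the first time; later calls are no-ops.
        if not self._drawn:
            self.inner.new_feature_map(*args, **kwargs)
            self._drawn = True

    def forward(self, x):
        # Delegate the actual feature computation to the wrapped feature map.
        return self.inner(x)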

Cheers, Angelos

andompesta commented 3 years ago

Thanks for the clarification (I'm not really familiar with kernel methods). I have read the paper more carefully, and Fig. 5 shows how resampling (feature redrawing) is needed to match the Transformer's performance. However, I haven't found an explicit indication of their resampling policy.

I think I need to play around with it a bit (suggestions accepted)

Cheers, Sandro