Transformer Lack of Embedding Layer and Positional Encodings

TrentBrick commented 5 years ago

The Transformer implementation docs (https://pytorch.org/docs/stable/nn.html?highlight=transformer#torch.nn.Transformer) state that they implement the original paper but fail to acknowledge that they don’t implement the following:

Layer Norm as the default normalization option.
Positional Encodings
Embeddings before the encoding and decoding step

It’s fine that these are all not implemented directly in the module but making it more clear that they aren’t and were in the original paper would be helpful.

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @brianjo @mruberry @albanD @walterddr @bhosmer @cpuhrsch @anjali411 @zhangguanheng66 @jlin27

jeffreyksmithjr commented 5 years ago

Tagging this to reflect that we could implement the functionality described or just document the existing implementation. Both would seem to be somewhat valid.

andreaskoepf commented 5 years ago

* Layer Norm as the default normalization option.

Could you elaborate what you are missing regarding LayerNorm as default normalization? TransformerEncoderLayer and TransformerDecoderLayer both use LayerNorm (hard built-in) and the nn.Transformer creates a TransformerEncoder with additional final LayerNorm by default.

I think the Transformer abstraction that was implemented is really good as it is. Maybe fairseq could be mentioned for people looking for full NLP models which offer embedding to decoding.

TrentBrick commented 5 years ago

Re LayerNorm: Huh, sorry that I missed that. I guess I got confused by the fact that there is an option for further normalization in the "TransformerEncoder" and Decoder.

I don't think anything more needs to be implemented just better documentation on what people are and aren't getting.

Embeddings aren't a big deal but were used in the original paper and I think it should be flagged that they aren't here.

The same is true for the positional encodings but these are much more important for the model's performance and difficult to implement. (this was noted here https://github.com/pytorch/pytorch/issues/10459#issuecomment-413116713)

Vichoko commented 4 years ago

Recently found this snippet in here that implements PositionalEncoding that can be easily added at the beggining of your forward(x) and before calling the transformer encoder forward.

I updated to work for more recent versions:

class PositionalEncoder(torch.nn.Module):
    def __init__(self, d_model, max_seq_len=160):
        super().__init__()
        self.d_model = d_model
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                pe[pos, i] = \
                    math.sin(pos / (10000 ** ((2 * i) / d_model)))
                pe[pos, i + 1] = \
                    math.cos(pos / (10000 ** ((2 * (i + 1)) / d_model)))
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        with torch.no_grad():
            x = x * math.sqrt(self.d_model)
            seq_len = x.size(1)
            pe = self.pe[:, :seq_len]
            x = x + pe
            return x

Edit: Surely that nested for can be optimized with tensor logic because for big sequences or big d_model parameter it can take some time to initialize the module.

volpato30 commented 4 years ago

Recently found this snippet in here that implements PositionalEncoding that can be easily added at the beggining of your forward(x) and before calling the transformer encoder forward.

I updated to work for more recent versions:
class PositionalEncoder(torch.nn.Module):
    def __init__(self, d_model, max_seq_len=160):
        super().__init__()
        self.d_model = d_model
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                pe[pos, i] = \
                    math.sin(pos / (10000 ** ((2 * i) / d_model)))
                pe[pos, i + 1] = \
                    math.cos(pos / (10000 ** ((2 * (i + 1)) / d_model)))
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        with torch.no_grad():
            x = x * math.sqrt(self.d_model)
            seq_len = x.size(1)
            pe = self.pe[:, :seq_len]
            x = x + pe
            return x
Edit: Surely that nested for can be optimized with tensor logic because for big sequences or big d_model parameter it can take some time to initialize the module.

May I ask why use no_grad in the forward function? Wouldn't it prevent the update of word embeddings?

kryptec commented 4 years ago

This tutorial uses the TransformerEncoder class to implement the original paper, and looks like it has a cleaner positional encoding.

https://pytorch.org/tutorials/beginner/transformer_tutorial.html

mmwebster commented 3 years ago

+1 on it being odd that nn.Transformer doesn't implement positional encoding. The lack of word embeddings makes sense, but it seems like the vast majority of transformers would require positional encodings.

Also, example usage for nn.Transformer training and evaluation would improve the docs a lot. It seems like nn.Transformer has been abandoned in favor of the encoder, decoder, etc. classes that comprise it, as suggested by the tutorial that @kryptec referenced. Seems like it would be better to have a tutorial for nn.Transformer, and details on how to override pieces of functionality, like how positional encoding is performed.

[Edit: I could make a PR for nn.Transformer docs]

Vichoko commented 3 years ago

Yeah, right. Also it could be useful to provide an implementation of new alternative positional encodings, for example, relative positional encodings.

david-waterworth commented 3 years ago

I agree positional encoding should really be implemented and part of the transformer - I'm less concerned that the embedding is separate.

In particular, the input shape of the PyTorch transformer is different from other implementations (src is SNE rather than NSE) meaning you have to be very careful using common positional encoding implementations.

I also find it a little odd/confusing that src is SNE yet src_key_padding_mask is NS

jbschlosser commented 3 years ago

Bumping to high pri due to activity & because positional encodings are generally essential for using the Transformer layers effectively

narain1 commented 3 years ago

can we just push the positional encoding from pytorch transformer tutorial to the nn module? and I find it to be in terms with the paper

granitdula commented 3 years ago

I'm also in agreement with having positional encoding included by default in the nn.Transformer class.

aohan237 commented 3 years ago

vote for this feature

ankit61 commented 3 years ago

I too agree that positional encodings should be a part of the transformer class. However, instead of hard coding them, it is perhaps better to pass a flag in the constructor about whether to use positional encodings or not (can default to true). There definitely can be (and are) cases when positional encodings are not necessary, so such flexibility would be important

david-waterworth commented 3 years ago

I agree with @ankit61 - many transformer models use position embeddings in place of the original sinusoidal encoding, and some may use no encoding. So the actual embedding layer should be configurable

hanjiemicro commented 2 years ago

Is there any progress now?

andreaskoepf commented 2 years ago

Just for reference here another impl in pytorch by @lucidrains in his perceiver (IO) impl:

perceiver_pytorch.py#L32-L42:

def fourier_encode(x, max_freq, num_bands = 4):
    x = x.unsqueeze(-1)
    device, dtype, orig_x = x.device, x.dtype, x

    scales = torch.linspace(1., max_freq / 2, num_bands, device = device, dtype = dtype)
    scales = scales[(*((None,) * (len(x.shape) - 1)), Ellipsis)]

    x = x * scales * pi
    x = torch.cat([x.sin(), x.cos()], dim = -1)
    x = torch.cat((x, orig_x), dim = -1)
    return x

and use here perceiver_pytorch.py#L228-L236:

# calculate fourier encoded positions in the range of [-1, 1], for all axis

axis_pos = list(map(lambda size: torch.linspace(-1., 1., steps = size, device = device), axis))
pos = torch.stack(torch.meshgrid(*axis_pos), dim = -1)
enc_pos = fourier_encode(pos, self.max_freq, self.num_freq_bands)
enc_pos = rearrange(enc_pos, '... n d -> ... (n d)')
enc_pos = repeat(enc_pos, '... -> b ...', b = b)

data = torch.cat((data, enc_pos), dim = -1)

Regarding advanced use cases here a description of 2D position encoding for pixel data from the PerceiveIO paper appendix D "Positional encodings for image and audio experiments":

"We use a 2D Fourier feature positional encoding (..) using sine and cosine bands with frequencies spaced linearly from a minimum frequency to a maximum frequency. We use 64 sine/cosine bands per dimension in all settings. The minimum frequency is always set to the minimum frequency of the input signal, corresponding to a single full oscillation over the input dimension. The maximum frequency is typically set to the input’s Nyquist frequency (e.g. 112 cycles for an image with 224 pixels per dimension). As in [35], the input position used to construct the Fourier frequencies is scaled to [-1, 1] for each input dimension. For example, the upper left corner of an image is at position [-1, -1] while the bottom right corner is at position [1, 1]. We follow the same strategy using 1D and 3D Fourier feature positional encoding for audio’s time and video’s spatiotemporal inputs, respectively."

rodosingh commented 1 year ago

Recently found this snippet in here that implements PositionalEncoding that can be easily added at the beggining of your forward(x) and before calling the transformer encoder forward.

I updated to work for more recent versions:
class PositionalEncoder(torch.nn.Module):
    def __init__(self, d_model, max_seq_len=160):
        super().__init__()
        self.d_model = d_model
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                pe[pos, i] = \
                    math.sin(pos / (10000 ** ((2 * i) / d_model)))
                pe[pos, i + 1] = \
                    math.cos(pos / (10000 ** ((2 * (i + 1)) / d_model)))
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        with torch.no_grad():
            x = x * math.sqrt(self.d_model)
            seq_len = x.size(1)
            pe = self.pe[:, :seq_len]
            x = x + pe
            return x
Edit: Surely that nested for can be optimized with tensor logic because for big sequences or big d_model parameter it can take some time to initialize the module.

According to @Vichoko, math.sqrt(self.d_model) is there in forward() call. But according to official pytorch tutorial, we don't find any square-root applied (Link). Please clarify on this!

nkalpakis21 commented 1 day ago

i see this has been open for awhile. is there trouble getting a dev to fix this issue? what is blocking this for so long

pytorch / pytorch

Transformer Lack of Embedding Layer and Positional Encodings #24826