Open TrentBrick opened 5 years ago
Tagging this to reflect that we could implement the functionality described or just document the existing implementation. Both would seem to be somewhat valid.
* Layer Norm as the default normalization option.
Could you elaborate on what you are missing regarding LayerNorm as the default normalization? TransformerEncoderLayer and TransformerDecoderLayer both use LayerNorm (hard-coded), and nn.Transformer creates a TransformerEncoder with an additional final LayerNorm by default.
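For anyone who wants to check this on their own install, here is a minimal sketch that inspects the modules directly. The layer sizes are arbitrary, and the attribute names (`norm1`, `norm2`, `norm3`, `norm`) are the ones used by current PyTorch releases; treat them as an assumption if you are on a much older version.

```python
import torch.nn as nn

# Small, arbitrary sizes so the model builds quickly.
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=64)

# Each encoder layer carries two LayerNorm modules, each decoder layer three.
first_enc = model.encoder.layers[0]
first_dec = model.decoder.layers[0]
layer_norms_in_enc = [type(first_enc.norm1).__name__, type(first_enc.norm2).__name__]

# nn.Transformer also attaches a final LayerNorm to the encoder and decoder stacks.
final_enc_norm = model.encoder.norm
final_dec_norm = model.decoder.norm

# A TransformerEncoder built directly has no final norm unless one is passed in,
# which is the "option for further normalization" on TransformerEncoder.
bare_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, dim_feedforward=64),
    num_layers=2,
)
print(layer_norms_in_enc, type(final_enc_norm).__name__, bare_encoder.norm)
```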
I think the Transformer abstraction that was implemented is really good as it is. Maybe fairseq could be mentioned for people looking for full NLP models which offer embedding to decoding.
Re LayerNorm: Huh, sorry that I missed that. I guess I got confused by the fact that there is an option for further normalization in the "TransformerEncoder" and Decoder.
I don't think anything more needs to be implemented, just better documentation of what people are and aren't getting.
Embeddings aren't a big deal but were used in the original paper and I think it should be flagged that they aren't here.
The same is true for the positional encodings but these are much more important for the model's performance and difficult to implement. (this was noted here https://github.com/pytorch/pytorch/issues/10459#issuecomment-413116713)
I recently found this snippet here that implements PositionalEncoding. It can easily be added at the beginning of your forward(x), before calling the transformer encoder's forward.
I updated it to work with more recent versions:
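import math

import torch


class PositionalEncoder(torch.nn.Module):
    def __init__(self, d_model, max_seq_len=160):
        super().__init__()
        self.d_model = d_model
        # Precompute the sinusoidal table once; d_model is assumed to be even.
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                # i is already the even index 2k, so the paper's exponent
                # 2k / d_model is i / d_model for both the sin and cos terms.
                pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))
        pe = pe.unsqueeze(0)  # (1, max_seq_len, d_model), broadcasts over the batch
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        with torch.no_grad():
            x = x * math.sqrt(self.d_model)
            seq_len = x.size(1)
            pe = self.pe[:, :seq_len]
            x = x + pe
            return x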
Edit: Surely that nested for loop can be optimized with tensor operations, because for long sequences or a large d_model it can take some time to initialize the module.
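On the nested loop: a minimal vectorized sketch, assuming an even d_model. The function name `sinusoidal_table` is made up for illustration. It uses the exponent from the original paper, i.e. 10000^(2k/d_model) for the column pair (2k, 2k+1), computed in log space.

```python
import math

import torch


def sinusoidal_table(max_seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_seq_len, d_model) table of sinusoidal encodings.

    Hypothetical helper; assumes d_model is even.
    """
    position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)  # (S, 1)
    even_idx = torch.arange(0, d_model, 2, dtype=torch.float32)             # (E/2,)
    # 1 / 10000^(i / d_model) for each even column index i
    div_term = torch.exp(-math.log(10000.0) * even_idx / d_model)
    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return pe


table = sinusoidal_table(max_seq_len=160, d_model=512)
print(table.shape)
```

The result can be passed to `register_buffer` after an `unsqueeze(0)` (batch-first input) or `unsqueeze(1)` (sequence-first input), in place of the loop.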
May I ask why use no_grad in the forward function? Wouldn't it prevent the update of word embeddings?
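A small check of the concern, as a sketch: the sizes are arbitrary and `pe` is a zero tensor standing in for the registered buffer. Anything computed inside `torch.no_grad()` is cut off from the autograd graph, so gradients would not reach an embedding layer that feeds the module.

```python
import math

import torch
import torch.nn as nn

d_model = 8
embedding = nn.Embedding(100, d_model)
tokens = torch.randint(0, 100, (2, 5))   # (batch, seq_len)
pe = torch.zeros(1, 5, d_model)          # stand-in for the registered buffer

# As in the snippet: scaling inside no_grad detaches the result
# from the embedding weights.
with torch.no_grad():
    scaled_detached = embedding(tokens) * math.sqrt(d_model)
out_detached = scaled_detached + pe

# Without no_grad the graph back to the embedding weights is kept.
out_tracked = embedding(tokens) * math.sqrt(d_model) + pe
out_tracked.sum().backward()

print(out_detached.requires_grad)          # False
print(out_tracked.requires_grad)           # True
print(embedding.weight.grad is not None)   # True
```

Since a buffer does not require gradients anyway, dropping the `no_grad` block should be safe when the input comes from a trainable embedding.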
This tutorial uses the TransformerEncoder class to implement the original paper, and looks like it has a cleaner positional encoding.
https://pytorch.org/tutorials/beginner/transformer_tutorial.html
+1 on it being odd that nn.Transformer doesn't implement positional encoding. The lack of word embeddings makes sense, but it seems like the vast majority of transformers would require positional encodings.
Also, example usage for nn.Transformer training and evaluation would improve the docs a lot. It seems like nn.Transformer has been abandoned in favor of the encoder, decoder, etc. classes that comprise it, as suggested by the tutorial that @kryptec referenced. Seems like it would be better to have a tutorial for nn.Transformer, and details on how to override pieces of functionality, like how positional encoding is performed.
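In the meantime, a minimal sketch of what a single training step with nn.Transformer might look like. The sizes and the loss are placeholders, and the inputs are random tensors standing in for already-embedded, position-encoded sequences, since nn.Transformer adds neither itself.

```python
import torch
import torch.nn as nn

S, T, N, E = 10, 7, 4, 32   # source length, target length, batch size, embedding dim

model = nn.Transformer(d_model=E, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=64, dropout=0.0)

# Sequence-first inputs: (S, N, E) and (T, N, E).
src = torch.rand(S, N, E)
tgt = torch.rand(T, N, E)

# Causal mask so that target position t only attends to positions <= t.
tgt_mask = model.generate_square_subsequent_mask(T)

out = model(src, tgt, tgt_mask=tgt_mask)   # (T, N, E)
loss = out.pow(2).mean()                   # placeholder loss, for illustration only
loss.backward()
print(out.shape)
```

In a real model you would put an embedding and positional encoding in front, and a linear projection to the vocabulary after `out`.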
[Edit: I could make a PR for nn.Transformer docs]
Yeah, right. Also it could be useful to provide an implementation of new alternative positional encodings, for example, relative positional encodings.
I agree positional encoding should really be implemented and part of the transformer - I'm less concerned that the embedding is separate.
In particular, the input shape of the PyTorch transformer differs from other implementations (src is SNE rather than NSE, i.e. sequence-first rather than batch-first), meaning you have to be very careful when using common positional encoding implementations.
I also find it a little odd/confusing that src is SNE yet src_key_padding_mask is NS.
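To make the shape mismatch concrete, a small sketch with arbitrary sizes. `pe` is a random stand-in for a real encoding table. With the default sequence-first layout, a (S, E) table needs its extra axis in the middle, while the padding mask stays batch-first.

```python
import torch
import torch.nn as nn

S, N, E = 6, 3, 16   # sequence length, batch size, embedding dim

layer = nn.TransformerEncoderLayer(d_model=E, nhead=4, dim_feedforward=32, dropout=0.0)
encoder = nn.TransformerEncoder(layer, num_layers=1)

src = torch.rand(S, N, E)   # sequence-first input (SNE)

# The padding mask is batch-first (N, S); True marks positions to ignore.
lengths = torch.tensor([6, 4, 2])
src_key_padding_mask = torch.arange(S).unsqueeze(0) >= lengths.unsqueeze(1)   # (N, S)

# For SNE input the table is unsqueezed at dim 1, not dim 0 as in
# batch-first implementations.
pe = torch.rand(S, E)               # stand-in for a real encoding table
src_sne = src + pe.unsqueeze(1)     # (S, 1, E) broadcasts over the batch axis

out = encoder(src_sne, src_key_padding_mask=src_key_padding_mask)   # (S, N, E)
print(out.shape, src_key_padding_mask.shape)
```

Newer PyTorch releases also accept `batch_first=True` on these layers, which makes src NSE and lines it up with the mask.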
Bumping to high pri due to activity & because positional encodings are generally essential for using the Transformer layers effectively
Can we just move the positional encoding from the PyTorch transformer tutorial into the nn module? I find it to be in line with the paper.
I'm also in agreement with having positional encoding included by default in the nn.Transformer class.
vote for this feature
I too agree that positional encodings should be a part of the transformer class. However, instead of hard coding them, it is perhaps better to pass a flag in the constructor about whether to use positional encodings or not (can default to true). There definitely can be (and are) cases when positional encodings are not necessary, so such flexibility would be important
I agree with @ankit61 - many transformer models use position embeddings in place of the original sinusoidal encoding, and some may use no encoding. So the actual embedding layer should be configurable
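One way this could look, purely as a sketch: `PositionedEncoder` and its `pos_encoding` argument are hypothetical and not part of torch.nn. It wraps nn.TransformerEncoder and lets the caller pick sinusoidal encodings, learned position embeddings, or none.

```python
import math
from typing import Optional

import torch
import torch.nn as nn


class PositionedEncoder(nn.Module):
    """Hypothetical wrapper (not a torch.nn API) with a configurable position scheme."""

    def __init__(self, d_model: int, nhead: int, num_layers: int,
                 max_len: int = 512, pos_encoding: Optional[str] = "sinusoidal"):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model, dropout=0.0)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.pos_encoding = pos_encoding
        if pos_encoding == "sinusoidal":
            position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
            even_idx = torch.arange(0, d_model, 2, dtype=torch.float32)
            div_term = torch.exp(-math.log(10000.0) * even_idx / d_model)
            pe = torch.zeros(max_len, 1, d_model)   # (S, 1, E) for sequence-first input
            pe[:, 0, 0::2] = torch.sin(position * div_term)
            pe[:, 0, 1::2] = torch.cos(position * div_term)
            self.register_buffer("pe", pe)
        elif pos_encoding == "learned":
            self.pos_embedding = nn.Embedding(max_len, d_model)
        elif pos_encoding is not None:
            raise ValueError(f"unknown pos_encoding: {pos_encoding!r}")

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        # src: (S, N, E), already embedded
        seq_len = src.size(0)
        if self.pos_encoding == "sinusoidal":
            src = src + self.pe[:seq_len]
        elif self.pos_encoding == "learned":
            positions = torch.arange(seq_len, device=src.device)
            src = src + self.pos_embedding(positions).unsqueeze(1)
        return self.encoder(src)


x = torch.rand(10, 2, 32)
for mode in ("sinusoidal", "learned", None):
    print(mode, PositionedEncoder(32, 4, 1, pos_encoding=mode)(x).shape)
```

A string (or callable) argument covers more cases than a boolean flag, since the "on" setting is not unique.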
Is there any progress now?
Just for reference, here is another implementation in PyTorch by @lucidrains, from his Perceiver (IO) implementation:
from math import pi

import torch


def fourier_encode(x, max_freq, num_bands = 4):
    x = x.unsqueeze(-1)
    device, dtype, orig_x = x.device, x.dtype, x

    # Frequencies spaced linearly from 1 up to max_freq / 2
    scales = torch.linspace(1., max_freq / 2, num_bands, device = device, dtype = dtype)
    # Prepend singleton axes so that scales broadcasts against x
    scales = scales[(*((None,) * (len(x.shape) - 1)), Ellipsis)]

    x = x * scales * pi
    x = torch.cat([x.sin(), x.cos()], dim = -1)
    x = torch.cat((x, orig_x), dim = -1)
    return x
and its use here, perceiver_pytorch.py#L228-L236:
# calculate fourier encoded positions in the range of [-1, 1], for all axis
axis_pos = list(map(lambda size: torch.linspace(-1., 1., steps = size, device = device), axis))
pos = torch.stack(torch.meshgrid(*axis_pos), dim = -1)
enc_pos = fourier_encode(pos, self.max_freq, self.num_freq_bands)
enc_pos = rearrange(enc_pos, '... n d -> ... (n d)')
enc_pos = repeat(enc_pos, '... -> b ...', b = b)
data = torch.cat((data, enc_pos), dim = -1)
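The fragment above depends on einops and on variables from the surrounding method, so here is a self-contained sketch of the same idea for a 2D grid using plain torch. `fourier_features` is a made-up name, the 8x8 grid and 4 bands are arbitrary, and `flatten` stands in for the `rearrange` call.

```python
import math

import torch


def fourier_features(pos: torch.Tensor, max_freq: float, num_bands: int) -> torch.Tensor:
    """Hypothetical helper: pos is (..., D) with coordinates in [-1, 1]."""
    pos = pos.unsqueeze(-1)                                  # (..., D, 1)
    freqs = torch.linspace(1.0, max_freq / 2, num_bands)     # (K,), linearly spaced
    angles = pos * freqs * math.pi                           # (..., D, K)
    feats = torch.cat([angles.sin(), angles.cos(), pos], dim=-1)   # (..., D, 2K + 1)
    return feats.flatten(start_dim=-2)                       # (..., D * (2K + 1))


height, width = 8, 8
ys = torch.linspace(-1.0, 1.0, steps=height)
xs = torch.linspace(-1.0, 1.0, steps=width)
grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)   # (H, W, 2)

enc = fourier_features(grid, max_freq=float(width), num_bands=4)
print(enc.shape)   # torch.Size([8, 8, 18])
```

Each position gets 2 * (2 * 4 + 1) = 18 features: sine and cosine at 4 frequencies plus the raw coordinate, per axis. These would then be concatenated onto the input data along the last dimension.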
Regarding advanced use cases, here is a description of 2D positional encoding for pixel data from the Perceiver IO paper, Appendix D, "Positional encodings for image and audio experiments":
"We use a 2D Fourier feature positional encoding (..) using sine and cosine bands with frequencies spaced linearly from a minimum frequency to a maximum frequency. We use 64 sine/cosine bands per dimension in all settings. The minimum frequency is always set to the minimum frequency of the input signal, corresponding to a single full oscillation over the input dimension. The maximum frequency is typically set to the input’s Nyquist frequency (e.g. 112 cycles for an image with 224 pixels per dimension). As in [35], the input position used to construct the Fourier frequencies is scaled to [-1, 1] for each input dimension. For example, the upper left corner of an image is at position [-1, -1] while the bottom right corner is at position [1, 1]. We follow the same strategy using 1D and 3D Fourier feature positional encoding for audio’s time and video’s spatiotemporal inputs, respectively."
According to @Vichoko's snippet, math.sqrt(self.d_model) is applied in the forward() call. But in the official PyTorch tutorial, we don't find any square root applied (Link).
Please clarify this!
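For context: the original paper (Section 3.4) multiplies the embedding weights by sqrt(d_model). As far as I can tell, the tutorial applies the same factor in the model's forward, right after the embedding lookup and before the positional encoding module, rather than inside that module. The sketch below, with made-up helper names and a random stand-in for the encoding table, shows that the two placements give the same result.

```python
import math

import torch
import torch.nn as nn

d_model, vocab_size = 16, 50
embedding = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (5, 2))   # (S, N)
pe = torch.rand(5, 1, d_model)                  # stand-in for a real encoding table


def add_positions(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical positional module that only adds the table.
    return x + pe


def scale_and_add_positions(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical positional module that also applies the sqrt(d_model) factor.
    return add_positions(x * math.sqrt(d_model))


# Option A: scale in the model, keep the positional module scale-free.
a = add_positions(embedding(tokens) * math.sqrt(d_model))
# Option B: scale inside the positional module.
b = scale_and_add_positions(embedding(tokens))

print(torch.allclose(a, b))   # True
```

So the difference is where the factor lives, not whether it is applied. The thing to avoid is applying it in both places.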
I see this has been open for a while. Is there trouble getting a dev to fix this issue? What has been blocking it for so long?
The Transformer implementation docs (https://pytorch.org/docs/stable/nn.html?highlight=transformer#torch.nn.Transformer) state that they implement the original paper but fail to acknowledge that they don’t implement the following:
It’s fine that these are all not implemented directly in the module but making it more clear that they aren’t and were in the original paper would be helpful.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @brianjo @mruberry @albanD @walterddr @bhosmer @cpuhrsch @anjali411 @zhangguanheng66 @jlin27