
Query regarding lm_cross_sentence transformer implementation #36

Closed ashu5644 closed 4 years ago

ashu5644 commented 4 years ago

Hi guys, is the memory implementation in the transformer config in 2019-lm-cross-sentence similar to Transformer-XL, or is it different? And what is the reason for not using positional encoding?

kazuki-irie commented 4 years ago

> Is the memory implementation in the transformer config in 2019-lm-cross-sentence similar to Transformer-XL, or is it different?

Yes, it is segment-wise context carry-over similar to the one in Transformer-XL (i.e. while processing a segment, the model has access to both the current segment and the states from the previous segment).
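For illustration only (this is not RETURNN code), here is a minimal NumPy sketch of the idea, assuming a single attention head, no learned projections, and no gradient handling; `attend_with_memory` and all variable names are made up for this sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(q, k, v, mem_k=None, mem_v=None):
    """Self-attention over the current segment, optionally extended with
    cached keys/values from the previous segment (the 'memory')."""
    if mem_k is not None:
        k = np.concatenate([mem_k, k], axis=0)   # (mem_len + cur_len, d)
        v = np.concatenate([mem_v, v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (cur_len, mem_len + cur_len)
    # (a causal mask over the current-segment part would be applied here)
    return softmax(scores) @ v

# Toy usage: process segment 1, cache its states, reuse them for segment 2.
d, seg_len = 8, 4
rng = np.random.default_rng(0)
seg1 = rng.standard_normal((seg_len, d))
seg2 = rng.standard_normal((seg_len, d))

out1 = attend_with_memory(seg1, seg1, seg1)                          # no memory yet
out2 = attend_with_memory(seg2, seg2, seg2, mem_k=seg1, mem_v=seg1)  # also sees segment 1
```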

> And what is the reason for not using positional encoding?

In our default setups (sentence-level language modeling), we do not use any explicit positional encoding in Transformer language models, as that typically results in a slightly better perplexity. If you are interested in more details, please check our Interspeech 2019 paper: https://arxiv.org/abs/1905.04226

That's why we had this as one of the variants we studied in our ASRU 2019 paper, "Training Language Models for Long-Span Cross-Sentence Evaluation" (available at https://www-i6.informatik.rwth-aachen.de/web/Publications/index.html).

ashu5644 commented 4 years ago

The memory implementation part is clear to me now, but the positional encoding part is still unclear. In the paper https://arxiv.org/abs/1905.04226 you mention: "However, in the autoregressive problem where a new token is provided to the model at each time step, the amount of information the model has access to strictly increases from left to right at the lowest level of the network, which should provide some positional information by its own". But in the default Transformer implementation, the attention weights over previous time steps are computed for all time steps simultaneously, in parallel, via matrix multiplication, i.e. not like RNNs, where each step is computed sequentially. So how does this accumulate information going from left to right? And the input to the first self-attention layer is the embedding from a linear layer, which does not carry any positional information.

kazuki-irie commented 4 years ago

I realized you are the one who asked a similar question previously: https://github.com/rwth-i6/returnn/issues/254

It might be useful to review our answer there (with the link to the slides), as it is directly related... Anyway:

> But in the default Transformer implementation, the attention weights over previous time steps are computed for all time steps simultaneously, in parallel, via matrix multiplication, i.e. not like RNNs, where each step is computed sequentially.

Independent of whether your model is a Transformer or an RNN, it is language modeling (I am assuming your question is about standard autoregressive language modeling): to predict the token at position t, the model only has access to the left context, i.e. the tokens from position 0 to t-1. Now if you move to the next position (t+1), the model has access to one more position, the tokens from 0 to t. Each time step gives the model access to more information.

This access to different amounts of information (numbers of positions) at different time steps is enforced via masking, which is applied after the parallel computation you are referring to in the case of Transformers.
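As an illustration (again, nothing specific to RETURNN), here is a minimal NumPy sketch of causal masking applied after the fully parallel score computation; the function and variable names are made up for this sketch:

```python
import numpy as np

def causal_self_attention(x):
    """All attention scores are computed in parallel, then a causal mask
    ensures that position t only attends to positions 0..t."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (T, T), fully parallel
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block access to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.default_rng(0).standard_normal((5, 8))
out = causal_self_attention(x)
# Row t of the attention weights has t+1 non-zero entries: the model sees
# strictly more context as t increases, which is the positional signal.
```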

Does this clarify things?

Note that, again, this is nothing specific to RETURNN, but part of the basic understanding of Transformer language models.

ashu5644 commented 4 years ago

OK, yes, this way the model sees more information as it goes from left to right, which can provide information similar to positional encoding.

ashu5644 commented 4 years ago

How many sentences are in the Switchboard LM corpus that you used for Transformer LM training?

kazuki-irie commented 4 years ago

About 2.5 million sentences in the training data.

ashu5644 commented 4 years ago

OK, thanks.