google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How to understand "we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings." in Appendix A.2 (Pre-training Procedure) of the paper? #969

Open pogevip opened 4 years ago

pogevip commented 4 years ago


How should I understand the sentence "we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings." from Appendix A.2 (Pre-training Procedure) of the paper?

I can't figure out what this sentence means. Thank you for your answers!

real-brilliant commented 4 years ago

Here is how it works: the positional embedding is tied to each position in the sentence. In BERT it is a randomly initialized, trainable vector with the same dimension as the word embedding. So if you only ever train with sequences of length 128, at inference time you can only feed in sequences of at most 128 tokens, because the positional embeddings for longer positions were never trained and the model has no way of knowing the relative position of later tokens with respect to earlier ones. That is why the pos_embedding needs additional training on longer sequences. Of course you can adjust the length as you see fit, e.g. 256, 1024, and so on.
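A minimal NumPy sketch (not the repository's actual code; `lookup_position_embeddings` and the variable names are made up for illustration) of why a learned position embedding only covers positions it was trained on:

```python
import numpy as np

# BERT-style learned position embedding: a trainable lookup table
# with a fixed number of rows (max_position_embeddings).
max_position_embeddings = 512
hidden_size = 768

# Randomly initialized, then updated by gradient descent during pre-training.
position_embedding_table = np.random.normal(
    scale=0.02, size=(max_position_embeddings, hidden_size))

def lookup_position_embeddings(seq_length):
    # Positions 0 .. seq_length-1 are just row indices into the table.
    if seq_length > max_position_embeddings:
        raise ValueError("sequence longer than the embedding table")
    return position_embedding_table[:seq_length]

# If pre-training only ever used seq_length=128, rows 128..511 exist but
# are never updated, so inference on longer inputs would use untrained
# (still random) vectors. Hence the final 10% of steps at length 512.
emb_128 = lookup_position_embeddings(128)   # rows 0..127: trained
emb_512 = lookup_position_embeddings(512)   # rows 128..511: need training too
```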

In the original Transformer, the positional encoding is a fixed value generated from sinusoidal functions rather than a trainable pos_embedding, so at inference time you can feed in sentences longer than those seen during training.
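For contrast, a sketch of the fixed sinusoidal encoding from "Attention Is All You Need" (the function name is made up; only the formula comes from that paper):

```python
import numpy as np

def sinusoidal_position_encoding(seq_length, hidden_size):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / hidden_size))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / hidden_size))
    positions = np.arange(seq_length)[:, None]           # (seq, 1)
    dims = np.arange(0, hidden_size, 2)[None, :]          # (1, hidden/2)
    angles = positions / np.power(10000.0, dims / hidden_size)
    encoding = np.zeros((seq_length, hidden_size))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

# Because this is a formula rather than a lookup table, it can be evaluated
# for any position, so inference sequences may exceed the training length.
pe_short = sinusoidal_position_encoding(128, 768)
pe_long = sinusoidal_position_encoding(2048, 768)   # no table to outgrow
```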