google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How to understand "we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings." in Appendix A.2 (Pre-training Procedure) of the paper? #969

Open pogevip opened 4 years ago

pogevip commented 4 years ago


How should I understand the sentence "we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings." from Appendix A.2 (Pre-training Procedure) of the paper?

I can't figure out what this sentence means. Thank you for your answers!

real-brilliant commented 4 years ago

Here is how it works: the positional embedding is tied to each position in the sentence. In BERT it is a randomly initialized, trainable vector with the same dimension as the word embedding. So if you only ever train with sequences of length 128, at inference time you can only feed in sequences of at most 128 tokens, because the positional embeddings for longer positions were never trained and the model has no way of knowing the relative position of later tokens with respect to earlier ones. That is why the pos_embedding needs additional training on longer sequences. Of course you can adjust the length as you see fit, e.g. 256, 1024, and so on.
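A minimal NumPy sketch (not the repository's actual code; `lookup_position_embeddings` and the variable names are made up for illustration) of why a learned position embedding only covers positions it was trained on:

```python
import numpy as np

# BERT-style learned position embedding: a trainable lookup table
# with a fixed number of rows (max_position_embeddings).
max_position_embeddings = 512
hidden_size = 768

# Randomly initialized, then updated by gradient descent during pre-training.
position_embedding_table = np.random.normal(
    scale=0.02, size=(max_position_embeddings, hidden_size))

def lookup_position_embeddings(seq_length):
    # Positions 0 .. seq_length-1 are just row indices into the table.
    if seq_length > max_position_embeddings:
        raise ValueError("sequence longer than the embedding table")
    return position_embedding_table[:seq_length]

# If pre-training only ever used seq_length=128, rows 128..511 exist but
# are never updated, so inference on longer inputs would use untrained
# (still random) vectors. Hence the final 10% of steps at length 512.
emb_128 = lookup_position_embeddings(128)   # rows 0..127: trained
emb_512 = lookup_position_embeddings(512)   # rows 128..511: need training too
```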

In the original Transformer, the positional encoding is a fixed value generated from sinusoidal functions rather than a trainable pos_embedding, so at inference time you can feed in sentences longer than those seen during training.
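For contrast, a sketch of the fixed sinusoidal encoding from "Attention Is All You Need" (the function name is made up; only the formula comes from that paper):

```python
import numpy as np

def sinusoidal_position_encoding(seq_length, hidden_size):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / hidden_size))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / hidden_size))
    positions = np.arange(seq_length)[:, None]           # (seq, 1)
    dims = np.arange(0, hidden_size, 2)[None, :]          # (1, hidden/2)
    angles = positions / np.power(10000.0, dims / hidden_size)
    encoding = np.zeros((seq_length, hidden_size))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

# Because this is a formula rather than a lookup table, it can be evaluated
# for any position, so inference sequences may exceed the training length.
pe_short = sinusoidal_position_encoding(128, 768)
pe_long = sinusoidal_position_encoding(2048, 768)   # no table to outgrow
```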