jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

NAACL '19 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding #14

Closed jasperzhong closed 3 years ago

jasperzhong commented 4 years ago

https://arxiv.org/pdf/1810.04805.pdf

Revisiting the classic to learn something new...

jasperzhong commented 4 years ago

In hindsight, BERT's place in NLP really is the equivalent of ResNet's place in CV. Endless variants and tweaks keep appearing, so it is worth going back to the source and revisiting the classic!

The first contribution is the model itself: it showed that the Transformer really delivers. Self-attention plays roughly the role that convolution plays in CV, and it can capture the bidirectional relation between every pair of words in the input. The BERT model is actually very clean and simple, which reminds me of ResNet.

The second contribution is an input format shared across tasks. Every input starts with [CLS]; the input is one sequence, and a sequence can be a single sentence or a pair of sentences. Two things distinguish the two sentences: 1) they are separated by [SEP]; 2) segment embeddings. The input embedding is the sum of three embeddings (word embedding, segment embedding, position embedding).
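
A toy sketch (illustrative only) of how a sentence pair is packed into one input sequence; token strings are shown instead of real WordPiece ids, and the example sentences are the ones from the paper:

tokens_a = ["my", "dog", "is", "cute"]
tokens_b = ["he", "likes", "play", "##ing"]

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

print(tokens)       # ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'play', '##ing', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]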

The third contribution is the pre-training tasks. Previous work used a plain LM objective, whose biggest drawback is that it can only train in one direction, or at best shallowly concatenate the vectors from the two directions (shallow concatenation). BERT proposes MLM and NSP. MLM is a cloze task: mask out some words and predict them. One problem is that [MASK] never appears at fine-tuning/prediction time but appears all over the place during training, so there is a trick: when a token is selected for masking, 80% of the time it is replaced by [MASK], 10% of the time by a random word, and 10% of the time it is left unchanged. The masking ratios were also ablated; see Appendix C.2. The second task is NSP, which mainly helps tasks like QA and NLI that need relations between sentences; concretely, the last layer's hidden state of the first token ([CLS]) is fed into a classifier. Both losses use CrossEntropy, and the total loss is their sum.
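
A minimal sketch of the 80/10/10 masking rule (my own illustrative code operating on token strings rather than ids; not the official data pipeline):

import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    # Each non-special token is selected with probability 15%.
    # A selected token becomes [MASK] 80% of the time, a random word 10% of the time,
    # and is left unchanged 10% of the time. Labels keep the original token at selected
    # positions and -1 elsewhere so the loss can skip them.
    output, labels = list(tokens), [-1] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= select_prob:
            continue
        labels[i] = tok  # real code stores the word id here
        r = random.random()
        if r < 0.8:
            output[i] = "[MASK]"
        elif r < 0.9:
            output[i] = random.choice(vocab)
        # else: keep the original token
    return output, labels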

About pre-training: the pre-training dataset is BooksCorpus + Wikipedia, 3,300M words in total. With batch size 256 it takes 1M steps. The optimizer is Adam without bias correction, lr=1e-4, warmup ratio 0.01. The model uses gelu rather than relu as its activation. Training bert-base/bert-large takes 4 days on 16/64 TPU chips. Training is usually split into two phases: the first phase uses max_seq_len = 128 for 90% of the steps, the second phase uses max_seq_len = 512 for the remaining 10%.
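
A sketch of the learning-rate schedule implied by these numbers (linear warmup over the first 1% of steps up to lr=1e-4, then linear decay as the paper describes; the exact schedule in a given implementation may differ slightly):

def lr_at(step, total_steps=1_000_000, base_lr=1e-4, warmup_ratio=0.01):
    warmup_steps = int(total_steps * warmup_ratio)  # 10k of 1M steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps        # linear warmup
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)  # linear decay to 0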

As for why it needs to be trained for this long, they also did an ablation on that.

About fine-tuning: for SQuAD, for example, 2-3 epochs are enough, which usually finishes in ten-odd minutes. The change to the model is tiny: just attach a Linear layer on top that outputs start and end logits, each of shape [batch_size, seq_len]. A minimal sketch:
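
The sketch below assumes `bert` is any module that returns the encoder output [batch_size, seq_len, hidden_size]; the names are illustrative, not the exact NVIDIA/HuggingFace classes:

import torch.nn as nn

class BertForQuestionAnswering(nn.Module):
    def __init__(self, bert, hidden_size=768):
        super().__init__()
        self.bert = bert
        self.qa_outputs = nn.Linear(hidden_size, 2)  # start and end logits share one Linear

    def forward(self, input_ids, segment_ids, attention_mask):
        sequence_output = self.bert(input_ids, segment_ids, attention_mask)  # [B, T, H]
        logits = self.qa_outputs(sequence_output)                            # [B, T, 2]
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)              # each [B, T]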

(training loss curve omitted) From the figure, it really does take 1M steps to converge...

I read through a code implementation (NVIDIA BERT) and learned quite a lot from it.

The full BERT-for-pretraining model splits into two parts: the bert model + the cls heads. The bert model in turn consists of embeddings + encoders + pooler; cls consists of the MLM head and the NSP head.

Let's start from the inputs. There are generally three:

  1. input_ids [batch_size, seq_len]: the word ids. seq_len is fixed during training, e.g. 128 or 512; shorter sequences are padded with 0 and longer ones are truncated. The sequence starts with [CLS] (only one is needed), and two sentences are separated by [SEP]. At inference time seq_len is not fixed.
  2. segment_ids [batch_size, seq_len]: values in {0, 1}, marking which tokens belong to the first sentence and which to the second.
  3. attention_mask (optional): also a 0/1 vector, with padded positions set to 0 and everything else to 1. Concretely, the mask is applied by adding -10000 to the positions of the attention matrix that should be masked, so those positions become negligible after softmax (see the sketch right after this list).
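
A small sketch of how the 0/1 mask becomes the additive -10000 mask (this mirrors the common NVIDIA/HuggingFace pattern; the shapes are the key point):

import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0]])           # [batch_size, seq_len]
extended_mask = attention_mask[:, None, None, :].float()   # [B, 1, 1, T], broadcasts over heads and query positions
extended_mask = (1.0 - extended_mask) * -10000.0
# later, inside self-attention: attention_scores = attention_scores + extended_mask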

The word embedding weight has shape [vocab_size, hidden_size]; vocab_size is usually 30522 and hidden_size is 768 or 1024. The vocab file even contains some Japanese? I suspect it leaked in from Japanese words inside Wikipedia entries.

input_ids and segment_ids are fed into the embedding part to produce embedding_output of shape [batch_size, seq_len, hidden_size], which is then sent to the encoders together with the attention mask.
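
A sketch of the embedding stage, assuming default bert-base sizes (the real module also applies LayerNorm and dropout after the sum):

import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768, max_position=512, type_vocab_size=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)

    def forward(self, input_ids, segment_ids):
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)  # [1, T]
        return (self.word_embeddings(input_ids)
                + self.position_embeddings(position_ids)
                + self.token_type_embeddings(segment_ids))                           # [B, T, H]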

The encoder is N transformer encoder layers stacked on top of each other; bert-base, for example, has 12 of them. Each layer's output is [batch_size, seq_len, hidden_size].
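
A sketch of the stacking, where `layer` stands for one full transformer block (self-attention + FFN with residuals and LayerNorm), each mapping [B, T, H] -> [B, T, H]:

import copy
import torch.nn as nn

class BertEncoder(nn.Module):
    def __init__(self, layer, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_layers)])

    def forward(self, hidden_states, attention_mask):
        for layer in self.layers:
            hidden_states = layer(hidden_states, attention_mask)
        return hidden_states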

After the encoder, the pooler turns [batch_size, seq_len, hidden_size] into [batch_size, hidden_size], which is what the NSP head consumes. The NSP head needs a fixed-size input, so the seq_len dimension has to be eliminated. The solution is simple: just take the first token, which is [CLS]; the assumption is that it learns a representation of the whole sequence.
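
A sketch of the pooler (in the real code there is also a Linear + tanh on top of the [CLS] hidden state, not just the indexing):

import torch.nn as nn

class BertPooler(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):      # [B, T, H]
        first_token = hidden_states[:, 0]   # [B, H], the [CLS] position
        return self.activation(self.dense(first_token))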

The NSP cls takes the pooled output, runs it through a Linear layer, and produces two logits for binary classification.

The MLM cls has to map hidden states back to word predictions; here weight sharing is used: it directly reuses the word embedding weight.
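
A sketch of both cls heads; the MLM decoder is tied to the word embedding weight passed in (the real MLM head additionally has a Linear + gelu + LayerNorm transform and a separate output bias):

import torch.nn as nn

class BertPreTrainingHeads(nn.Module):
    def __init__(self, word_embedding_weight, hidden_size=768):
        super().__init__()
        vocab_size = word_embedding_weight.size(0)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.decoder.weight = word_embedding_weight        # weight sharing with the embedding
        self.seq_relationship = nn.Linear(hidden_size, 2)  # NSP head

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.decoder(sequence_output)               # [B, T, vocab_size]
        seq_relationship_score = self.seq_relationship(pooled_output)   # [B, 2]
        return prediction_scores, seq_relationship_score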

So there are two outputs:

  1. prediction_scores: the MLM head output, [batch_size, seq_len, vocab_size].
  2. seq_relationship_score: the NSP head output, [batch_size, 2].

The MLM label is a [batch_size, seq_len] tensor: non-masked positions are set to -1 (meaning they are skipped when computing the loss), and masked positions hold the true word id. The NSP label is 0 or 1. Both losses use CrossEntropy.
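
A sketch of the total pre-training loss matching the description above (ignore_index=-1 skips the non-masked positions):

import torch.nn as nn

loss_fct = nn.CrossEntropyLoss(ignore_index=-1)

def pretraining_loss(prediction_scores, seq_relationship_score, mlm_labels, nsp_labels):
    vocab_size = prediction_scores.size(-1)
    mlm_loss = loss_fct(prediction_scores.reshape(-1, vocab_size), mlm_labels.reshape(-1))
    nsp_loss = loss_fct(seq_relationship_score.reshape(-1, 2), nsp_labels.reshape(-1))
    return mlm_loss + nsp_loss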

jasperzhong commented 4 years ago

I missed an even more important contribution: the paradigm of unsupervised pre-training followed by supervised fine-tuning, which changed how NLP is done. Earlier pre-training was feature-based, e.g. ELMo.

jasperzhong commented 4 years ago

By the way, has anyone tried masking out some pixels in an image and then predicting them? That way images could be trained without labels too. I don't know how unsupervised learning on images is done these days.

yzh119 commented 4 years ago

@vycezhong, plenty of people must have thought about applying this to vision. Directly masking pixels probably doesn't work, for a simple reason: if you randomly mask some pixels of a normal image, they can be recovered just by averaging their context.

Many current approaches mask out ROIs (regions of interest) instead. There was a lot of related work last year, e.g. LXMERT and VL-BERT, but those are all multimodal. I don't think there is any prominent work yet that trains a BERT for vision directly with this objective; the hotter topic this year is self-supervised learning.

There is also work trying to replace CNNs with self-attention: https://papers.nips.cc/paper/8302-stand-alone-self-attention-in-vision-models.pdf

jasperzhong commented 3 years ago

To commemorate my first time pre-training BERT:

(screenshots of the pre-training run omitted)

jasperzhong commented 3 years ago

https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/modeling.py#L364

The self-attention computation. Parameters: three [H, H] projection matrices (query/key/value). Input: hidden_states [B, T, H]. Steps:

  1. projection: [B, T, H] @ [H, H] -> [B, T, H]
  2. reshape and permute: [B, T, H] -> [B, h, T, H/h] for query/value, [B, h, H/h, T] for key
  3. compute attention scores: [B, h, T, H/h] @ [B, h, H/h, T] -> [B, h, T, T]
  4. add the (broadcast) attention mask: [B, h, T, T] -> [B, h, T, T]
  5. softmax and dropout: [B, h, T, T] (dim = -1)
  6. weight the values: [B, h, T, T] @ [B, h, T, H/h] -> [B, h, T, H/h]
  7. permute and reshape: [B, h, T, H/h] -> [B, T, H]

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super(BertSelfAttention, self).__init__()
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads))
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = torch.reshape(x, new_x_shape)
        return x.permute(0, 2, 1, 3)

    def transpose_key_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = torch.reshape(x, new_x_shape)
        return x.permute(0, 2, 3, 1)

    def forward(self, hidden_states, attention_mask):
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_key_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer)
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask (precomputed for all layers in BertModel's forward() function).
        attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = F.softmax(attention_scores, dim=-1)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = torch.reshape(context_layer, new_context_layer_shape)
        return context_layer
jasperzhong commented 2 years ago

https://github.com/google-research/bert#what-is-bert

This is really well written. I finally understand what contextual vs. context-free means. Word embeddings are all context-free: a word can mean different things in different sentences, but there is only one embedding per word, so its representation is always the same. Contextual models are different: even for the same word, different input sentences produce different representations!

Pre-trained representations can also either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single "word embedding" representation for each word in the vocabulary, so bank would have the same representation in bank deposit and river bank. Contextual models instead generate a representation of each word that is based on the other words in the sentence.

BERT was built upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit — but crucially these models are all unidirectional or shallowly bidirectional. This means that each word is only contextualized using the words to its left (or right). For example, in the sentence I made a bank deposit the unidirectional representation of bank is only based on I made a but not deposit. Some previous work does combine the representations from separate left-context and right-context models, but only in a "shallow" manner. BERT represents "bank" using both its left and right context — I made a ... deposit — starting from the very bottom of a deep neural network, so it is deeply bidirectional.
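
A quick way to see "contextual" in practice (a sketch assuming the HuggingFace transformers library and the bert-base-uncased checkpoint): the same word "bank" gets noticeably different vectors in the two sentences.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # [seq_len, hidden_size]
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = embed_word("I made a bank deposit this morning.", "bank")
v2 = embed_word("We sat on the river bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0 -> different representations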