graykode / nlp-tutorial

Natural Language Processing Tutorial for Deep Learning Researchers

Some mistake in Transformer Position Encoding & BERT #22

Closed · graykode closed this issue 5 years ago

graykode commented 5 years ago

1. Mistake in the Transformer's position encoding

# Padding Should be Zero
src_vocab = {'P' : 0, 'ich' : 1, 'mochte' : 2, 'ein' : 3, 'bier' : 4}
src_vocab_size = len(src_vocab)

tgt_vocab = {'P' : 0, 'i' : 1, 'want' : 2, 'a' : 3, 'beer' : 4, 'S' : 5, 'E' : 6}
number_dict = {i: w for i, w in enumerate(tgt_vocab)}
tgt_vocab_size = len(tgt_vocab)

I have made the code clearer. There was a mistake in the Transformer's position encoding: feeding torch.LongTensor([[1,2,3,4,5]]) to the position embedding mixed up the embedding indices, since index 0 should be reserved for padding.

So I fixed the shape of get_sinusoid_encoding_table accordingly. In the Encoder, self.pos_emb(torch.LongTensor([[5,1,2,3,4]])) is correct for 'ich mochte ein bier P', and in the Decoder, self.pos_emb(torch.LongTensor([[5,1,2,3,4]])) is correct for 'S i want a beer'.
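
A minimal sketch of what I mean, in the style of this tutorial's code (the helper name get_sinusoid_encoding_table and d_model = 512 follow the tutorial; the exact table size here is an assumption). The table gets src_len + 1 rows so the position indices above stay in range:

import numpy as np
import torch
import torch.nn as nn

def get_sinusoid_encoding_table(n_position, d_model):
    # standard sinusoidal table: sin on even dimensions, cos on odd dimensions
    def cal_angle(position, hid_idx):
        return position / np.power(10000, 2 * (hid_idx // 2) / d_model)
    table = np.array([[cal_angle(pos, i) for i in range(d_model)] for pos in range(n_position)])
    table[:, 0::2] = np.sin(table[:, 0::2])
    table[:, 1::2] = np.cos(table[:, 1::2])
    return torch.FloatTensor(table)

src_len, d_model = 5, 512
pos_emb = nn.Embedding.from_pretrained(
    get_sinusoid_encoding_table(src_len + 1, d_model), freeze=True)
enc_positions = pos_emb(torch.LongTensor([[5, 1, 2, 3, 4]]))  # 'ich mochte ein bier P'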

2. BERT is too heavy as a tutorial

In the original paper, maxlen is 512 and n_layers (the number of layers) is 12, but that is too heavy to run as a tutorial, so I reduced the parameters as below.

# BERT Parameters
maxlen = 30
batch_size = 6
max_pred = 5 # max tokens of prediction
n_layers = 6
n_heads = 12
d_model = 768
d_ff = 768*4 # 4*d_model, FeedForward dimension
d_k = d_v = 64  # dimension of K(=Q), V
n_segments = 2

Also, as in other BERT implementation repositories, when preprocessing for masking, the [CLS], [SEP], and [PAD] tokens should not be replaced with [MASK].

cand_maked_pos = [i for i, token in enumerate(input_ids)]  # wrong: this also lets [CLS] and [SEP] be chosen for masking

https://github.com/dhlee347/pytorchic-bert/blob/master/pretrain.py#L132 does this correctly, so I fixed my code accordingly (sketch below).
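
For reference, a short sketch of the corrected candidate selection in the style of this tutorial's BERT code (word_dict holding the special-token ids is an assumption borrowed from the tutorial):

# only real word positions are candidates for masking;
# [CLS] and [SEP] (and the zero padding added later) are excluded
cand_maked_pos = [i for i, token in enumerate(input_ids)
                  if token != word_dict['[CLS]'] and token != word_dict['[SEP]']]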

Then, I added a SEGMENT MASK so that positions where the token is zero padding are masked out. This is a very important fix.
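
A minimal sketch of such a padding mask, following the get_attn_pad_mask helper used elsewhere in this tutorial (the exact signature is an assumption). Positions whose key token id is 0 ([PAD]) are marked True and later filled with a large negative value before the attention softmax:

import torch

def get_attn_pad_mask(seq_q, seq_k):
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # True where the key token is the zero-padding index; broadcast over the query length
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)           # [batch_size, 1, len_k]
    return pad_attn_mask.expand(batch_size, len_q, len_k)   # [batch_size, len_q, len_k]

# usage: scores.masked_fill_(attn_mask, -1e9) inside scaled dot-product attention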