Open · alberto-hong opened this issue 5 years ago
Please refer to the paper *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. In Section 3.3.1 it says:

> Although this does allow us to obtain a bidirectional pre-trained model, there are two downsides to such an approach. The first is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning. To mitigate this, we do not always replace "masked" words with the actual [MASK] token. Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence "my dog is hairy" it chooses "hairy". It then performs the following procedure:
> - 80% of the time: replace the word with the [MASK] token
> - 10% of the time: replace the word with a random word
> - 10% of the time: keep the word unchanged
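For reference, a minimal standalone sketch of that rule for a single chosen token could look like the following (hypothetical helper name and toy vocabulary, not the repo's code; it uses one random draw instead of the script's two, which gives the same 80/10/10 split):

```python
import random

def choose_masked_token(original_token, vocab_words, rng):
    """One equivalent way to realize the paper's 80/10/10 rule with a single draw."""
    p = rng.random()
    if p < 0.8:
        return "[MASK]"  # 80%: replace with the [MASK] token
    elif p < 0.9:
        # 10%: replace with a random word from the vocabulary
        return vocab_words[rng.randint(0, len(vocab_words) - 1)]
    else:
        return original_token  # 10%: keep the original word

rng = random.Random(12345)
print(choose_masked_token("hairy", ["apple", "dog", "blue"], rng))
```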
Hi, there :)
First of all, thanks for sharing such a nice implementation of BERT that is easy for beginners to follow.

I made this issue to share that whole word masking does not work even though I set it to True. As I read the code at https://github.com/google-research/bert/blob/cc7051dc592802f501e8a6f71f8fb3cf9de95dc9/create_pretraining_data.py#L388, the indexes in index_set are not processed as designed (a toy simulation at the end of this post illustrates the mismatch). I think it can be solved by replacing lines 388~405 of create_pretraining_data.py with:

```python
masked_token = None   # 80% of the time, replace with "[MASK]"
is_ori_token = False  # 10% of the time, keep original; 10% of the time, replace with a random word
if rng.random() < 0.8:
  masked_token = "[MASK]"
else:
  # 10% of the time, keep original
  if rng.random() < 0.5:
    is_ori_token = True

for index in index_set:
  covered_indexes.add(index)
  if masked_token is not None:
    if is_ori_token:
      # 10% of the time, keep original
      masked_token = tokens[index]
    else:
      # 10% of the time, replace with random word
      masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
  output_tokens[index] = masked_token
  masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
```
If it is wrong, please let me know. 👍
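To make the mismatch concrete, here is a toy, self-contained simulation (hypothetical indexes and helper name, not the repo's code). With a fresh 80/10/10 draw per WordPiece, the pieces of a single whole word end up with different treatments about a third of the time:

```python
import random

# "hairy" tokenizes into two WordPieces, "hai" and "##ry", forming one whole word.
index_set = [3, 4]  # positions of the two pieces within the token sequence
rng = random.Random(0)

def per_piece_actions(rng):
    """Current behavior: a separate 80/10/10 draw for every WordPiece."""
    actions = []
    for _ in index_set:
        if rng.random() < 0.8:
            actions.append("mask")
        elif rng.random() < 0.5:
            actions.append("keep")
        else:
            actions.append("random")
    return actions

trials = 100_000
mismatches = sum(len(set(per_piece_actions(rng))) > 1 for _ in range(trials))
# Expected mismatch rate: 1 - (0.8**2 + 0.1**2 + 0.1**2) = 0.34
print(mismatches / trials)
```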
You should replace the line `if masked_token is not None:` with `if masked_token is None:`.
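For clarity, here is the loop from the proposed patch with that one-line change applied:

```python
for index in index_set:
  covered_indexes.add(index)
  if masked_token is None:  # 20% branch: keep the original or use a random word
    if is_ori_token:
      # 10% of the time, keep original
      masked_token = tokens[index]
    else:
      # 10% of the time, replace with random word
      masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
  output_tokens[index] = masked_token
  masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
```

One thing to watch: `masked_token` is reassigned inside the loop, so for a multi-piece word the `is None` branch only fires on the first piece and the remaining pieces reuse the first piece's value (wrong in the keep-original case). The reordering in the next comment avoids this by deciding once per whole word before the loop.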
I have the same observation as you. My solution is to switch the `rng.random() < 0.8` check and the `for index in index_set` loop; the idea is that we compute the probability only once, at the whole-word level. That is, if `rng.random() < 0.8`, all of the word's pieces should be masked, so I use the for loop to replace every index in index_set with [MASK], and the other branches work the same way. Just switch the order of the `rng.random() < 0.8` check and the `for index in index_set` loop. Have fun!

The original code:
"""
for index in index_set:
# if len(index_set)>0:
# logging.info("%s %s" % (index_set,tokens[index]))
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
output_tokens[index] = masked_token
# if len(index_set)>1:
# logging.info("%s %s" % (tokens[index],output_tokens[index]))
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
"""
My code:

```python
# 80% of the time, replace every piece of the word with [MASK]
if rng.random() < 0.8:
  for index in index_set:
    covered_indexes.add(index)
    masked_token = "[MASK]"
    output_tokens[index] = masked_token
    masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
# 10% of the time, keep original
elif rng.random() < 0.5:
  for index in index_set:
    covered_indexes.add(index)
    masked_token = tokens[index]
    output_tokens[index] = masked_token
    masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
# 10% of the time, replace with random word
else:
  for index in index_set:
    covered_indexes.add(index)
    masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
    output_tokens[index] = masked_token
    masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
```
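As a quick sanity check (toy inputs, with `MaskedLmInstance` stubbed as a namedtuple; not part of the repo), running the reordered version many times confirms that all pieces of a word always receive the same treatment:

```python
import random
from collections import namedtuple

MaskedLmInstance = namedtuple("MaskedLmInstance", ["index", "label"])

tokens = ["my", "dog", "is", "hai", "##ry"]
index_set = [3, 4]  # the two pieces of the whole word "hairy"
vocab_words = ["apple", "blue", "run", "cat"]
rng = random.Random(0)

for _ in range(10_000):
    output_tokens = list(tokens)
    covered_indexes = set()
    masked_lms = []

    if rng.random() < 0.8:
        for index in index_set:
            covered_indexes.add(index)
            output_tokens[index] = "[MASK]"
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
    # 10% of the time, keep original
    elif rng.random() < 0.5:
        for index in index_set:
            covered_indexes.add(index)
            output_tokens[index] = tokens[index]
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
    # 10% of the time, replace with random word
    else:
        for index in index_set:
            covered_indexes.add(index)
            output_tokens[index] = vocab_words[rng.randint(0, len(vocab_words) - 1)]
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

    # The pieces must be masked together or left unmasked together, never mixed.
    masked_flags = {output_tokens[i] == "[MASK]" for i in index_set}
    assert len(masked_flags) == 1

print("ok: the whole word is always treated consistently")
```

Because the 80/10/10 decision is sampled once per whole word, its pieces can never diverge, which is exactly what whole word masking is supposed to guarantee.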