google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

`Whole Word Masking` does not work in this code at 'create_pretraining_data.py' #910

Open alberto-hong opened 5 years ago

alberto-hong commented 5 years ago

Hi, there :)

First of all, thanks for sharing such a nice BERT codebase that beginners can easily follow.

I am opening this issue to report that whole word masking does not work even though I set it to True.

As I read the code at https://github.com/google-research/bert/blob/cc7051dc592802f501e8a6f71f8fb3cf9de95dc9/create_pretraining_data.py#L388, the indexes in index_set are not processed as designed.

I think it can be easily solved by replacing lines 388~405 of create_pretraining_data.py with:

    masked_token = None  # 80% of the time, replace with "[MASK]"
    is_ori_token = False  # 10% of the time, keep original and 10% of the time, replace with random word
    if rng.random() < 0.8:
        masked_token = "[MASK]"
    else:
        if rng.random() < 0.5:
            # 10% of the time, keep original
            is_ori_token = True

    for index in index_set:
        covered_indexes.add(index)

        if masked_token is not None:
            if is_ori_token:
                # 10% of the time, keep original
                masked_token = tokens[index]
            else:
                # 10% of the time, replace with random word
                masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
        output_tokens[index] = masked_token
        masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

If it is wrong, please let me know. 👍
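
To make the symptom concrete, here is a small, self-contained simulation (a toy sketch with made-up piece names, not the repository code) showing how drawing the 80/10/10 decision separately for each piece can leave the pieces of a single whole word with mixed treatments:

    import random

    # Toy sketch: one whole word split into three WordPiece pieces, with the
    # 80/10/10 decision drawn independently for each piece, as the current
    # loop in create_pretraining_data.py does.
    pieces = ["un", "##believ", "##able"]

    def treatments_per_piece(rng):
        result = []
        for _ in pieces:
            if rng.random() < 0.8:
                result.append("mask")    # replace with [MASK]
            elif rng.random() < 0.5:
                result.append("keep")    # keep the original piece
            else:
                result.append("random")  # replace with a random vocab word
        return result

    # Scan a few seeds to find a word whose pieces get mixed treatments.
    for seed in range(50):
        t = treatments_per_piece(random.Random(seed))
        if len(set(t)) > 1:
            print("seed", seed, "->", t)  # e.g. ['mask', 'keep', 'mask']
            break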

zhenjingleo commented 4 years ago

Please refer to the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Section 3.3.1 it says: Although this does allow us to obtain a bidirectional pre-trained model, there are two downsides to such an approach. The first is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning. To mitigate this, we do not always replace "masked" words with the actual [MASK] token. Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my dog is hairy it chooses hairy. It then performs the following procedure:

Rather than always replacing the chosen words with [MASK], the data generator will do the following:

  1. 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
  2. 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
  3. 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
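
For reference, the nested draw used in the code (first `rng.random() < 0.8`, then `rng.random() < 0.5` in the remaining 20%) does reproduce this 80/10/10 split. A quick standalone check (just a sketch, not the repository code):

    import collections
    import random

    # Count how often each branch of the nested draw is taken.
    rng = random.Random(0)
    counts = collections.Counter()
    trials = 100000
    for _ in range(trials):
        if rng.random() < 0.8:
            counts["[MASK]"] += 1
        elif rng.random() < 0.5:          # second draw happens only 20% of the time
            counts["keep original"] += 1  # 0.2 * 0.5 = 0.1
        else:
            counts["random word"] += 1    # 0.2 * 0.5 = 0.1
    print({k: round(v / trials, 3) for k, v in counts.items()})
    # roughly {'[MASK]': 0.8, 'keep original': 0.1, 'random word': 0.1}
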
dlhmy commented 3 years ago


Regarding the fix proposed above: you should replace the line `if masked_token is not None:` with `if masked_token is None:`.
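
Putting the two suggestions together: draw the 80/10/10 decision once per whole word, then resolve the replacement token per piece inside the loop. A sketch of one reading of the proposed change (not the upstream code; the helper variables `use_mask` and `keep_original` are mine, and the token is still picked per index so that "keep original" keeps each piece's own token):

    # Decide once per whole word which treatment applies.
    use_mask = rng.random() < 0.8                          # 80%: [MASK] every piece
    keep_original = (not use_mask) and rng.random() < 0.5  # 10% keep, 10% random

    for index in index_set:
        covered_indexes.add(index)
        if use_mask:
            masked_token = "[MASK]"
        elif keep_original:
            masked_token = tokens[index]  # keep this piece's own token
        else:
            masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
        output_tokens[index] = masked_token
        masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))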

fjzhangcr commented 1 year ago

I have the same observation as you. My solution is to swap the `rng.random() < 0.8` check and the `for index in index_set` loop, so that the probability is drawn only once, at the whole-word level. If `rng.random() < 0.8`, every word piece should be masked, so the loop replaces each index in `index_set` with [MASK]; the other two branches are handled the same way. Just switch the order of the `rng.random() < 0.8` check and the `for index in index_set` loop. Have fun! The original code:

    """
    for index in index_set:
      # if len(index_set)>0:
      #     logging.info("%s %s" % (index_set,tokens[index]))
      covered_indexes.add(index)

      masked_token = None
      # 80% of the time, replace with [MASK]
      if rng.random() < 0.8:
        masked_token = "[MASK]"
      else:
        # 10% of the time, keep original
        if rng.random() < 0.5:
          masked_token = tokens[index]
        # 10% of the time, replace with random word
        else:
          masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
      output_tokens[index] = masked_token
      # if len(index_set)>1:
      #     logging.info("%s %s" % (tokens[index],output_tokens[index]))
      masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
      """

My code:


    # 80% of the time, replace every piece of the word with [MASK]
    if rng.random() < 0.8:
        for index in index_set:
            covered_indexes.add(index)
            masked_token = "[MASK]"
            output_tokens[index] = masked_token
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
    # 10% of the time, keep original
    elif rng.random() < 0.5:
        for index in index_set:
            covered_indexes.add(index)
            masked_token = tokens[index]
            output_tokens[index] = masked_token
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
    # 10% of the time, replace with random word
    else:
        for index in index_set:
            covered_indexes.add(index)
            masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
            output_tokens[index] = masked_token
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
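
With the draw moved outside the loop, every piece of a word shares one treatment, and the second draw in the `elif` still yields 0.2 * 0.5 = 0.1 for each of the keep and random branches. A small standalone comparison (toy helpers, not the patched file) of how often a 3-piece word ends up with mixed treatments under the per-piece draw versus the per-word draw:

    import random

    # Draw one 80/10/10 treatment.
    def draw(rng):
        if rng.random() < 0.8:
            return "mask"
        elif rng.random() < 0.5:
            return "keep"
        return "random"

    def per_piece(num_pieces, rng):
        return [draw(rng) for _ in range(num_pieces)]  # old behaviour: one draw per piece

    def per_word(num_pieces, rng):
        return [draw(rng)] * num_pieces                # patched behaviour: one draw per word

    rng = random.Random(0)
    trials = 10000
    mixed_old = sum(len(set(per_piece(3, rng))) > 1 for _ in range(trials))
    mixed_new = sum(len(set(per_word(3, rng))) > 1 for _ in range(trials))
    print("mixed treatments with per-piece draws:", mixed_old)  # roughly half the words
    print("mixed treatments with per-word draw:  ", mixed_new)  # always 0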