huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Using BERT for predicting masked token #942

Closed. chinmay5 closed this issue 5 years ago

chinmay5 commented 5 years ago

I have a task where I want to obtain better word embeddings for food ingredients. Since I am a bit new to NLP, I also have some fundamental doubts that I would love to be corrected on.

  1. I want to get word embeddings, so I started with Word2Vec. Now I want a more contextual representation, so I am using BERT.
  2. There is no supervised data, so I want to learn embeddings with a masked-token training procedure similar to the one followed in the BERT paper itself.
  3. I have around 1000 ingredients, and each recipe can consist of multiple ingredients.
  4. Since BERT works well when there is only one masked word, I would ideally copy the recipe text multiple times and replace the ingredients with "[MASK]" one by one. So if I have 1 recipe with 5 ingredients, I generate 5 masked sentences (will this lead to overfitting?); see the sketch below.
  5. How do I handle the case where an ingredient is not part of the BERT vocabulary? Can something be done in that case?
  6. Is there some reference where I can start?

I would really appreciate it if someone could point out any issues with my assumptions above.
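For point 4, this is roughly what I have in mind (the recipes and ingredients below are just toy examples):

```python
# Toy data: each recipe is a short piece of instruction text.
recipes = ["mix the flour with sugar and butter"]
ingredients = ["flour", "sugar", "butter"]

masked_examples = []
for recipe in recipes:
    for ingredient in ingredients:
        if ingredient in recipe.split():
            # One masked copy of the recipe per ingredient it contains.
            masked = recipe.replace(ingredient, "[MASK]")
            masked_examples.append((masked, ingredient))

for text, target in masked_examples:
    print(text, "->", target)
```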

thomwolf commented 5 years ago

Hi, no need to mask, just input your sequence and keep the hidden states of the top layer at the tokens that correspond to your ingredients.

If your ingredients are not in the vocabulary, they will be split by the tokenizer into sub-word units (which is totally fine). Then just use the mean or the max of the representations of all the sub-word tokens in an ingredient as its representation (e.g. torch.mean(output[0, 1:3, :], dim=0) if your ingredient word is made of tokens number 1 and 2 in the first example of the batched input sequence).
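In code, roughly (assuming a recent version of the library, where the model returns output objects instead of tuples; the sentence and the token positions are made up for illustration):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "mix the flour with two eggs"  # made-up recipe text
inputs = tokenizer(sentence, return_tensors="pt")

# Inspect where the ingredient's sub-word pieces landed.
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Suppose the ingredient spans token positions 2 and 3 (after [CLS]):
# mean-pool (or max-pool) over those sub-word vectors to get one embedding.
ingredient_embedding = hidden_states[0, 2:4, :].mean(dim=0)
print(ingredient_embedding.shape)  # torch.Size([768])
```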

timoderbeste commented 5 years ago

> Hi, no need to mask, just input your sequence and keep the hidden states of the top layer at the tokens that correspond to your ingredients.
>
> If your ingredients are not in the vocabulary, they will be split by the tokenizer into sub-word units (which is totally fine). Then just use the mean or the max of the representations of all the sub-word tokens in an ingredient as its representation (e.g. torch.mean(output[0, 1:3, :], dim=0) if your ingredient word is made of tokens number 1 and 2 in the first example of the batched input sequence).

I am trying to figure out how BertForMaskedLM actually works. I saw that in the example we do not need to mask the input sequence "Hello, my dog is cute", but in the code I did not see the random masking taking place either. I am wondering which word of this input sequence is then masked, and where the ground truth is provided.

I am only trying to understand this because I am trying to fine-tune the BERT model on a task that also involves predicting a masked word, and I need to figure out how to process the input sequence to signal the "[MASK]" token and make the model predict the actual masked-out word.
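Concretely, is something like the following the right way to do it? (This assumes a recent version of the library; the example text and the -100 label handling are just my guess.)

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Put [MASK] directly into the text; the tokenizer maps it to the mask token id.
inputs = tokenizer("Hello, my dog is [MASK]", return_tensors="pt")

# Labels have the same shape as input_ids; positions set to -100 are ignored by
# the loss, so only the masked position carries a ground-truth label ("cute").
labels = torch.full_like(inputs["input_ids"], -100)
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
labels[mask_positions] = tokenizer.convert_tokens_to_ids("cute")

with torch.no_grad():
    outputs = model(**inputs, labels=labels)

predicted_ids = outputs.logits[0, mask_positions[1]].argmax(dim=-1)
print(outputs.loss.item(), tokenizer.decode(predicted_ids))
```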

YanglanWang commented 5 years ago

It seems that there is nothing like the "run_pretraining.py" script from google-research/bert (which is written in TensorFlow) in this repository, and the pretrained model is converted from TensorFlow, right?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bnicholl commented 4 years ago

Has anyone figured out exactly how words in BERT are masked for masked LM, or where this happens in the code? I'm trying to understand whether the masked tokens are chosen at random again for every single epoch.

LysandreJik commented 4 years ago

That would be related to the training script. If you're using the run_lm_finetuning.py script, then these lines are responsible for the token masking.
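Condensed, the logic in those lines looks roughly like this (a sketch, not the script verbatim: the real code also excludes special tokens and padding from masking, and it is re-applied to every batch, so the masked positions change from epoch to epoch):

```python
import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """Rough sketch of BERT-style dynamic masking: pick ~15% of token
    positions as prediction targets, replace 80% of those with [MASK],
    10% with a random token, and leave 10% unchanged."""
    labels = inputs.clone()

    # Choose which positions will be predicted.
    masked_indices = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked_indices] = -100  # ignored by the loss

    # 80% of the chosen positions -> [MASK]
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.mask_token_id

    # Half of the remaining 20% -> a random token from the vocabulary
    indices_random = (torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
                      & masked_indices & ~indices_replaced)
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # The final 10% keep their original token but are still predicted.
    return inputs, labels
```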