huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

pipeline gives a different result than the other approach in predicting word probability #31995

Closed: xiulinyang closed this issue 1 month ago

xiulinyang commented 1 month ago

Hi,

I'm interested in using BERT-family models to predict the probability of a masked word. I found two possible approaches online, but they give me different results.

Approach One

from transformers import pipeline
nlp = pipeline("fill-mask", model="roberta-base")
nlp(f"This is the best thing I've {nlp.tokenizer.mask_token} in my life.", targets= ['done', 'seen'])

Result:

[{'score': 2.884598302443919e-07,
  'token': 27057,
  'token_str': 'done',
  'sequence': "This is the best thing I'vedone in my life."},
 {'score': 1.1685681755579935e-07,
  'token': 24196,
  'token_str': 'seen',
  'sequence': "This is the best thing I'veseen in my life."}]

Approach Two

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

model.eval()  # Put the model in evaluation mode
text = f"This is the best thing I've {tokenizer.mask_token} in my life."
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
mask_index = tokenized_text.index(' <mask>')
# Convert to tensors
tokens_tensor = torch.tensor([indexed_tokens])

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs.logits

print(predictions)
# Apply softmax to get probabilities for the masked token
softmax_probs = F.softmax(predictions[0, mask_index], dim=-1)
done_id = tokenizer.convert_tokens_to_ids('done')
seen_id = tokenizer.convert_tokens_to_ids('seen')
print(softmax_probs[done_id].item())
print(softmax_probs[seen_id].item())

Result:

2.2915602926332213e-07
3.6690291693730614e-08

My questions are: (1) Doesn't the score in pipeline mean the probability of the masked token?

(2) If the second method is also correct (I'm not very sure though), what causes the discrepancy? I'm concerned about this discrepancy because in some of my experiments, I got opposite results (i.e., the first approach assigns a higher score to one token, while the second approach assigns a higher score to the other). In this case, which method should I trust?

(3) Does the space preceding <mask> in mask_index = tokenized_text.index(' <mask>') matter? When I removed the space, the tokenizer couldn't retrieve the index of the mask token. When I added a space before 'done' and 'seen' in approach one, I got much higher scores, but approach two gave me much lower scores. (I'm working on Chinese, so the space isn't an issue for me, but I thought I'd report it here in case it causes problems.)
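
For illustration, here is a quick check (separate from the script above, using the same roberta-base tokenizer) showing that the with-space and without-space variants are different tokens with different ids:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# 'done' with no leading space and ' done' inside a sentence are different BPE tokens,
# so they map to different ids and get different probabilities at the mask position.
print(tokenizer.tokenize('done'), tokenizer.convert_tokens_to_ids('done'))
print(tokenizer.tokenize(' done'), tokenizer.convert_tokens_to_ids('Ġdone'))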

(4) Do BERT-family models 'understand' instructions? For example, if I ask BERT to fill the mask in the following sentences:

Thanks! :)

ashishpatel26 commented 1 month ago
Question: Why are the scores from the two approaches different when predicting the probability of a masked word?

When using BERT-family models to predict the probability of a masked word, different methods can disagree because of differences in tokenization, model configuration, and implementation details. The two snippets above look equivalent but are not, which is why they give different numbers. Here is a breakdown of where they diverge and how to interpret the results.

Background

Tokenization Differences

Tokenization converts the input text into a sequence of token ids, and the two approaches do not build the same sequence. Calling the tokenizer directly, as the pipeline does, adds the model's special tokens (<s> and </s> for RoBERTa), while tokenizer.tokenize followed by convert_tokens_to_ids does not. In addition, RoBERTa's byte-level BPE treats a word with and without a leading space as different tokens, so how spaces around the mask and the target words are handled changes which vocabulary entries get scored.
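
A quick way to see the first difference (a minimal check, assuming roberta-base as in the question) is to compare the ids produced by the two encoding paths:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text = f"This is the best thing I've {tokenizer.mask_token} in my life."

# Path the pipeline takes: special tokens (<s> ... </s>) are added automatically.
pipeline_style_ids = tokenizer(text)["input_ids"]

# Path in the manual snippet: tokenize + convert_tokens_to_ids adds no special tokens.
manual_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

print(len(pipeline_style_ids), len(manual_ids))  # lengths differ by the special tokens
print(pipeline_style_ids[0], manual_ids[0])      # the very first id already differs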

Model Configuration

The pipeline also takes care of model setup, such as putting the model in evaluation mode, placing it on the right device, and pairing it with the matching tokenizer. A manual script has to handle this itself, although in this case it is not the source of the gap: from_pretrained already returns the model in evaluation mode, so the explicit model.eval() call is redundant but harmless.
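
This part is easy to rule out with a quick check (assuming roberta-base); both setups report the same mode:

from transformers import AutoModelForMaskedLM, pipeline

model = AutoModelForMaskedLM.from_pretrained("roberta-base")
print(model.training)  # False: from_pretrained returns the model in eval mode

nlp = pipeline("fill-mask", model="roberta-base")
print(nlp.model.training)  # False as well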

Implementation Details

Finally, the details of how the score is computed matter: where the mask position is located in the sequence, which token ids are used for the target words (with or without a leading space), and over which dimension the softmax is applied. Each of these choices changes the reported probability.
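
The scoring step itself is conceptually simple: given the logits at the mask position, a target token's score is its softmax probability over the whole vocabulary. A toy sketch with plain tensors (illustrative only, not the pipeline's actual code):

import torch

# Pretend vocabulary of five tokens; these stand in for the model's logits at the mask position.
logits_at_mask = torch.tensor([2.0, 0.5, -1.0, 0.0, 1.0])

# Softmax over the vocabulary dimension turns logits into probabilities that sum to 1.
probs = logits_at_mask.softmax(dim=-1)

target_id = 4
print(probs[target_id].item())  # this is what gets reported as the 'score' for that target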

Why Different Results?

The scores differ mainly for the following reasons:

  1. Preprocessing steps: the pipeline's preprocessing (adding special tokens, building the attention mask, locating the mask position by id) is not visible in the calling code, but it changes the logits the model produces.
  2. Manual method variability: in the manual version there is more room for error in how the mask index is found, how the input tensor is built (here, without special tokens), and which target ids are looked up.

Example Code and Results

Pipeline Method:

from transformers import pipeline

nlp = pipeline("fill-mask", model="roberta-base")
results = nlp(f"This is the best thing I've {nlp.tokenizer.mask_token} in my life.", targets=['done', 'seen'])
print(results)

Result:

[
    {
        'score': 2.884598302443919e-07,
        'token': 27057,
        'token_str': 'done',
        'sequence': "This is the best thing I've done in my life."
    },
    {
        'score': 1.1685681755579935e-07,
        'token': 24196,
        'token_str': 'seen',
        'sequence': "This is the best thing I've seen in my life."
    }
]

Manual Method:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

model.eval()
text = f"This is the best thing I've {tokenizer.mask_token} in my life."
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
mask_index = tokenized_text.index(tokenizer.mask_token)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs.logits

softmax_probs = F.softmax(predictions[0, mask_index], dim=-1)
done_id = tokenizer.convert_tokens_to_ids('done')
seen_id = tokenizer.convert_tokens_to_ids('seen')
print(softmax_probs[done_id].item())
print(softmax_probs[seen_id].item())

Result:

2.2915602926332213e-07
3.6690291693730614e-08

Recommendations

  1. Consistency in tokenization: make sure both methods build exactly the same input ids, including special tokens, and look up the target words with the same leading-space convention (see the sketch below).
  2. Use of pipeline: generally the easier option; it handles preprocessing, special tokens, and postprocessing for you, but that also means those steps are less visible.
  3. Manual method: gives more control, but requires you to reproduce the pipeline's preprocessing yourself (special tokens, mask position, target ids).
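
As a concrete version of recommendation 1, here is a minimal sketch (assuming roberta-base, and deliberately scoring the space-prefixed tokens 'Ġdone' / 'Ġseen' that RoBERTa uses for words in the middle of a sentence) of a manual computation that mirrors the pipeline's preprocessing:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

text = f"This is the best thing I've {tokenizer.mask_token} in my life."

# Encode the same way the pipeline does: special tokens included, tensors returned.
inputs = tokenizer(text, return_tensors="pt")

# Locate the mask position from the ids rather than from token strings.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the vocabulary at the mask position.
probs = logits[0, mask_index].softmax(dim=-1)

# Score the space-prefixed variants, since the word follows a space in the sentence.
for token in ["Ġdone", "Ġseen"]:
    token_id = tokenizer.convert_tokens_to_ids(token)
    print(token, probs[0, token_id].item())

If the pipeline is given the same space-prefixed targets (for example targets=[' done', ' seen']), the two sets of numbers should line up, apart from floating-point noise.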

amyeroberts commented 1 month ago

cc @ArthurZucker

xiulinyang commented 1 month ago

Hi, I solved it. It turned out that the manual approach was applying the softmax incorrectly.