huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Unexpected/wrong handling of added special tokens in special_tokens_mask (GPT1, BERT, possibly others) #7951

Open matejklemen opened 4 years ago

matejklemen commented 4 years ago

Who can help

@mfuntowicz (tokenization) seems most appropriate; git blame points to @thomwolf.

Information

Model I am using (Bert, XLNet ...): OpenAI GPT (also BERT)

I am adding special tokens (BOS, SEP and EOS) to the GPT1 tokenizer in order to format and fine-tune a GPT model a bit differently. I am also using the convenient return_special_tokens_mask argument of encode_plus(), but the returned mask does not mark the added custom special tokens as special.

The same holds when adding custom special tokens to the BERT tokenizer; I did not check beyond these two. For GPT, the problem appears to be that get_special_tokens_mask() in tokenization_utils.py does not take any special tokens into account:

def get_special_tokens_mask(
    self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
) -> List[int]:
    # Always returns all zeros, regardless of which special tokens have been added
    return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))

For BERT, it only takes [CLS] and [SEP] into account, as the sketch below shows.
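
A minimal sketch of the analogous BERT case (the <ent> token is a hypothetical custom special token chosen purely for illustration):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# <ent> is a made-up custom special token for this example
tokenizer.add_special_tokens({"additional_special_tokens": ["<ent>"]})

encoded = tokenizer.encode_plus("The <ent> Doctor <ent> has arrived.",
                                return_special_tokens_mask=True)
print(encoded["input_ids"])
# Only the [CLS] and [SEP] positions get marked; the <ent> positions stay 0
print(encoded["special_tokens_mask"])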

To reproduce

from transformers import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
tokenizer.add_special_tokens({
    "bos_token": "<bos>",
    "sep_token": "<sep>",
    "eos_token": "<eos>"
})

# Does not work this way either
# tokenizer.add_special_tokens({
#     "additional_special_tokens": ["<bos>", "<sep>", "<eos>"]
# })

encoded = tokenizer.encode_plus("<bos> State your name, rank and intention <sep> The Doctor, doctor, fun. <eos>",
                                return_special_tokens_mask=True)
print(encoded["input_ids"])
print(encoded["special_tokens_mask"])  # This returns all zeros

Expected behavior

I would expect the additional special tokens to also be marked as special, i.e. that the special_tokens_mask in the snippet above would be [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1].
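
In the meantime, a possible workaround (a sketch, reusing encoded and tokenizer from the snippet above) is to rebuild the mask by hand from tokenizer.all_special_ids, which does include the added tokens:

# Workaround sketch: mark a position as special whenever its id belongs to
# tokenizer.all_special_ids (that list does contain the added tokens)
special_ids = set(tokenizer.all_special_ids)
manual_mask = [1 if token_id in special_ids else 0 for token_id in encoded["input_ids"]]
print(manual_mask)  # should match the expected mask above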

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

axel-op commented 3 years ago

Keep it open

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

axel-op commented 3 years ago

👋

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

matejklemen commented 3 years ago

Bump