huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Unexpected/wrong handling of added special tokens in special_tokens_mask (GPT1, BERT, possibly others) #7951

Open matejklemen opened 4 years ago

matejklemen commented 4 years ago

Who can help

@mfuntowicz (tokenization) seems most appropriate; git blame points to @thomwolf.

Information

Model I am using (Bert, XLNet ...): OpenAI GPT (also BERT)

I am adding special tokens (BOS, SEP and EOS) to the GPT1 tokenizer in order to format and fine-tune a GPT model a bit differently. I am also using the convenient return_special_tokens_mask argument of encode_plus(), but the returned mask does not mark the added custom special tokens as special.

The same holds when adding custom special tokens to the BERT tokenizer; I did not check beyond these two. For GPT, the problem appears to be that get_special_tokens_mask() in tokenization_utils.py does not take any special tokens into account:

def get_special_tokens_mask(
    self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
) -> List[int]:
    # Always returns all zeros, regardless of which special tokens have been added
    return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))

For BERT, it only takes [CLS] and [SEP] into account, as the sketch below shows.
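
A minimal sketch of the analogous BERT case (the <ent> token is a hypothetical custom special token chosen purely for illustration):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# <ent> is a made-up custom special token for this example
tokenizer.add_special_tokens({"additional_special_tokens": ["<ent>"]})

encoded = tokenizer.encode_plus("The <ent> Doctor <ent> has arrived.",
                                return_special_tokens_mask=True)
print(encoded["input_ids"])
# Only the [CLS] and [SEP] positions get marked; the <ent> positions stay 0
print(encoded["special_tokens_mask"])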

To reproduce

from transformers import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
tokenizer.add_special_tokens({
    "bos_token": "<bos>",
    "sep_token": "<sep>",
    "eos_token": "<eos>"
})

# Does not work this way either
# tokenizer.add_special_tokens({
#     "additional_special_tokens": ["<bos>", "<sep>", "<eos>"]
# })

encoded = tokenizer.encode_plus("<bos> State your name, rank and intention <sep> The Doctor, doctor, fun. <eos>",
                                return_special_tokens_mask=True)
print(encoded["input_ids"])
print(encoded["special_tokens_mask"])  # This returns all zeros

Expected behavior

I would expect the additional special tokens to also be marked as special, i.e. that the special_tokens_mask in the snippet above would be [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1].
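
In the meantime, a possible workaround (a sketch, reusing encoded and tokenizer from the snippet above) is to rebuild the mask by hand from tokenizer.all_special_ids, which does include the added tokens:

# Workaround sketch: mark a position as special whenever its id belongs to
# tokenizer.all_special_ids (that list does contain the added tokens)
special_ids = set(tokenizer.all_special_ids)
manual_mask = [1 if token_id in special_ids else 0 for token_id in encoded["input_ids"]]
print(manual_mask)  # should match the expected mask above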

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

axel-op commented 3 years ago

Keep it open

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

axel-op commented 3 years ago

👋

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

matejklemen commented 3 years ago

Bump