huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Pad encoded input with CamembertTokenizer #883

Closed nathanb97 closed 2 years ago

nathanb97 commented 2 years ago

As indicated in your documentation, I encode a sequence in order to use your model's embeddings, but I couldn't find a padding function.

from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")

I can do it myself, but I don't know whether I should keep token 6 at the end or token 1 (the pad token):

import torch
from tensorflow.keras.preprocessing.sequence import pad_sequences  # assuming the Keras padding helper

max_length = 32  # target padded length

# Tokenize and encode the sentence; encode() adds the <s>/</s> special tokens
tokenized_sentence = tokenizer.tokenize("J'aime le camembert!")
encoded_sentence = tokenizer.encode(tokenized_sentence)

# Pad manually with the pad token id and add the batch dimension
encoded_sentence = torch.tensor(
    pad_sequences([encoded_sentence], maxlen=max_length, value=tokenizer.pad_token_id, padding='post')
)
embeddings = camembert(encoded_sentence)[0]

Output obtained:

tensor([[    5,   121,    11,   660,    16,   730, 25543,   110,   152,     6,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1]], dtype=torch.int32)

But maybe the token 6 must come at the end, after the padding:

tensor([[    5,   121,    11,   660,    16,   730, 25543,   110,   152,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     6]], dtype=torch.int32)
Narsil commented 2 years ago

Hi @nathanb97, this seems like a transformers issue rather than a tokenizers one, since CamembertTokenizer and CamembertModel both live in transformers.

That being said, I think I can answer directly:

The token 6 in your first output is correctly located: padding should only come after the full boundaries of the text, including tokens 5 and 6, which are the BOS and EOS tokens marking where the sentence starts and ends.
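
As a quick check, you can read these ids straight off the tokenizer (a minimal sketch, assuming the same tokenizer object as above; the values in the comments match the ids visible in your output):

# Special token ids used by CamembertTokenizer ("camembert-base")
print(tokenizer.bos_token_id)  # 5  -> <s>
print(tokenizer.eos_token_id)  # 6  -> </s>
print(tokenizer.pad_token_id)  # 1  -> <pad>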

[5, ..., 6, pad, pad, pad] is always correct. Padding should also be accompanied by an attention_mask, which most models need in order to give fully correct results. Doing

tokenizer("This is a test", padding="max_length", max_length = 20, return_tensors="pt")
# {'input_ids': [5, 17526, 2856, 33, 2006, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

for instance will give you what you need, then

model_inputs = tokenizer("This is a test", padding="max_length", max_length = 20, return_tensors="pt")
outputs = model(**model_inputs)

should work out of the box.
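
For completeness, here is a minimal end-to-end sketch (assuming torch is installed; outputs[0] is the last hidden state in the standard transformers API):

import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")

# The tokenizer handles special tokens, padding and the attention mask in one call
model_inputs = tokenizer(
    "J'aime le camembert!",
    padding="max_length",
    max_length=32,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = camembert(**model_inputs)

# outputs[0] is the last hidden state: one embedding per token position,
# with padded positions ignored by attention thanks to attention_mask
embeddings = outputs[0]
print(embeddings.shape)  # torch.Size([1, 32, 768]) for camembert-base

Since the attention mask keeps the pad positions from influencing attention, you can simply skip them when pooling the token embeddings.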

nathanb97 commented 2 years ago

Hello,

Thank you very much for your answer. I'm closing the issue.