Hi @nathanb97, this seems like a transformers issue, since CamembertTokenizerFast lives in that library. That being said, I think I can answer directly:
Token 6 in your output is correctly located, since padding should only happen after the full boundaries of the text, including 5 and 6, which are the BOS and EOS tokens indicating that the sentence is starting and ending. [5, ..., 6, pad, pad, pad] is always correct.
Padding is also used together with an attention_mask, which is needed to get fully correct results on most models.
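Since you mentioned you could pad manually, here is a minimal sketch of what that would look like, assuming CamemBERT's usual special token IDs (5 = BOS, 6 = EOS, 1 = pad); the helper name pad_to_length is just for illustration:

```python
# Minimal manual-padding sketch (assumes pad token id 1, as in CamemBERT).
def pad_to_length(input_ids, max_length, pad_id=1):
    # Padding goes *after* the EOS token (6); the attention mask marks the
    # padded positions with 0 so the model ignores them.
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    padded_ids = input_ids + [pad_id] * n_pad
    return padded_ids, attention_mask

ids, mask = pad_to_length([5, 17526, 2856, 33, 2006, 6], 10)
# ids  -> [5, 17526, 2856, 33, 2006, 6, 1, 1, 1, 1]
# mask -> [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

In practice you don't need to write this yourself, because the tokenizer can do both steps for you.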
Doing
tokenizer("This is a test", padding="max_length", max_length=20)
# {'input_ids': [5, 17526, 2856, 33, 2006, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
for instance will give you what you need. Then
model_inputs = tokenizer("This is a test", padding="max_length", max_length=20, return_tensors="pt")
outputs = model(**model_inputs)
should work out of the box.
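If your end goal is to extract embeddings from the model, here is a hedged end-to-end sketch; the checkpoint name "camembert-base" and the mean-pooling step are my assumptions, not something stated in this issue:

```python
# End-to-end sketch: tokenize with padding, run the model, read out embeddings.
# Assumption: the "camembert-base" checkpoint; swap in whichever model you actually use.
import torch
from transformers import CamembertModel, CamembertTokenizerFast

tokenizer = CamembertTokenizerFast.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

model_inputs = tokenizer("This is a test", padding="max_length", max_length=20, return_tensors="pt")
with torch.no_grad():
    outputs = model(**model_inputs)

# last_hidden_state has shape (batch_size, max_length, hidden_size).
# Positions where attention_mask == 0 are padding and must be ignored,
# e.g. when averaging token embeddings into a sentence embedding.
token_embeddings = outputs.last_hidden_state
mask = model_inputs["attention_mask"].unsqueeze(-1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```

The masked average at the end is just one common pooling choice; the important part is that the padded positions never contribute to the result.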
Hello,
Thank you very much for your answer. I'm closing the issue.
As indicated in your documentation, I encode a sequence in order to use the embeddings from your model.
But I couldn't find a pad function.
I can do it myself, but I don't know whether I should leave token 6 at the end or token 1 (the pad token).
Output obtained:
But maybe we must have the token 6 at the end instead: