mattf1n opened this issue 3 days ago
@mattf1n do we also reverse the order of the `<bos>` and `<eos>` tokens? If the prompt was `<bos>t1 t2 t3<eos>`, do we train on `<bos>t3 t2 t1<eos>` or `<eos>t3 t2 t1<bos>`?
Oh, interesting question. I think we can just try both with and without reversing the special tokens and see if it makes a difference? Let's start by treating all tokens equally, i.e., no special treatment for special tokens. Also I'm surprised that a prompt would end in an EOS token?
We can't treat "all" special tokens equally (well, we can, but should we?), specifically the `<pad>` tokens. We don't want to output a bunch of `<pad>` tokens at the beginning of each sentence.
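
A quick illustration of that concern (a sketch, not from the thread, using the same `t5-base` tokenizer as the reproduction below): naively reversing a right-padded batch moves the `<pad>` tokens to the front.

```python
import transformers

t = transformers.AutoTokenizer.from_pretrained("t5-base")

# Two prompts of different lengths; padding=True right-pads the shorter one.
batch = t(["hi", "hi how are you"], padding=True)

# Reversing the padded sequences wholesale moves the <pad> tokens to the front.
naive = [ids[::-1] for ids in batch["input_ids"]]
print(t.batch_decode(naive))
# The shorter sequence now begins with a run of <pad> tokens.
```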
> Also I'm surprised that a prompt would end in an EOS token?

Yes, the T5 tokenizer in particular adds an `</s>` token.
To reproduce:

```python
import transformers

t = transformers.AutoTokenizer.from_pretrained("t5-base")
tokenized = t(["hi how are you"])
print(tokenized)
print(t.batch_decode(tokenized["input_ids"]))
```

Output:

```
{'input_ids': [[7102, 149, 33, 25, 1]], 'attention_mask': [[1, 1, 1, 1, 1]]}
['hi how are you</s>']
```

Token id `1` is the `</s>` token added by the T5 tokenizer.
Yeah, you are right. Now I think we should tokenize without special tokens, reverse the order, then add the special tokens back, as in the sketch below.
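
A minimal sketch of that recipe (assuming the same `t5-base` tokenizer; here the only special token re-added is the trailing `</s>`):

```python
import transformers

t = transformers.AutoTokenizer.from_pretrained("t5-base")

# 1. Tokenize without special tokens, so no </s> is appended.
tokenized = t(["hi how are you"], add_special_tokens=False)

# 2. Reverse each sequence, then 3. re-add the EOS token at the end.
reversed_ids = [ids[::-1] + [t.eos_token_id] for ids in tokenized["input_ids"]]

print(t.batch_decode(reversed_ids))
# Expected: ['you are how hi</s>']
```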
1: it will do better, be more general. 2: pre-training hurts the model.