SinanAkkoyun closed this 1 year ago

I'm curious, what does padding the token IDs accomplish?
Padding with padding tokens allows you to run multiple sequences of different lengths in the same batch. "Padding" in this PR is really more prepending and appending the BOS and EOS tokens, respectively. Maybe the arguments should be called `add_bos` and `add_eos` to avoid this confusion?
Also, this is more of a workaround, and I really would prefer to fully support encoding of control symbols, like in AutoTokenizer. I.e., you should just be able to add `<s>` and `</s>` at arbitrary places in the input string, as well as any other custom special tokens any given model requires. Ideally it would work when decoding as well.
> Maybe the arguments should be called `add_bos` and `add_eos` to avoid this confusion?

Sure, will change!
> Also, this is more of a workaround and I really would prefer to fully support encoding of control symbols, like in AutoTokenizer. I.e. you should just be able to add `<s>` and `</s>` at arbitrary places in the input string, as well as any other custom special tokens any given model requires. Ideally it would work when decoding as well.
I totally get that, hence my question a couple of days ago about why sentencepiece, but llama2's repo is doing exactly this: appending EOS and BOS, tokenizing each 'role' prompt, and concatenating all of those encoded IDs into a full prompt.
I wanted to implement exactly the same thing without messing with the tokenizer, so I believe this change should also be merged in conjunction with the tokenizer changes. (And doesn't decoding output the EOS and BOS strings?)
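The per-turn concatenation described above can be sketched roughly like this. The `[INST]` tags follow the llama2 chat convention; the toy `encode` and the id constants are assumptions for illustration:

```python
# Assumed llama-family sentencepiece ids:
BOS_ID, EOS_ID = 1, 2  # <s>, </s>

def encode(text: str) -> list[int]:
    """Toy tokenizer stand-in producing fake ids >= 3."""
    return [hash(tok) % 30000 + 3 for tok in text.split()]

def build_chat_ids(turns: list[tuple[str, str]]) -> list[int]:
    """Encode each (prompt, answer) turn separately, wrap it as
    [BOS] ... [EOS], and concatenate the turns into one id sequence,
    in the spirit of the llama2 reference chat completion code."""
    ids: list[int] = []
    for prompt, answer in turns:
        ids += [BOS_ID] + encode(f"[INST] {prompt} [/INST] {answer}") + [EOS_ID]
    return ids
```

Because each turn is tokenized independently and the BOS/EOS ids are inserted at the id level, the special tokens never need to survive a round trip through the string tokenizer, which is why the optional `add_bos`/`add_eos` flags suffice for this use case.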
I will later also submit a PR with the exact llama 2 chat completion implementation, as it itself uses sentencepiece with BOS and EOS appending, and I would like this tokenizer feature to be in place for that. Would that be ok with you?
This PR just adds optional functionality to the tokenizer to pad the EOS and BOS token IDs (important for some chat formats like llama2 and OpenAssistant).