jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

Question Regarding the Absence of BOS and EOS Tokens in Tokenizer Encoding #40

Closed: dtxwhzw closed this issue 12 months ago

dtxwhzw commented 12 months ago

I noticed that your tokenizer doesn't add the BOS and EOS tokens to the final tensor during encoding. Does this have any impact on pretraining? If it's intentional not to add them, what is the reason behind it?

jzhang38 commented 12 months ago

It does add a sep token: https://github.com/jzhang38/TinyLlama/blob/3322601dd28788a64e6b2085d6870aa167b5a264/lit_gpt/packed_dataset.py#L95

dtxwhzw commented 12 months ago

It does add a sep token:

https://github.com/jzhang38/TinyLlama/blob/3322601dd28788a64e6b2085d6870aa167b5a264/lit_gpt/packed_dataset.py#L95

Since self._idx starts from 0, I think the BOS tokens pre-filled into self._arr would be overwritten by the input arr:

https://github.com/jzhang38/TinyLlama/blob/3322601dd28788a64e6b2085d6870aa167b5a264/lit_gpt/packed_dataset.py#L114

Secondly, it's confirmed that there is no EOS token, right?
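A minimal sketch of the pattern being described (paraphrasing the builder logic, not the actual TinyLlama code; the class, token ids, and sizes are assumptions): the buffer is pre-filled with the separator token, but since the write cursor starts at 0, the pre-filled value at position 0 is immediately overwritten by the incoming token ids.

```python
import numpy as np

SEP_TOKEN = 1      # assumed separator/BOS id
CHUNK_SIZE = 16    # assumed block size

class PackedWriter:
    def __init__(self):
        # buffer pre-filled with the separator token
        self._arr = np.full(CHUNK_SIZE, SEP_TOKEN, dtype=np.uint16)
        self._idx = 0  # write cursor starts at 0

    def add_array(self, arr: np.ndarray):
        # copy incoming tokens starting at the cursor; because _idx == 0,
        # the pre-filled separator at position 0 is replaced right away
        n = min(len(arr), CHUNK_SIZE - self._idx)
        self._arr[self._idx : self._idx + n] = arr[:n]
        self._idx += n

writer = PackedWriter()
writer.add_array(np.array([101, 102, 103], dtype=np.uint16))
print(writer._arr[:6])  # -> [101 102 103   1   1   1]; no leading sep survives
```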

jzhang38 commented 12 months ago

Yes, your understanding is correct, and my previous response was not the right answer to your question. I rechecked the actual code used for pretraining: I set bos to True by default in the line below, so there is a sep token.

https://github.com/jzhang38/TinyLlama/blob/3322601dd28788a64e6b2085d6870aa167b5a264/lit_gpt/tokenizer.py#L54
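To illustrate the net effect (a hedged sketch; the encode wrapper and token ids here are illustrative, not the repo's API): with bos=True, every document's id sequence starts with the BOS id, so when documents are packed back-to-back the BOS acts as the separator between them.

```python
BOS_ID = 1  # assumed BOS id for the Llama tokenizer
EOS_ID = 2  # assumed EOS id, not appended during pretraining

def encode(doc_ids, bos=True, eos=False):
    # mimic a tokenizer that optionally prepends BOS / appends EOS
    ids = list(doc_ids)
    if bos:
        ids = [BOS_ID] + ids
    if eos:
        ids = ids + [EOS_ID]
    return ids

doc1 = encode([101, 102, 103])   # [1, 101, 102, 103]
doc2 = encode([201, 202])        # [1, 201, 202]
packed = doc1 + doc2             # [1, 101, 102, 103, 1, 201, 202]
print(packed)                    # BOS separates documents; no EOS anywhere
```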

Secondly, it's confirmed that there is no EOS token, right?

There is no EOS token.
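A quick way to see the same default behavior with the released checkpoints, assuming the Hugging Face transformers tokenizer (the model id is an example; any TinyLlama checkpoint shipping the Llama tokenizer should behave the same):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
ids = tok("hello world").input_ids
print(ids[0] == tok.bos_token_id)   # True  -> BOS is prepended
print(ids[-1] == tok.eos_token_id)  # False -> no EOS is appended
```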