dtxwhzw closed this issue 12 months ago
It does add a sep token:
Since `self._idx` starts from 0, I think all the BOS tokens in `self._arr` would be replaced by the input `arr`: https://github.com/jzhang38/TinyLlama/blob/3322601dd28788a64e6b2085d6870aa167b5a264/lit_gpt/packed_dataset.py#L114

Secondly, it's confirmed that there is no EOS token, right?
Yes, your understanding is correct, and my previous response was not the right answer to your question. I rechecked the actual code used for pretraining: I actually set `bos` to `True` by default in the line below, so there is a sep token.
> Secondly, it's confirmed that there is no EOS token, right?
There is no EOS token.
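To illustrate the resulting layout (a toy sketch, assuming a sentencepiece-style tokenizer where `bos=True` prepends a BOS id and `eos=False` appends nothing; the `encode` stand-in and token ids here are hypothetical, not the actual tokenizer API):

```python
BOS_ID = 1  # assumed BOS id
EOS_ID = 2  # assumed EOS id, unused when eos=False

def encode(token_ids, bos=True, eos=False):
    """Toy stand-in for the tokenizer: optionally prepend BOS / append EOS."""
    out = ([BOS_ID] if bos else []) + list(token_ids)
    if eos:
        out.append(EOS_ID)
    return out

# With bos=True and eos=False, packed sequences are delimited only by the
# BOS token at the start of each one -- BOS acts as the sep token.
packed = encode([10, 11]) + encode([12, 13, 14])
print(packed)  # -> [1, 10, 11, 1, 12, 13, 14]
```

In other words, document boundaries in the packed stream are marked by the BOS at the start of the next document, with no EOS closing the previous one.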
I noticed that your tokenizer doesn't add the BOS and EOS tokens to the final tensor during encoding. Does this have any impact on pretraining? If it's intentional not to add them, what is the reason behind it?