I'd like to know whether to use eos or bos during Code Llama pre-training

meta-llama / codellama

Inference code for CodeLlama models

I'd like to know whether to use eos or bos during Code Llama pre-training #206

Closed: ChengMingZhang-ZTE closed this issue 8 months ago

ChengMingZhang-ZTE commented 9 months ago

I am curious about the format of the dataset used for Code Llama pre-training. Specifically, I want to know whether EOS or BOS tokens were used during pre-training. For example, is each sample formatted as {code}{EOS} or as {BOS}{code}? Which format was used for Code Llama pre-training?

tangbo-sh commented 8 months ago

I have the same question. When pre-training Llama 2 (Code Llama), which token is used as the delimiter between samples: EOS or BOS?

jgehring commented 8 months ago

For training, we add both BOS and EOS tokens.
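To illustrate the reply above, here is a minimal sketch of the {BOS}{code}{EOS} layout. It is not the Code Llama training code. The token IDs (`BOS_ID = 1`, `EOS_ID = 2`), the byte-level `toy_encode` stand-in for a real tokenizer, and the idea of concatenating wrapped samples into one stream are all assumptions made for illustration.

```python
from typing import List

# Assumed special-token IDs, chosen for illustration only.
BOS_ID = 1
EOS_ID = 2


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer.

    Maps each UTF-8 byte to one ID, offset by 3 so that the
    result never collides with the special-token IDs above.
    """
    return [b + 3 for b in text.encode("utf-8")]


def wrap_sample(text: str) -> List[int]:
    """Format one sample as {BOS}{code}{EOS}."""
    return [BOS_ID] + toy_encode(text) + [EOS_ID]


def pack_samples(samples: List[str]) -> List[int]:
    """Concatenate wrapped samples into a single token stream.

    Packing is an assumption of this sketch; the thread only
    states that both BOS and EOS are added.
    """
    stream: List[int] = []
    for sample in samples:
        stream.extend(wrap_sample(sample))
    return stream


if __name__ == "__main__":
    print(pack_samples(["a", "b"]))
```

With both tokens present, every sample boundary in the stream appears as an EOS followed by a BOS, so neither token alone has to act as the delimiter.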

tangbo-sh commented 8 months ago

@jgehring I understand, thanks for your reply.