LLaMA3 supports an 8K token context length. When continuously pretraining on proprietary data, most of the text is significantly shorter than 8K tokens, which results in a substantial amount of padding. To improve training efficiency and effectiveness, multiple short texts need to be packed into a single longer sequence that stays below 8K tokens. The question is: how should these short texts be combined into one training sequence? Should they be separated by delimiters, or should an attention-masking approach be used during pretraining?
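For concreteness, here is a minimal sketch of the packing step I have in mind (a hypothetical helper, not code from llama-recipes): greedily concatenating already-tokenized documents into sequences of at most 8,192 tokens instead of padding each short document on its own.

```python
from typing import List

MAX_LEN = 8192  # LLaMA3 context length


def pack_documents(tokenized_docs: List[List[int]], max_len: int = MAX_LEN) -> List[List[int]]:
    """Greedily concatenate tokenized documents so each packed sequence stays <= max_len.

    Assumes each entry in tokenized_docs already ends with whatever separator
    token is chosen (see the delimiter question below).
    """
    packed: List[List[int]] = []
    current: List[int] = []
    for doc in tokenized_docs:
        if len(doc) > max_len:
            doc = doc[:max_len]  # truncate documents that alone exceed the context length
        if current and len(current) + len(doc) > max_len:
            packed.append(current)
            current = []
        current.extend(doc)
    if current:
        packed.append(current)
    return packed
```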
Regarding the use of delimiters: GPT-2, for example, concatenated multiple short documents during pretraining using its <|endoftext|> token as a separator. LLaMA3's tokenizer, however, does not define a dedicated separator token of that kind. It includes two stop tokens, <|end_of_text|> and <|eot_id|>: the former acts like an EOS token, while the latter marks the end of each turn in a dialogue. Should <|end_of_text|> or <|eot_id|> be used as the delimiter during training, or should a new delimiter be custom-defined?
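To illustrate the delimiter option, here is a sketch of appending <|end_of_text|> after each document before packing. The model name and the use of the Hugging Face tokenizer are my assumptions, not an officially documented recipe.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

docs = ["First short document.", "Second short document."]

# Append the <|end_of_text|> token id after each document so boundaries are explicit.
# (For the base model this is the same as tokenizer.eos_token_id.)
eos_id = tokenizer.convert_tokens_to_ids("<|end_of_text|>")
tokenized_docs = [
    tokenizer(doc, add_special_tokens=False)["input_ids"] + [eos_id]
    for doc in docs
]
```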
As for the masking approach, it is inspired by a statement in the LLaMA3 official blog: "We trained the models on sequences of 8,192 tokens, using a mask to ensure self-attention does not cross document boundaries." Does this imply that LLaMA3 does not rely on an explicitly defined short-text delimiter to merge multiple texts, but instead concatenates them with <|end_of_text|> and then applies an attention mask during pretraining so that tokens in one document cannot attend to the other documents packed into the same sequence?
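To illustrate what I understand this masking to mean, here is a sketch of a per-sequence mask that combines the usual causal constraint with a "same document" constraint. The per-position document ids are assumed to be tracked during packing, and the function name is only illustrative.

```python
import torch


def document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) tensor giving the document index of each position.

    Returns a (seq_len, seq_len) boolean mask where True means attention is
    allowed: position i may attend to position j only if j <= i (causal) and
    both positions belong to the same packed document.
    """
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc


# Example: two documents of lengths 3 and 2 packed into one sequence of length 5.
doc_ids = torch.tensor([0, 0, 0, 1, 1])
mask = document_causal_mask(doc_ids)
```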
Originally posted by @guxungang in https://github.com/meta-llama/llama-recipes/issues/538