Closed chowkamlee81 closed 1 month ago
Thanks for the question. We've provided code in the adaptllm repo to convert raw corpora into a reading comprehension format. After that, you’ll need to mix the converted data with general instructions from OpenOrca at a 1:1 ratio (counted by tokens).
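The 1:1 mixing step (counted by tokens, not samples) could be sketched roughly like this — note the function name and the whitespace token counter below are illustrative placeholders, not code from the adaptllm repo:

```python
import random

def mix_one_to_one(domain_texts, general_texts, count_tokens, seed=0):
    """Mix two corpora so each side contributes roughly equal token counts.

    domain_texts: converted reading-comprehension samples.
    general_texts: general-instruction samples (e.g. from OpenOrca).
    count_tokens: callable returning the token count of one string
                  (in practice, the tokenizer of the model being trained).
    """
    rng = random.Random(seed)
    domain_tokens = sum(count_tokens(t) for t in domain_texts)

    # Draw general-instruction samples until their token count
    # reaches the domain corpus total (1:1 ratio by tokens).
    pool = general_texts[:]
    rng.shuffle(pool)
    mixed, general_tokens = list(domain_texts), 0
    for t in pool:
        if general_tokens >= domain_tokens:
            break
        mixed.append(t)
        general_tokens += count_tokens(t)

    rng.shuffle(mixed)  # interleave the two sources for training
    return mixed

# Toy run with whitespace tokenization standing in for a real tokenizer.
corpus = mix_one_to_one(
    ["domain doc one", "domain doc two"],          # 6 tokens total
    ["inst a b", "inst c d", "inst e f"],          # 3 tokens each
    lambda s: len(s.split()),
)
```

With the toy inputs above, two general-instruction samples (6 tokens) are drawn to balance the 6 domain tokens; with a real tokenizer you would count subword tokens instead.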
Apart from the pre-training data, our pre-training process is the same as vanilla language-model pre-training. You may refer to our pre-training suggestions or this issue for more details.
For AdaptLLM, where can we find the training code? Only inference code is provided.