jaideepr97 closed this pull request 2 weeks ago.
@Maxusmusti Can you confirm that we don't need any `<|end_of_text|>` tokens in the pre-training samples here? The chat template at https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188 adds one after the message from each role. If we need to follow that format exactly, wouldn't we also need these `<|end_of_text|>` tokens after one role's text before starting the next role's tokens?
This pull request has merge conflicts that must be resolved before it can be merged. @jaideepr97 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@bbrowning Yeah, you don't need to add the `end_of_text` token; that gets added by the chat template: https://github.com/instructlab/training/pull/319/files#diff-a8438361fc1435b584fec100fc73bd5bdc7856dc9826a570dd9dc7a6321f9bbcR30
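To illustrate the point above, here is a minimal sketch of how a granite-3.0-style chat template lays out role turns. The function name and the exact rendering logic are hypothetical (the real formatting is done by the tokenizer's Jinja chat template); the token strings mirror the granite-3.0 special tokens referenced in the linked `tokenizer_config.json`.

```python
def render_granite_style(messages, add_end_of_text=True):
    """Hypothetical sketch of granite-3.0 turn formatting.

    Each turn is wrapped in role tokens; the chat template appends
    <|end_of_text|> after each role's message, which is why samples
    fed into it should not include that token themselves.
    """
    parts = []
    for msg in messages:
        turn = f"<|start_of_role|>{msg['role']}<|end_of_role|>{msg['content']}"
        if add_end_of_text:
            turn += "<|end_of_text|>"
        parts.append(turn)
    return "\n".join(parts)


rendered = render_granite_style(
    [
        {"role": "user", "content": "hello"},
        {"role": "assistant", "content": "hi there"},
    ]
)
```

Because the template injects `<|end_of_text|>` itself, pre-training samples passed to it stay free of that token and no double markers appear in the rendered text.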
This PR adds support for converting messages datasets into multiple pre-training formats, so that both the granite 7b and granite 3.0 student models can be used. It accepts a `use_legacy_pretraining_format` parameter as input to choose the appropriate format. This is intended as a short-term solution; the long-term idea is that SDG would be agnostic of student-model requirements such as these.
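The branching the description talks about could look roughly like the sketch below. The function name, the exact token strings, and the legacy layout are assumptions for illustration only; the real conversion lives in the SDG/training code and may differ in detail.

```python
def messages_to_pretraining_text(messages, use_legacy_pretraining_format):
    """Hypothetical sketch: flatten a chat `messages` sample into a
    single pre-training string, picking the format for the target
    student model.

    - Legacy (granite 7b style): bare role tags per turn.
    - Newer (granite 3.0 style): start/end role tokens; <|end_of_text|>
      is deliberately omitted because the chat template appends it.
    """
    if use_legacy_pretraining_format:
        return "\n".join(
            f"<|{m['role']}|>\n{m['content']}" for m in messages
        )
    return "\n".join(
        f"<|start_of_role|>{m['role']}<|end_of_role|>{m['content']}"
        for m in messages
    )


sample = [
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "answer"},
]
legacy = messages_to_pretraining_text(sample, use_legacy_pretraining_format=True)
modern = messages_to_pretraining_text(sample, use_legacy_pretraining_format=False)
```

Keeping the flag at the conversion boundary like this is what makes the short-term fix easy to remove later: once SDG no longer needs to know about student-model formats, the branch (and the parameter) can be deleted in one place.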