feat: support converting messages datasets into multiple pre-training formats

instructlab / sdg

Python library for Synthetic Data Generation

https://pypi.org/project/instructlab-sdg/

Apache License 2.0

23 stars 36 forks source link

feat: support converting messages datasets into multiple pre-training formats #341

Closed jaideepr97 closed 2 weeks ago

jaideepr97 commented 2 weeks ago

This PR adds support for converting messages datasets into multiple pre-training formats to support working with both granite 7b and granite 3.0 student models. It accepts a use_legacy_pretraining_format parameter as input to appropriately choose the right format to use

This is intended to be a short term solution, with the long term idea being that SDG would be agnostic of student model requirements such as these

bbrowning commented 2 weeks ago

@Maxusmusti Can you confirm that we don't need any <|end_of_text|> tokens in the pre-training samples here, like the chat template at https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188 uses after the message from each role? If we need to follow that format exactly, we'd also need these <|end_of_text|> tokens after the text from 1 role before starting the new role tokens?

mergify[bot] commented 2 weeks ago

This pull request has merge conflicts that must be resolved before it can be merged. @jaideepr97 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Maxusmusti commented 2 weeks ago

@bbrowning yeah you don't need to add the end_of_text token, that gets added by the chat template: https://github.com/instructlab/training/pull/319/files#diff-a8438361fc1435b584fec100fc73bd5bdc7856dc9826a570dd9dc7a6321f9bbcR30