llm-jp / scripts

Apache License 2.0

Dump raw training data for the LLM-jp-3 series #46

Open hkiyomaru opened 1 month ago

hkiyomaru commented 1 month ago

Dump the raw training data for the LLM-jp-3 series. At a minimum, each training instance should include the following fields:

token_ids: A list of token IDs for the training instance
training_step: Training step at which the training instance was processed
dataset: Name of the dataset from which the instance was sourced
document_ids: IDs of the documents associated with the training instance
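
For illustration, one dumped record with the fields above might look like the following sketch. The JSON Lines layout, the `TrainingInstance` class, the helper names, the string type for document IDs, and all example values are assumptions made here for illustration; they are not taken from the actual dump code.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TrainingInstance:
    """Hypothetical record type holding the four requested fields."""
    token_ids: List[int]     # token IDs for the training instance
    training_step: int       # training step at which the instance was processed
    dataset: str             # name of the dataset the instance was sourced from
    document_ids: List[str]  # IDs of the associated documents (str is an assumption)


def to_jsonl_line(instance: TrainingInstance) -> str:
    """Serialize one instance as a single JSON line."""
    return json.dumps(asdict(instance), ensure_ascii=False)


def from_jsonl_line(line: str) -> TrainingInstance:
    """Parse a JSON line back into a TrainingInstance."""
    return TrainingInstance(**json.loads(line))


# Made-up example values, not real LLM-jp-3 training data.
example = TrainingInstance(
    token_ids=[101, 2345, 678, 102],
    training_step=1000,
    dataset="example_corpus",
    document_ids=["doc-000123", "doc-000124"],
)
line = to_jsonl_line(example)
restored = from_jsonl_line(line)
```

Writing one record per line keeps the dump streamable, so instances can be filtered by `training_step` or `dataset` without loading the whole file.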

hkiyomaru commented 1 month ago

https://github.com/llm-jp/Megatron-LM/tree/nii-geniac-dump