Open hkiyomaru opened 1 month ago
https://github.com/llm-jp/Megatron-LM/tree/nii-geniac-dump
Dump raw training data for the LLM-jp-3 series. Each training instance should include at least the following fields:
- token_ids: A list of token IDs for the training instance
- training_step: Training step at which the training instance was processed
- dataset: Name of the dataset from which the instance was sourced
- document_ids: IDs of the documents associated with the training instance
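
To make the requested fields concrete, here is a minimal sketch of what one dumped record could look like. The issue does not specify an on-disk format, so the JSON Lines serialization, the `TrainingInstance` class name, the helper functions, and the integer type for `document_ids` are all assumptions for illustration, not the format used in the linked branch.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TrainingInstance:
    """Hypothetical record holding the four fields requested in this issue."""

    token_ids: List[int]      # token IDs for the training instance
    training_step: int        # step at which the instance was processed
    dataset: str              # name of the source dataset
    document_ids: List[int]   # IDs of the associated documents (int type is an assumption)


def to_jsonl_line(instance: TrainingInstance) -> str:
    """Serialize one instance as a single JSON line (assumed dump format)."""
    return json.dumps(asdict(instance), ensure_ascii=False)


def from_jsonl_line(line: str) -> TrainingInstance:
    """Parse one JSON line back into a TrainingInstance."""
    return TrainingInstance(**json.loads(line))


# Placeholder values only; these are not real LLM-jp-3 training data.
example = TrainingInstance(
    token_ids=[101, 2054, 2003],
    training_step=1,
    dataset="example_corpus",
    document_ids=[42],
)
line = to_jsonl_line(example)
restored = from_jsonl_line(line)
```

One record per line keeps the dump appendable during training and easy to filter by `training_step` or `dataset` afterwards, but any format that preserves these four fields would satisfy the request.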