```yaml
# NOTE: Use this if you want different datasets for different stages of training.
data:
  dataset_stages:
    - name: Stable Training Stage
      training_steps: 1
      dataset:
        dataset_overwrite_cache: false
        dataset_processing_num_proc_per_process: 1
        hf_dataset_config_name: null
        hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
        hf_dataset_splits: train
        text_column_name: completion
    - name: Annealing Phase
      training_steps: 10
      dataset:
        dataset_overwrite_cache: false
        dataset_processing_num_proc_per_process: 1
        hf_dataset_config_name: null
        hf_dataset_or_datasets: stas/c4-en-10k
        hf_dataset_splits: train
        text_column_name: text
```
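The stage schedule above amounts to a simple lookup: given the current training step, the active stage is the last one whose `training_steps` threshold has been reached. A minimal sketch of that selection logic, assuming the field names from the YAML above (this is an illustrative helper, not the trainer's actual implementation):

```python
# Hypothetical sketch of per-step stage selection.
# Field names mirror the YAML config above; the lookup logic is an assumption.
stages = [
    {"name": "Stable Training Stage", "training_steps": 1,
     "dataset": "HuggingFaceH4/testing_alpaca_small"},
    {"name": "Annealing Phase", "training_steps": 10,
     "dataset": "stas/c4-en-10k"},
]

def active_stage(step, stages):
    """Return the last stage whose start step is <= the current step."""
    current = stages[0]
    for stage in stages:
        if step >= stage["training_steps"]:
            current = stage
    return current

# With the config above: steps 1-9 draw from the Alpaca test set,
# and from step 10 onward training switches to the C4 subset.
```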
## Reproduce

You can either use a single dataset for the entire training run, or use different datasets based on training stages, as in the config above. Either way, launch training with:

```shell
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml
```