huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0
1.14k stars 107 forks

[Feature] Add loading different datasets based on training stages #80

Closed xrsrke closed 6 months ago

xrsrke commented 7 months ago

Reproduce

Use a single dataset for the entire training run

data:
  dataset:
    dataset_overwrite_cache: false
    dataset_processing_num_proc_per_process: 1
    hf_dataset_config_name: null
    hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
    hf_dataset_splits: train
    text_column_name: completion

Use different datasets based on training stages

# NOTE: use `dataset_stages` if you want different datasets for different stages of training
data:
  dataset_stages:
    - name: Stable Training Stage
      training_steps: 1
      dataset:
        dataset_overwrite_cache: false
        dataset_processing_num_proc_per_process: 1
        hf_dataset_config_name: null
        hf_dataset_or_datasets: HuggingFaceH4/testing_alpaca_small
        hf_dataset_splits: train
        text_column_name: completion
    - name: Annealing Phase
      training_steps: 10
      dataset:
        dataset_overwrite_cache: false
        dataset_processing_num_proc_per_process: 1
        hf_dataset_config_name: null
        hf_dataset_or_datasets: stas/c4-en-10k
        hf_dataset_splits: train
        text_column_name: text
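
To make the intended behaviour concrete, here is a minimal Python sketch of how a trainer could pick the active stage for a given step. It is not nanotron's implementation: `DATASET_STAGES` and `stage_for_step` are hypothetical names, and the sketch assumes that `training_steps` is the number of steps a stage lasts and that stages run in the order they are listed (the config above does not say whether it means a duration or a starting step).

```python
from typing import Dict, List, Optional

# Mirrors the `dataset_stages` list from the config above, reduced to the
# fields this sketch needs. Hypothetical in-memory form, not nanotron's types.
DATASET_STAGES: List[Dict] = [
    {
        "name": "Stable Training Stage",
        "training_steps": 1,
        "dataset": {
            "hf_dataset_or_datasets": "HuggingFaceH4/testing_alpaca_small",
            "text_column_name": "completion",
        },
    },
    {
        "name": "Annealing Phase",
        "training_steps": 10,
        "dataset": {
            "hf_dataset_or_datasets": "stas/c4-en-10k",
            "text_column_name": "text",
        },
    },
]


def stage_for_step(stages: List[Dict], step: int) -> Optional[Dict]:
    """Return the stage active at 1-indexed `step`, or None past the last stage.

    ASSUMPTION: `training_steps` is a duration (number of steps), and stages
    are consumed in list order.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    last_step = 0  # last step covered by the stages seen so far
    for stage in stages:
        last_step += stage["training_steps"]
        if step <= last_step:
            return stage
    return None


if __name__ == "__main__":
    for step in (1, 2, 11):
        stage = stage_for_step(DATASET_STAGES, step)
        print(step, stage["name"], stage["dataset"]["hf_dataset_or_datasets"])
```

Under that assumption, step 1 would read from `HuggingFaceH4/testing_alpaca_small` and steps 2 through 11 from `stas/c4-en-10k`. If `training_steps` were instead the step at which a stage starts, the lookup would compare against the start values directly rather than a running total.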