mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Add curriculum learning callback #1256

Closed b-chu closed 1 week ago

b-chu commented 1 month ago

Curriculum learning callback

Requirements

Manual tests

Matches old callback behavior

[screenshot of loss curves]

Resumes correctly in the middle of the schedule

[screenshot of loss curves]

Resumes correctly when new datamix added to schedule

[screenshot of loss curves]

Resumes correctly when callback added after initial training run

[screenshot of loss curves]
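The resumption tests above hinge on one piece of bookkeeping: given the schedule's per-entry token durations and the token count restored from a checkpoint, the callback must work out which datamix is active and how far into it training is. A minimal standalone sketch of that lookup, assuming token-denominated durations (the function name and shape are illustrative, not the actual llm-foundry implementation):

```python
# Hypothetical sketch: locate the active schedule entry on resumption.
# `durations_tok` is the list of per-entry durations in tokens;
# `tokens_seen` is the global token count restored from the checkpoint.

def find_schedule_position(durations_tok, tokens_seen):
    """Return (entry_index, tokens_into_entry) for the schedule entry
    that covers `tokens_seen` tokens of total progress."""
    consumed = 0
    for i, dur in enumerate(durations_tok):
        if tokens_seen < consumed + dur:
            return i, tokens_seen - consumed
        consumed += dur
    # Past the end of the schedule: stay on the last entry.
    return len(durations_tok) - 1, tokens_seen - (consumed - durations_tok[-1])

# Three 5M-token datamixes, resuming after 7M tokens:
# we are 2M tokens into the second entry.
print(find_schedule_position([5_000_000] * 3, 7_000_000))  # (1, 2000000)
```

With this position in hand, the callback can rebuild the matching train_loader and fast-forward it, which is what the "resumes correctly in the middle of the schedule" test exercises.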

API

Old API:

train_loader:
  <some params>
callbacks:
  curriculum_learning:
    dataset_index: 0

Start a new run

train_loader:
  <some params>
callbacks:
  curriculum_learning:
    dataset_index: 1

Start a new run

train_loader:
  <some params>
callbacks:
  curriculum_learning:
    dataset_index: 2

New API:

train_loader:
  <dataloader parameters>
callbacks:
  curriculum_learning:
  - duration: <number>tok
    train_loader:  # matches top level train_loader
      <dataloader parameters>
  - duration: <number>tok
    train_loader:
      <dataloader parameters>
  - duration: <number>tok
    train_loader:
      <dataloader parameters>
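Each `duration` in the schedule above is a token-denominated timestring like `5000000tok`. In practice llm-foundry parses these with Composer's `Time` type; the standalone regex parser below is only an illustrative sketch of the `<number>tok` format, not the real implementation:

```python
import re

# Hypothetical parser for the `<number>tok` duration strings shown in the
# schedule config. Illustrative only; the actual code relies on Composer's
# Time abstraction, which also supports other units.
def parse_token_duration(s: str) -> int:
    m = re.fullmatch(r'(\d+)tok', s.strip())
    if m is None:
        raise ValueError(f'expected "<number>tok", got {s!r}')
    return int(m.group(1))

print(parse_token_duration('5000000tok'))  # 5000000
```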
snarayan21 commented 3 weeks ago

@b-chu about the new API, a couple of questions:

train_loader:
  <some params>
callbacks:
  curriculum_learning:
    duration: 5000000tok
    schedule:
    - duration: 5000000tok
      train_loader:
        <some params>
    - duration: 5000000tok
      train_loader:
        <some params>
1. So I still have to specify train_loader as a top-level entry?
2. The first duration specified is for the top-level train_loader?
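Under the semantics asked about here, the top-level train_loader plus `curriculum_learning.duration` act as the implicit first schedule entry, followed by the explicit `schedule` entries. A small sketch of that normalization over plain dicts (function name and example datamix names are made up for illustration):

```python
# Hypothetical normalization of the proposed config: fold the top-level
# train_loader and curriculum_learning.duration into the first entry of
# the full schedule. Not the actual llm-foundry code.

def build_full_schedule(config):
    cl = config['callbacks']['curriculum_learning']
    first = {'duration': cl['duration'],
             'train_loader': config['train_loader']}
    return [first] + cl['schedule']

cfg = {
    'train_loader': {'dataset': 'mix_a'},  # illustrative datamix name
    'callbacks': {'curriculum_learning': {
        'duration': '5000000tok',
        'schedule': [
            {'duration': '5000000tok', 'train_loader': {'dataset': 'mix_b'}},
            {'duration': '5000000tok', 'train_loader': {'dataset': 'mix_c'}},
        ],
    }},
}
print(len(build_full_schedule(cfg)))  # 3
```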
snarayan21 commented 3 weeks ago

Also, I'm worried about the loss curves in the plots you shared; they don't look fully deterministic to me. What model size and batch size were you running at, and with which datasets? Longer training runs with a bigger model and a small batch size, without shuffling, would help determine whether the loss curves are actually deterministic. Looking only at the first few steps, most training runs will look pretty similar regardless of data ordering.

b-chu commented 2 weeks ago

Yes, this needs a Composer release; I'll rerun CI/CD after that release and before merging. Yes, train_loader is still specified, and curriculum_learning.duration is its duration. We discussed offline with the data team, and they'll try the callback later when doing a longer training run. I think there are slight discrepancies in RNG when running interactively, but compared against a run with no CL callback, the new callback matches the loss exactly while the old callback differs slightly. Also, when comparing two different datasets/splits, the loss difference is much greater than in the plots above.

[screenshot of loss curves]