jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

Pretraining failing on IndexError: list index out of range in file packed_dataset.py #171

Closed: databillm closed this issue 3 months ago

databillm commented 3 months ago

Hi, thanks for the excellent work you've put together. I am trying to pretrain on my own dataset, which is significantly smaller than SlimPajama or StarCoder. The dataset preparation step works fine and creates the .bin files. I have changed chunk_size: int = 2049 * 2 in prepare_slimpajama.py. Unfortunately, the pretraining setup fails. Here are the command and the error trace. My setup is a single PC with a single RTX 3090.

python pretrain/tinyllama.py --devices 1 --train_data_dir /home/me/my-code/training/ustad/transcripts/4_tokenized --val_data_dir /home/me/my-code/training/ustad/transcripts/4_tokenized

Output

 
Using bfloat16 Automatic Mixed Precision (AMP)
{'model_name': 'ustad', 'name': 'ustad-test1', 'num_of_devices': 1, 'global_batch_size': 48, 'learning_rate': 0.0004, 'micro_batch_size': 8, 'max_step': 1430512, 'warmup_steps': 2000, 'log_step_interval': 10, 'eval_iters': 100, 'save_step_interval': 5000, 'eval_step_interval': 5000, 'weight_decay': 0.1, 'beta1': 0.9, 'beta2': 0.95, 'grad_clip': 1.0, 'decay_lr': True, 'min_lr': 4e-05, 'batch_size': 48, 'gradient_accumulation_steps': 6, 'warmup_iters': 12000, 'max_iters': 8583072, 'lr_decay_iters': 8583072, 'log_iter_interval': 60}
Seed set to 3407
Loading model with {'org': 'bill', 'name': 'ustad', 'block_size': 2048, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 22, 'n_head': 32, 'n_embd': 2048, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 4, 'shared_attention_norm': False, '_norm_class': 'FusedRMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 5632, 'condense_ratio': 1}
Time to instantiate model: 0.12 seconds.
Total parameters 1,100,048,384
Validating ...
Estimated TFLOPs: 138.38
Traceback (most recent call last):
  File "/home/me/my-code/training/ustad/TinyLlama-main/pretrain/tinyllama.py", line 397, in 
    CLI(setup)
  File "/home/me/miniconda3/envs/ustad/lib/python3.9/site-packages/jsonargparse/_cli.py", line 96, in CLI
    return _run_component(components, cfg_init)
  File "/home/me/miniconda3/envs/ustad/lib/python3.9/site-packages/jsonargparse/_cli.py", line 193, in _run_component
    return component(**cfg)
  File "/home/me/my-code/training/ustad/TinyLlama-main/pretrain/tinyllama.py", line 108, in setup
    main(fabric, train_data_dir, val_data_dir, resume)
  File "/home/me/my-code/training/ustad/TinyLlama-main/pretrain/tinyllama.py", line 160, in main
    train(fabric, state, train_dataloader, val_dataloader, monitor, resume)
  File "/home/me/my-code/training/ustad/TinyLlama-main/pretrain/tinyllama.py", line 199, in train
    for  train_data in train_dataloader:
  File "/home/me/miniconda3/envs/ustad/lib/python3.9/site-packages/lightning/fabric/wrappers.py", line 274, in __iter__
    for item in self._dataloader:
  File "/home/me/miniconda3/envs/ustad/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
    return self._get_iterator()
  File "/home/me/miniconda3/envs/ustad/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _SingleProcessDataLoaderIter(self)
  File "/home/me/miniconda3/envs/ustad/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 670, in __init__
    self._dataset_fetcher = _DatasetKind.create_fetcher(
  File "/home/me/miniconda3/envs/ustad/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 79, in create_fetcher
    return _utils.fetch._IterableDatasetFetcher(dataset, auto_collation, collate_fn, drop_last)
  File "/home/me/miniconda3/envs/ustad/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 21, in __init__
    self.dataset_iter = iter(dataset)
  File "/home/me/my-code/training/ustad/TinyLlama-main/lit_gpt/packed_dataset.py", line 228, in __iter__
    return CombinedDatasetIterator(self._datasets, self._seed, self._weights)
  File "/home/me/my-code/training/ustad/TinyLlama-main/lit_gpt/packed_dataset.py", line 233, in __init__
    self._datasets = [iter(el) for el in datasets]
  File "/home/me/my-code/training/ustad/TinyLlama-main/lit_gpt/packed_dataset.py", line 233, in 
    self._datasets = [iter(el) for el in datasets]
  File "/home/me/my-code/training/ustad/TinyLlama-main/lit_gpt/packed_dataset.py", line 52, in __iter__
    return PackedDatasetIterator(
  File "/home/me/my-code/training/ustad/TinyLlama-main/lit_gpt/packed_dataset.py", line 150, in __init__
    self._load_n_chunks()
  File "/home/me/my-code/training/ustad/TinyLlama-main/lit_gpt/packed_dataset.py", line 179, in _load_n_chunks
    filename = self._filenames[self._file_idx + i]
IndexError: list index out of range
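
For context, the failure happens because PackedDataset indexes into a list of chunk filenames that can be empty. A minimal sketch of the relevant logic, simplified and abridged from lit_gpt/packed_dataset.py and pretrain/tinyllama.py (not the repo's exact code):

```python
from pathlib import Path

def filenames_for_prefix(data_dir, prefix):
    # tinyllama.py builds one PackedDataset per (prefix, weight) entry in
    # train_data_config; a glob like this returns [] when no .bin files on
    # disk start with that prefix (e.g. "train_star" when only "train_slim"
    # shards were prepared).
    return sorted(str(p) for p in Path(data_dir).glob(f"{prefix}*"))

class PackedDatasetIteratorSketch:
    def __init__(self, filenames, n_chunks):
        self._filenames = filenames
        self._n_chunks = n_chunks
        self._file_idx = 0
        self._load_n_chunks()

    def _load_n_chunks(self):
        # The real iterator wraps self._file_idx back to 0 when it runs out
        # of files, but with an empty list every index is out of range.
        for i in range(self._n_chunks):
            filename = self._filenames[self._file_idx + i]  # <- IndexError here
            print(f"would mmap {filename}")

# With only "train_slim*" shards on disk:
# PackedDatasetIteratorSketch(filenames_for_prefix(".", "train_star"), n_chunks=1)
# -> IndexError: list index out of range
```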

I do not fully understand the chunk_size, batch_size, max_step, et al. parameters. I've only changed these two:

num_of_devices = 1
global_batch_size = 48
Please help.
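
(Editor's note: the relationships among these parameters can be read off the logged config above. A quick sanity check, assuming the usual lit-gpt packing convention of block_size + 1 tokens per training sample, which is where the 2049 in chunk_size comes from:)

```python
# Sanity-check arithmetic for the config logged above (values from the log).
devices = 1
micro_batch_size = 8
gradient_accumulation_steps = 6
global_batch_size = micro_batch_size * gradient_accumulation_steps * devices
assert global_batch_size == 48  # matches 'global_batch_size': 48 in the log

# chunk_size in the prepare_* scripts should be a multiple of block_size + 1,
# since each packed sample holds block_size inputs plus one shifted target.
block_size = 2048
chunk_size = 2049 * 2  # the reporter's value: 2 samples per chunk
assert chunk_size % (block_size + 1) == 0
```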

databillm commented 3 months ago

Since only one type of data ("slim") was being used, removing the reference to "star" from the train_data_config list in tinyllama.py fixed the issue.
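
For readers hitting the same error: based on this comment, the change is to drop the unused prefix from the data config in pretrain/tinyllama.py. A sketch of what that looks like (the weight values here are illustrative, not the repo's exact numbers):

```python
# pretrain/tinyllama.py (sketch, weights illustrative)
# Before: two data sources, but only "train_slim*" .bin files exist on disk,
# so the "train_star" PackedDataset has zero chunk files and __iter__ fails.
train_data_config = [
    ("train_slim", 0.7),
    ("train_star", 0.3),
]

# After: keep only the prefix that matches the prepared .bin files,
# with its weight renormalized to 1.0.
train_data_config = [
    ("train_slim", 1.0),
]
```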