mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Finetuning does not work on nightly #1221

Closed: eldarkurtic closed this issue 4 months ago

eldarkurtic commented 4 months ago

Hi,

I think something is going wrong with the finetuning flow in the nightly version of llm-foundry. Trying to reproduce a finetuning run from any of the examples available in the repo (e.g. https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/7b_dolly_sft.yaml) fails with:

2024-05-17 15:37:23,686: rank1[3182501][MainThread]: INFO: __main__: Building train loader...
2024-05-17 15:37:23,686: rank1[3182501][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Waiting for local_rank 0 to finish data prep

Tokenizing dataset (num_proc=216):   0%|          | 0/24926 [00:00<?, ? examples/s]Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/multiprocess/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/multiprocess/queues.py", line 371, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/dill/_dill.py", line 327, in loads
    return load(file, ignore, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/dill/_dill.py", line 313, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/dill/_dill.py", line 525, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/dill/_dill.py", line 659, in _create_code
    if len(args) == 16: return CodeType(*args)
                               ^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ekurtic/eldar-upstream/upstream-llm-foundry/scripts/train/train.py:653 in <module>         │
│                                                                                                  │
│   650 │   cfg = om.merge(yaml_cfg, cli_cfg)                                                      │
│   651 │   om.resolve(cfg)                                                                        │
│   652 │   assert isinstance(cfg, DictConfig)                                                     │
│ ❱ 653 │   main(cfg)                                                                              │
│   654                                                                                            │
│                                                                                                  │
│ /home/ekurtic/eldar-upstream/upstream-llm-foundry/scripts/train/train.py:518 in main             │
│                                                                                                  │
│   515 │                                                                                          │
│   516 │   # Dataloaders                                                                          │
│   517 │   log.info('Building train loader...')                                                   │
│ ❱ 518 │   train_loader = build_dataloader(                                                       │
│   519 │   │   train_loader_config,                                                               │
│   520 │   │   tokenizer,                                                                         │
│   521 │   │   device_train_batch_size,                                                           │
│                                                                                                  │
│ /home/ekurtic/eldar-upstream/upstream-llm-foundry/llmfoundry/data/dataloader.py:36 in            │
│ build_dataloader                                                                                 │
│                                                                                                  │
│   33 │   │   raise ValueError(f'Expected dataloader name to be one of {allowed}' +               │
│   34 │   │   │   │   │   │    f' but found name "{cfg.name}" in config: {cfg}')                  │
│   35 │                                                                                           │
│ ❱ 36 │   return LOADER_NAME_TO_FUNCTION[cfg.name](cfg, tokenizer, device_batch_size)             │
│   37                                                                                             │
│                                                                                                  │
│ /home/ekurtic/eldar-upstream/upstream-llm-foundry/llmfoundry/data/finetuning/dataloader.py:200   │
│ in build_finetuning_dataloader                                                                   │
│                                                                                                  │
│   197 │   │   │   │   proto_preprocessing_fn, dataset_name_or_path)                              │
│   198 │   │                                                                                      │
│   199 │   │   # Build dataset from HF.                                                           │
│ ❱ 200 │   │   dataset = dataset_constructor.build_from_hf(                                       │
│   201 │   │   │   dataset_name=dataset_name_or_path,                                             │
│   202 │   │   │   split=split,                                                                   │
│   203 │   │   │   safe_load=cfg.dataset.get('safe_load', False),                                 │
│                                                                                                  │
│ /home/ekurtic/eldar-upstream/upstream-llm-foundry/llmfoundry/data/finetuning/tasks.py:833 in     │
│ build_from_hf                                                                                    │
│                                                                                                  │
│   830 │   │                                                                                      │
│   831 │   │   if error is not None:                                                              │
│   832 │   │   │   log.error('Error during data prep')                                            │
│ ❱ 833 │   │   │   raise error                                                                    │
│   834 │   │   log.debug('All ranks finished data prep')                                          │
│   835 │   │                                                                                      │
│   836 │   │   hf_tokenization_logger.removeFilter(sequence_length_warning_filter)                │
│                                                                                                  │
│ /home/ekurtic/eldar-upstream/upstream-llm-foundry/llmfoundry/data/finetuning/tasks.py:794 in     │
│ build_from_hf                                                                                    │
│                                                                                                  │
│   791 │   │   │   num_cpus_to_use = max(1, detected_cpus_with_margin)                            │
│   792 │   │   │                                                                                  │
│   793 │   │   │   columns_to_remove = list(dataset[0].keys())                                    │
│ ❱ 794 │   │   │   tokenized_dataset = dataset.map(                                               │
│   795 │   │   │   │   dataset_mapper,                                                            │
│   796 │   │   │   │   batched=False,                                                             │
│   797 │   │   │   │   remove_columns=columns_to_remove,                                          │
│                                                                                                  │
│ /home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/datasets/arrow_dataset │
│ .py:592 in wrapper                                                                               │
│                                                                                                  │
│    589 │   │   else:                                                                             │
│    590 │   │   │   self: "Dataset" = kwargs.pop("self")                                          │
│    591 │   │   # apply actual function                                                           │
│ ❱  592 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                │
│    593 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou  │
│    594 │   │   for dataset in datasets:                                                          │
│    595 │   │   │   # Remove task templates if a column mapping of the template is no longer val  │
│                                                                                                  │
│ /home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/datasets/arrow_dataset │
│ .py:557 in wrapper                                                                               │
│                                                                                                  │
│    554 │   │   │   "output_all_columns": self._output_all_columns,                               │
│    555 │   │   }                                                                                 │
│    556 │   │   # apply actual function                                                           │
│ ❱  557 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                │
│    558 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou  │
│    559 │   │   # re-apply format to the output                                                   │
│    560 │   │   for dataset in datasets:                                                          │
│                                                                                                  │
│ /home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/datasets/arrow_dataset │
│ .py:3185 in map                                                                                  │
│                                                                                                  │
│   3182 │   │   │   │   │   │   total=pbar_total,                                                 │
│   3183 │   │   │   │   │   │   desc=(desc or "Map") + f" (num_proc={num_proc})",                 │
│   3184 │   │   │   │   │   ) as pbar:                                                            │
│ ❱ 3185 │   │   │   │   │   │   for rank, done, content in iflatmap_unordered(                    │
│   3186 │   │   │   │   │   │   │   pool, Dataset._map_single, kwargs_iterable=kwargs_per_job     │
│   3187 │   │   │   │   │   │   ):                                                                │
│   3188 │   │   │   │   │   │   │   if done:                                                      │
│                                                                                                  │
│ /home/ekurtic/miniconda3/envs/eldar-upstream/lib/python3.11/site-packages/datasets/utils/py_util │
│ s.py:647 in iflatmap_unordered                                                                   │
│                                                                                                  │
│   644 │   │   │   │   if _get_pool_pid(pool) != initial_pool_pid:                                │
│   645 │   │   │   │   │   pool_changed = True                                                    │
│   646 │   │   │   │   │   # One of the subprocesses has died. We should not wait forever.        │
│ ❱ 647 │   │   │   │   │   raise RuntimeError(                                                    │
│   648 │   │   │   │   │   │   "One of the subprocesses has abruptly died during map operation.   │
│   649 │   │   │   │   │   │   "To debug the error, disable multiprocessing."                     │
│   650 │   │   │   │   │   )                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
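
For what it's worth, the error message itself suggests disabling multiprocessing, and the underlying failure can be isolated outside llm-foundry with a plain datasets.map call. A rough sketch is below; the dataset matches the linked yaml, while the tokenizer is just a generic stand-in, so this is not necessarily the exact failing setup:

from datasets import load_dataset
from transformers import AutoTokenizer

# Stand-ins: the dataset follows the linked 7b_dolly_sft.yaml, the tokenizer
# is a generic substitute.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
dataset = load_dataset('mosaicml/dolly_hhrlhf', split='train')

def tokenize(example):
    # Roughly what the finetuning data prep does per sample: tokenize the
    # concatenated prompt and response.
    return tokenizer(example['prompt'] + example['response'])

# num_proc=None keeps the map in the main process, so the real exception
# surfaces instead of the generic "subprocess has abruptly died" RuntimeError.
dataset.map(tokenize, batched=False, num_proc=None)

# num_proc > 1 sends the function through multiprocess/dill, which is where
# the "code() argument 13 must be str, not int" unpickling error shows up.
dataset.map(tokenize, batched=False, num_proc=8)

If the single-process map succeeds while the multi-process one dies the same way, that would point at the dill/multiprocess unpickling step rather than llm-foundry itself.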

In case it helps: finetuning works just fine if I manually convert the finetuning dataset into StreamingDataset format and then load it that way (roughly sketched below). But doing that for every new dataset is a bit inconvenient; pulling datasets straight from the HF Hub and tokenizing on the fly was a super useful feature of llm-foundry.
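
The manual conversion I mean is roughly the following, a minimal sketch using mosaicml-streaming's MDSWriter. The column names assume a prompt/response dataset like dolly_hhrlhf, and the output path is just an example:

from datasets import load_dataset
from streaming import MDSWriter

dataset = load_dataset('mosaicml/dolly_hhrlhf', split='train')

# Write the samples out as MDS shards that the streaming finetuning
# dataloader can read locally instead of tokenizing from the HF Hub.
columns = {'prompt': 'str', 'response': 'str'}
with MDSWriter(out='/tmp/dolly_hhrlhf/train', columns=columns) as writer:
    for sample in dataset:
        writer.write({'prompt': sample['prompt'], 'response': sample['response']})

I believe scripts/data_prep/convert_finetuning_dataset.py in the repo does a more complete version of this, and the finetuning dataset config can then point at the written directory via its streaming options, though the exact keys may differ by version.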

dakinggg commented 4 months ago

I am not able to reproduce this. The yaml you linked (and other finetuning runs) work fine for me. Could you please provide more information?
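
For example, the versions of the packages that appear in the traceback would be useful; something along these lines should collect them (the package names are the assumed pip distribution names):

import sys
from importlib.metadata import version

# Versions of the pieces that show up in the stack trace above.
print('python', sys.version)
for pkg in ('llm-foundry', 'datasets', 'dill', 'multiprocess'):
    print(pkg, version(pkg))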

dakinggg commented 4 months ago

Closing due to inactivity. We are regularly finetuning models without issue, but please let us know if this persists!