mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0
3.99k stars 525 forks

Clearer error message for unknown example type #1202

Closed milocress closed 4 months ago

milocress commented 4 months ago

Manual Tests:

ift-mpt-7b-lrhex4-hsukuh fails with:

[rank0]: RemoteTraceback:
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3/dist-packages/multiprocess/pool.py", line 125, in
[rank0]: worker
[rank0]:     result = (True, func(*args, **kwds))
[rank0]:                     ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3/dist-packages/datasets/utils/py_utils.py", line 678, in
[rank0]: _write_generator_to_queue
[rank0]:     for i, result in enumerate(func(**kwargs)):
[rank0]:   File "/usr/lib/python3/dist-packages/datasets/arrow_dataset.py", line 3517, in
[rank0]: _map_single
[rank0]:     example = apply_function_on_filtered_inputs(example, i, offset=offset)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3/dist-packages/datasets/arrow_dataset.py", line 3416, in
[rank0]: apply_function_on_filtered_inputs
[rank0]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/llm-foundry/llmfoundry/data/finetuning/tasks.py", line 889, in
[rank0]: dataset_mapper
[rank0]:     return tokenize_formatted_example(example, tokenizer)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/llm-foundry/llmfoundry/data/finetuning/tasks.py", line 408, in
[rank0]: tokenize_formatted_example
[rank0]:     example_format = _get_example_type(example)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/llm-foundry/llmfoundry/data/finetuning/tasks.py", line 150, in
[rank0]: _get_example_type
[rank0]:     raise UnknownExampleTypeError(str(example.keys()))
[rank0]: llmfoundry.utils.exceptions.UnknownExampleTypeError: "Found keys
[rank0]: KeysView({'prompt': 'hello, ', 'response': 'world!', 'random_extra_key': 'sup'})
[rank0]: in dataset. Unknown example type. For prompt and response finetuning, the valid
[rank0]: prompt keys are {'prompt'} and the valid response keys are {'completion',
[rank0]: 'response'}. For chat finetuning, the allowed keys are {'messages'}"
[rank0]: """

which is the error message we want.
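
For reference, the key rules spelled out in that message can be sketched roughly as follows. This is a minimal illustration, not the code in tasks.py: classify_example and the constant names are hypothetical, the key sets are copied from the error message, and rejecting examples with extra keys is an assumption based on the manual test above, where random_extra_key triggers the error.

```python
from typing import Any, Mapping

# Key sets copied from the error message; the names themselves are hypothetical.
PROMPT_KEYS = {'prompt'}
RESPONSE_KEYS = {'completion', 'response'}
CHAT_KEYS = {'messages'}


def classify_example(example: Mapping[str, Any]) -> str:
    """Return 'chat' or 'prompt_response'; raise ValueError for anything else."""
    keys = set(example.keys())
    # Chat examples carry exactly one key, and it must be an allowed chat key.
    if len(keys) == 1 and keys <= CHAT_KEYS:
        return 'chat'
    # Prompt/response examples carry exactly one prompt key and one response key.
    if len(keys) == 2 and keys & PROMPT_KEYS and keys & RESPONSE_KEYS:
        return 'prompt_response'
    # Assumption: any other key combination, including extra keys, is rejected.
    raise ValueError(f'Unknown example type, found keys {sorted(keys)}')
```

Under these assumptions, {'prompt': 'hello, ', 'response': 'world!'} is accepted, while the same example with random_extra_key added is rejected.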

Before this change, we had been getting this error instead:

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/multiprocess/pool.py", line 579, in _handle_results
    task = get()
           ^^^^^
  File "/usr/lib/python3/dist-packages/multiprocess/connection.py", line 254, in recv
    return _ForkingPickler.loads(buf.getbuffer())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dill/_dill.py", line 303, in loads
    return load(file, ignore, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dill/_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dill/_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/llmfoundry/utils/exceptions.py", line 85, in __init__
    f'Found keys {example.keys()} in dataset. Unknown example type. For prompt and response '
                  ^^^^^^^^^^^^

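The traceback shows the exception's __init__ being run again while the worker's result is unpickled in the parent process, at which point example is no longer a dict. This PR fixes the problem by checking whether example is a string before calling keys().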
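
A minimal sketch of the idea, assuming the exception can receive either the example mapping or an already formatted message string. The class name UnknownExampleTypeSketchError is hypothetical and the message is shortened; this is not the actual code in llmfoundry/utils/exceptions.py. Rebuilding the exception from __reduce__() mirrors what unpickling does when an exception crosses a process boundary.

```python
from typing import Any, Mapping, Union


class UnknownExampleTypeSketchError(Exception):
    """Hypothetical stand-in for illustration, not llm-foundry's class."""

    def __init__(self, example: Union[Mapping[str, Any], str]) -> None:
        # When an exception is rebuilt from its pickled form, Python calls
        # __init__ with self.args, which holds the message passed to
        # super().__init__() below rather than the original mapping.
        if isinstance(example, str):
            message = example
        else:
            message = (
                f'Found keys {sorted(example.keys())} in dataset. '
                'Unknown example type.'
            )
        super().__init__(message)


original = UnknownExampleTypeSketchError(
    {'prompt': 'hello, ', 'response': 'world!', 'random_extra_key': 'sup'},
)

# BaseException.__reduce__ returns (class, args, ...); calling class(*args)
# is how the unpickler reconstructs the exception.
cls, args = original.__reduce__()[:2]
restored = cls(*args)
print(restored)
```

Without the isinstance check, the reconstruction step would call keys() on a string and fail inside __init__, hiding the original message.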