huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

README typo: specifies a .sh as a config #166

Closed staghado closed 4 months ago

staghado commented 4 months ago

Building from main with pip install -e ".[dev]" and trying to run the tiny llama training example fails.

Repro code:

python examples/config_tiny_llama.py 
torchrun --nproc_per_node=8 run_train.py --config-file examples/train_tiny_llama.sh

Output:

W0508 21:37:59.519000 23456244455232 torch/distributed/run.py:757] 
W0508 21:37:59.519000 23456244455232 torch/distributed/run.py:757] *****************************************
W0508 21:37:59.519000 23456244455232 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0508 21:37:59.519000 23456244455232 torch/distributed/run.py:757] *****************************************
Traceback (most recent call last):
  File "/home/said/nanotron/run_train.py", line 201, in <module>
    trainer = DistributedTrainer(config_file)
  File "/home/said/nanotron/src/nanotron/trainer.py", line 133, in __init__
    self.config = get_config_from_file(
  File "/home/said/nanotron/src/nanotron/config/config.py", line 444, in get_config_from_file
    config_dict = yaml.load(f, Loader=SafeLoader)
  File "/home/said/miniconda3/envs/nanotron/lib/python3.10/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/home/said/miniconda3/envs/nanotron/lib/python3.10/site-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/home/said/miniconda3/envs/nanotron/lib/python3.10/site-packages/yaml/composer.py", line 39, in get_single_node
    if not self.check_event(StreamEndEvent):
  File "/home/said/miniconda3/envs/nanotron/lib/python3.10/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/home/said/miniconda3/envs/nanotron/lib/python3.10/site-packages/yaml/parser.py", line 171, in parse_document_start
    raise ParserError(None, None,
yaml.parser.ParserError: expected '<document start>', but found '<scalar>'
  in "examples/train_tiny_llama.sh", line 9, column 1

The same failure occurs with the Mamba example.

staghado commented 4 months ago

The confusion stems from the README, which passes a .sh file to run_train.py; the script expects a YAML config file instead.

From the README:

torchrun --nproc_per_node=8 run_train.py --config-file examples/train_tiny_llama.sh

It should be:

torchrun --nproc_per_node=8 run_train.py --config-file examples/train_tiny_llama.yaml
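
A cheap way to make this failure mode friendlier would be for the config loader to check the extension before handing the file to PyYAML. A hypothetical sketch (load_config_dict is a made-up name for illustration, not nanotron's get_config_from_file):

import yaml
from pathlib import Path

def load_config_dict(config_path: str) -> dict:
    # Hypothetical fail-fast guard: reject obviously non-YAML paths with a
    # readable message instead of a ParserError deep inside PyYAML.
    if Path(config_path).suffix not in (".yaml", ".yml"):
        raise ValueError(
            f"Expected a YAML config file, got '{config_path}'. "
            "Did you pass the launcher .sh instead of the generated .yaml?"
        )
    with open(config_path) as f:
        return yaml.load(f, Loader=yaml.SafeLoader)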