huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Example code does not work. #121

Closed: codingchild2424 closed this issue 6 months ago

codingchild2424 commented 6 months ago

Thank you for sharing this amazing repo with the community.

When I try to run the examples, I get the error below. How should I set things up to run the tutorial code?

[My script]

torchrun --nproc_per_node=2 run_train.py --config-file examples/config_tiny_llama.yaml

[Error Logs]

Traceback (most recent call last):
  File "/root/nanotron/run_train.py", line 156, in <module>
    trainer = DistributedTrainer(config_file)
  File "/root/nanotron/src/nanotron/trainer.py", line 127, in __init__
    self.config = get_config_from_file(
  File "/root/nanotron/src/nanotron/config/config.py", line 432, in get_config_from_file
    config = get_config_from_dict(
  File "/root/nanotron/src/nanotron/config/config.py", line 393, in get_config_from_dict
    return from_dict(
  File "/usr/local/lib/python3.10/dist-packages/dacite/core.py", line 81, in from_dict
    instance = data_class(**init_values)
  File "<string>", line 14, in __init__
  File "/root/nanotron/src/nanotron/config/config.py", line 336, in __post_init__
    assert all(
AssertionError: You must have a training stage starting at 1 in the config's data_stages
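For context, the assertion is complaining about the data_stages section of the YAML config: it wants at least one stage whose start step is 1. A minimal sketch of that shape, assuming the stage fields (name, start_training_step, data) used by examples/config_tiny_llama.yaml at the time; check your nanotron version's config schema:

```yaml
data_stages:
  - name: Stable Training Stage
    start_training_step: 1   # the check requires a stage starting at step 1
    data:
      dataset: ...            # dataset/seed settings as in the example config
```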

xrsrke commented 6 months ago

Hello! Could you try again with the current main branch? I think we've just fixed it.

xrsrke commented 6 months ago

Ah yes, we have a new fix that isn't merged yet: https://github.com/huggingface/nanotron/pull/120
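If you want to try the fix before it lands, one option is to fetch the PR head locally using GitHub's standard pull/<id>/head refs (a generic GitHub workflow, not specific to nanotron):

```bash
# fetch the unmerged PR #120 into a local branch and switch to it
git fetch origin pull/120/head:pr-120
git checkout pr-120
```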

TJ-Solergibert commented 6 months ago

Check this commit. The assertion can't be `all`: the requirement is that one data stage starts at step 1, not that every stage does.
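In other words, the check in config.py's __post_init__ presumably needs to go from all to any, so that at least one stage (rather than every stage) must start at step 1. A sketch of that change, inferred from the error message and the comment above; the authoritative diff is in the linked commit/PR:

```python
# Buggy check: passes only if *every* stage starts at step 1,
# so a config with a second data stage trips the assertion.
assert all(
    stage.start_training_step == 1 for stage in self.data_stages
), "You must have a training stage starting at 1 in the config's data_stages"

# Fixed check: require at least one stage to start at step 1.
assert any(
    stage.start_training_step == 1 for stage in self.data_stages
), "You must have a training stage starting at 1 in the config's data_stages"
```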

codingchild2424 commented 6 months ago

Thanks! :)