huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

AssertionError related to tied parameters during `train_tiny_llama.sh` execution #101

Closed: xffxff closed this issue 8 months ago

xffxff commented 8 months ago

I encountered an error while running the examples/train_tiny_llama.sh script at commit ff3c7746577948743da08c4868aca46cbc0c110b on an 8-GPU node, with no modifications to the code or configuration. The error log is below:

[default5]:    self._mark_tied_parameters(model=model, parallel_context=parallel_context, parallel_config=parallel_config)
[default7]:  File "nanotron/src/nanotron/trainer.py", line 767, in _mark_tied_parameters
[default6]:    self._mark_tied_parameters(model=model, parallel_context=parallel_context, parallel_config=parallel_config)
[default4]:    len(dp_ranks) == 1
[default5]:  File "nanotron/src/nanotron/trainer.py", line 767, in _mark_tied_parameters
[default6]:  File "nanotron/src/nanotron/trainer.py", line 767, in _mark_tied_parameters
[default7]:    mark_tied_parameters(model=model, parallel_context=parallel_context, parallel_config=parallel_config)
[default5]:    mark_tied_parameters(model=model, parallel_context=parallel_context, parallel_config=parallel_config)
[default4]:AssertionError: Tying weights has to happen with a replica of a model. Got the ranks from the following replicas: (0, 1)
[default6]:    mark_tied_parameters(model=model, parallel_context=parallel_context, parallel_config=parallel_config)
[default7]:  File "nanotron/src/nanotron/trainer.py", line 790, in mark_tied_parameters
[default5]:  File "nanotron/src/nanotron/trainer.py", line 790, in mark_tied_parameters
[default5]:    tie_parameters(
[default6]:  File "nanotron/src/nanotron/trainer.py", line 790, in mark_tied_parameters
[default7]:    tie_parameters(
[default6]:    tie_parameters(
[default5]:  File "nanotron/src/nanotron/parallel/tied_parameters.py", line 60, in tie_parameters
[default7]:  File "nanotron/src/nanotron/parallel/tied_parameters.py", line 60, in tie_parameters
[default6]:  File "nanotron/src/nanotron/parallel/tied_parameters.py", line 60, in tie_parameters
[default5]:    len(dp_ranks) == 1
[default7]:    len(dp_ranks) == 1
[default6]:    len(dp_ranks) == 1
[default5]:AssertionError: Tying weights has to happen with a replica of a model. Got the ranks from the following replicas: (0, 1)
[default7]:AssertionError: Tying weights has to happen with a replica of a model. Got the ranks from the following replicas: (0, 1)
[default6]:AssertionError: Tying weights has to happen with a replica of a model. Got the ranks from the following replicas: (0, 1)
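
For context, the assertion that fails here (tied_parameters.py, `len(dp_ranks) == 1`) enforces that every rank sharing a tied parameter (for example, word embeddings tied across pipeline stages) belongs to one and the same data-parallel replica. The sketch below is a minimal, hypothetical illustration of that invariant, not nanotron's actual code; the flat-rank-to-(dp, pp, tp) mapping it assumes is made up for the example, while the real mapping comes from nanotron's parallel context.

# Minimal illustration of the invariant behind the AssertionError above.
# NOTE: this is NOT nanotron's code; the rank layout is an assumption made
# for the example (dp slowest-varying, then pp, then tp).

def check_tied_ranks(world_ranks, dp_size, pp_size, tp_size):
    """Assert that all world ranks sharing a tied parameter live in a single
    data-parallel replica, mirroring the `len(dp_ranks) == 1` check."""

    def to_coords(rank):
        # Hypothetical flat-rank decomposition: dp is the slowest-varying axis.
        tp = rank % tp_size
        pp = (rank // tp_size) % pp_size
        dp = rank // (tp_size * pp_size)
        return dp, pp, tp

    dp_ranks = tuple(sorted({to_coords(r)[0] for r in world_ranks}))
    assert len(dp_ranks) == 1, (
        "Tying weights has to happen with a replica of a model. "
        f"Got the ranks from the following replicas: {dp_ranks}"
    )


# With a dp=2, pp=2, tp=2 split of 8 GPUs: ranks 1 and 3 sit in the same
# replica (dp=0, different pipeline stages), so tying their parameters is fine.
check_tied_ranks([1, 3], dp_size=2, pp_size=2, tp_size=2)

# Ranks 1 and 5 belong to two different replicas (dp=0 and dp=1), which
# reproduces the "... replicas: (0, 1)" message seen in the log above.
# check_tied_ranks([1, 5], dp_size=2, pp_size=2, tp_size=2)  # AssertionError

In other words, the error means the set of ranks handed to tie_parameters spanned both data-parallel replicas instead of staying inside one of them.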
NouamaneTazi commented 8 months ago

Thanks for reporting this! Can you try this fix: https://github.com/huggingface/nanotron/pull/100

xffxff commented 8 months ago

Sorry, I cannot reproduce the issue at the moment. It's possible that a previous configuration error on my part led to the problem. I will close this issue now.