huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

PP allocation issue #108

Closed jordane95 closed 7 months ago

jordane95 commented 8 months ago

Running the latest tiny Llama config with a larger vocab size raises the following error:

[default4]:03/16/2024 09:10:16 [INFO|DP=0|PP=1|TP=0]: Local number of parameters: 0 (0.00MiB)
[default4]:03/16/2024 09:10:16 [INFO|DP=0|PP=1|TP=0]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.01MiB Peak reserved: 2.00MiB
[default5]:Traceback (most recent call last):
[default5]:  File "/output/nanotron/run_train.py", line 136, in <module>
[default5]:    trainer = DistributedTrainer(config_file)
[default5]:  File "/output/nanotron/src/nanotron/trainer.py", line 163, in __init__
[default5]:    self.model = self.init_model()  # Defines self.model
[default5]:  File "/output/nanotron/src/nanotron/trainer.py", line 531, in init_model
[default5]:    model = self._init_model_instance()
[default5]:  File "/output/nanotron/src/nanotron/trainer.py", line 541, in _init_model_instance
[default5]:    model = self._init_model(
[default5]:  File "/output/nanotron/src/nanotron/trainer.py", line 665, in _init_model
[default5]:    check_model_has_grad(model=model, parallel_context=parallel_context)
[default5]:  File "/output/nanotron/src/nanotron/models/base.py", line 282, in check_model_has_grad
[default5]:    raise ValueError(
[default5]:ValueError: Can't use DDP because model in PP=1 has no gradient. Consider increasing the number of layers of your model, or put a smaller PP size.
[default5]:Model: LlamaForTraining(
[default5]:  (model): LlamaModel(
[default5]:    (token_position_embeddings): PipelineBlock(pp_rank=0)
[default5]:    (decoder): ModuleList(
[default5]:      (0-1): 2 x PipelineBlock(pp_rank=0)
[default5]:    )
[default5]:    (final_layer_norm): PipelineBlock(pp_rank=0)
[default5]:    (lm_head): PipelineBlock(pp_rank=0)
[default5]:    (cast_to_fp32): PipelineBlock(pp_rank=1)
[default5]:  )
[default5]:  (loss): PipelineBlock(
[default5]:    pp_rank=1
[default5]:    (pp_block): Loss()
[default5]:  )
[default5]:)

Ranks default4, default6, and default7 print the identical traceback and model dump.

Might this be related to incorrect PP allocations?

zirui commented 7 months ago

Same issue here. Any updates?

3outeille commented 7 months ago

Sorry for the silence, there is a lot going on at the moment. Basically, blocks are split across GPUs according to the costs returned by get_block_compute_costs():

    def get_block_compute_costs(self):
        """Computes the compute cost of each block in the model so that we can do a better job of load balancing."""
        model_config = self.config
        d_ff = model_config.intermediate_size
        d_qkv = model_config.hidden_size // model_config.num_attention_heads
        block_compute_costs = {
            # CausalSelfAttention (qkv proj + attn out) + MLP
            LlamaDecoderLayer: 4 * model_config.num_attention_heads * d_qkv * model_config.hidden_size
            + 3 * d_ff * model_config.hidden_size,
            # This is the last lm_head
            TensorParallelColumnLinear: model_config.vocab_size * model_config.hidden_size,
        }
        return block_compute_costs
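
For intuition, here is a toy sketch of what a cost-based contiguous split can do when one block's cost dwarfs the others. split_blocks() and the numbers are made up for illustration; this is not nanotron's actual splitter.

def split_blocks(costs, pp_size):
    """Assign blocks (in order) to PP ranks, moving to the next rank once the current one has roughly its fair share of the total cost."""
    target = sum(costs) / pp_size
    ranks, rank, running = [], 0, 0.0
    for cost in costs:
        if running >= target and rank < pp_size - 1:
            rank += 1
            running = 0.0
        ranks.append(rank)
        running += cost
    return ranks

# blocks = [layer0, layer1, lm_head, cast_to_fp32, loss]
# With tiny hidden dimensions and a large vocab, the lm_head cost
# (vocab_size * hidden_size) dwarfs the two decoder layers:
print(split_blocks([1, 1, 100, 0, 0], pp_size=2))
# -> [0, 0, 0, 1, 1]: every block with parameters lands on PP rank 0,
#    and rank 1 is left with only cast_to_fp32 and the loss, hence the error above.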

Since the tiny Llama config has very small dimensions, this results in mis-allocations: as the model dump in the traceback shows, every block with parameters lands on PP rank 0, while rank 1 only gets the parameter-free cast_to_fp32 and loss blocks. You can increase the dimensions to make it work, e.g.:

model_config = LlamaConfig(
    # Tiny Llama config with the dimensions increased
    bos_token_id=1,
    eos_token_id=2,
    hidden_act="silu",
    hidden_size=1024,
    initializer_range=0.02,
    intermediate_size=1024,
    max_position_embeddings=50277,
    num_attention_heads=4,
    num_hidden_layers=12,
    num_key_value_heads=4,
    pretraining_tp=1,
    rms_norm_eps=1e-05,
    rope_scaling=None,
    tie_word_embeddings=True,
    use_cache=True,
    vocab_size=50277,
)
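
For reference, plugging this config into the cost formula from get_block_compute_costs() above gives roughly the following relative weights (just a back-of-the-envelope check):

# Rough per-block costs for the config above, using the formula shown earlier.
hidden_size = 1024
intermediate_size = 1024
num_attention_heads = 4
vocab_size = 50277

d_ff = intermediate_size
d_qkv = hidden_size // num_attention_heads

decoder_layer_cost = 4 * num_attention_heads * d_qkv * hidden_size + 3 * d_ff * hidden_size
lm_head_cost = vocab_size * hidden_size

print(f"decoder layer: {decoder_layer_cost / 1e6:.1f}M")  # ~7.3M each, ~88M for the 12 layers
print(f"lm_head:       {lm_head_cost / 1e6:.1f}M")        # ~51.5M

With 12 layers the decoder now outweighs the lm_head, so both PP ranks should end up with parameterized blocks.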

If it is for learning purposes, you can manually set the split costs like this:

    def get_block_compute_costs(self):
        """Computes the compute cost of each block in the model so that we can do a better job of load balancing."""
        block_compute_costs = {
            # Give every decoder layer the same unit cost
            LlamaDecoderLayer: 1,
            # This is the last lm_head; count it as free so only the decoder layers drive the split
            TensorParallelColumnLinear: 0,
        }
        return block_compute_costs
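
Feeding those hand-set costs into the toy split sketch from above (again, purely illustrative):

# blocks = [layer0, layer1, lm_head, cast_to_fp32, loss], reusing the toy split_blocks()
# from earlier in this comment (not nanotron's real splitter).
print(split_blocks([1, 1, 0, 0, 0], pp_size=2))
# -> [0, 1, 1, 1, 1]: each PP rank now owns at least one decoder layer,
#    so check_model_has_grad() passes.
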
jordane95 commented 6 months ago

@3outeille Shouldn't we solve this corner case by allocating at least one layer of parameters on each PP rank?
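
Something like the following, for instance (just a sketch; ensure_every_rank_has_params is a hypothetical helper, not an existing nanotron function):

# Hypothetical post-processing of the block-to-rank assignment: make sure every
# PP rank owns at least one block that actually has trainable parameters.
def ensure_every_rank_has_params(block_ranks, block_has_params, pp_size):
    """block_ranks[i]: PP rank of block i; block_has_params[i]: True if block i holds weights."""
    def param_blocks_on(rank):
        return [i for i, (r, p) in enumerate(zip(block_ranks, block_has_params)) if r == rank and p]

    for rank in range(pp_size):
        if param_blocks_on(rank):
            continue  # this rank already owns something trainable
        # Walk backwards so we move the latest parameterized block and keep the split contiguous-ish.
        for i in reversed(range(len(block_ranks))):
            if block_has_params[i] and len(param_blocks_on(block_ranks[i])) > 1:
                block_ranks[i] = rank
                break
    return block_ranks

# The failing split from the logs: params only on rank 0, nothing trainable on rank 1.
print(ensure_every_rank_has_params([0, 0, 0, 1, 1], [True, True, True, False, False], pp_size=2))
# -> [0, 0, 1, 1, 1]: the lm_head moves to PP rank 1, so every rank has a gradient to sync.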