huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Qwen1.5-14B finetune error #1336

Open · Zjq9409 opened this issue 2 months ago

Zjq9409 commented 2 months ago

System Info

optimum-habana              1.13.2
+-----------------------------------------------------------------------------+
| HL-SMI Version:                                hl-1.17.1-fw-51.5.0          |
| Driver Version:                                     1.17.1-78932ae          |

Reproduction

Download the Qwen1.5-14B weights from: https://huggingface.co/Qwen/Qwen1.5-14B
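
One way to fetch the weights into a local directory is the huggingface_hub CLI. This is an illustrative sketch, not part of the original report; the target path is an assumption and should match the --model_name_or_path passed to the training command below:

# Install the CLI, then download the full checkpoint locally
pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen1.5-14B --local-dir /data/models/Qwen1.5-14B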

cd optimum-habana/examples/language-modeling
python ../gaudi_spawn.py \
    --world_size 8 --use_deepspeed run_clm.py \
    --model_name_or_path /data/models/Qwen1.5-7B-Chat/ \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm-xl-1 \
    --gaudi_config_name ./gaudi_config.json \
    --use_habana \
    --logging_steps 1  \
    --use_lazy_mode \
    --gradient_checkpointing \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --overwrite_output_dir \
    --deepspeed ./llama2_ds_zero3_config.json
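
The command references two JSON config files that are not shown in the report. For context, here is a minimal sketch of typical contents, written as shell heredocs; the keys are standard GaudiConfig and DeepSpeed options, but the exact values are assumptions, and the real files ship with the optimum-habana language-modeling example:

# Illustrative Gaudi config: enable Habana's fused optimizer kernels
cat > gaudi_config.json << 'EOF'
{
  "use_fused_adam": true,
  "use_fused_clip_norm": true
}
EOF

# Illustrative ZeRO stage-3 config: bf16 training with parameter partitioning
cat > llama2_ds_zero3_config.json << 'EOF'
{
  "steps_per_print": 1,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": false,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF

Note that the failure below occurs inside DeepSpeed's ZeRO-3 parameter all-gather (partitioned_param_coordinator), which is only active with "stage": 3.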

Running this command produces the following error log:

[2024-09-17 07:57:31,077] [INFO] [checkpointing.py:542:forward] Activation Checkpointing Information
[2024-09-17 07:57:31,078] [INFO] [checkpointing.py:543:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-17 07:57:31,078] [INFO] [checkpointing.py:544:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-17 07:57:31,078] [INFO] [checkpointing.py:546:forward] ----Synchronization False
[2024-09-17 07:57:31,078] [INFO] [checkpointing.py:547:forward] ----Profiling time in checkpointing False
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/jane/optimum-habana/examples/language-modeling/run_clm.py", line 695, in <module>
[rank3]:     main()
[rank3]:   File "/home/jane/optimum-habana/examples/language-modeling/run_clm.py", line 641, in main
[rank3]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
[rank3]:     return inner_training_loop(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 978, in _inner_training_loop
[rank3]:     tr_loss_step = self.training_step(model, inputs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1575, in training_step
[rank3]:     loss = self.compute_loss(model, inputs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3363, in compute_loss
[rank3]:     outputs = model(**inputs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1544, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1885, in forward
[rank3]:     loss = self.module(*inputs, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 789, in forward
[rank3]:     outputs = self.model(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 677, in forward
[rank3]:     layer_outputs = self._gradient_checkpointing_func(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 692, in hpu_deepspeed_checkpointing
[rank3]:     CheckpointFunction.apply(function, all_outputs, *checkpoint_args)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
[rank3]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 568, in forward
[rank3]:     outputs = run_function(*inputs_cuda)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 464, in forward
[rank3]:     hidden_states, self_attn_weights, present_key_value = self.pre_attn(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 515, in pre_attn
[rank3]:     hidden_states, attn_weights, present_key_value = self.self_attn.pre_attn_forward(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 401, in pre_attn_forward
[rank3]:     attn_output = self.o_proj(attn_output)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1574, in _call_impl
[rank3]:     args_result = hook(self, args)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
[rank3]:     self.pre_sub_module_forward_function(module)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
[rank3]:     param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank3]:     return fn(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 290, in fetch_sub_module
[rank3]:     self.__all_gather_params(params_to_fetch, forward)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 434, in __all_gather_params
[rank3]:     self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 463, in __all_gather_params_
[rank3]:     handle = param_group[0].all_gather_coalesced(param_group, quantize=quantize)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1241, in all_gather_coalesced
[rank3]:     handles = _dist_allgather_fn(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 95, in _dist_allgather_fn
[rank3]:     return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 320, in allgather_fn
[rank3]:     return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
[rank3]:     return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank3]:     return fn(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/torch.py", line 218, in all_gather_into_tensor
[rank3]:     return self.all_gather_function(output_tensor=output_tensor,
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2949, in all_gather_into_tensor
[rank3]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank3]: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure]. 
[The tracebacks for ranks 6, 1, 4, 5, 7, and 0 are identical to the one above, each ending in: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure].]

Expected behavior

Full-parameter fine-tuning of Qwen1.5-14B should run successfully.

regisss commented 4 weeks ago

I can reproduce it, cc @libinta

skaulintel commented 3 weeks ago

@Zjq9409 Have you tried fine-tuning Qwen with the scripts in examples/trl?
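
For reference, a hypothetical invocation of that route (assuming an sft.py script under examples/trl that accepts the same Gaudi trainer flags as run_clm.py; check the examples/trl README for the exact arguments):

cd optimum-habana/examples/trl
python sft.py \
    --model_name_or_path /data/models/Qwen1.5-14B \
    --output_dir /tmp/test-sft-qwen \
    --do_train \
    --use_habana \
    --use_lazy_mode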