Open molang66 opened 3 days ago
@delock, can you please help? Thanks!
@molang66 Hi, I reran the cmd you pasted in this issue and no such error appeared, so I suspect a version mismatch or an outdated component on your side.
I verified the cmd with the following versions:
Ubuntu 22.04.2 LTS
torch 2.3
intel-extension-for-pytorch 2.3.110
oneccl-bind-pt 2.3.0+gpu (torch/ipex/onecclbindpt wheels can be found at https://pytorch-extension.intel.com/release-whl/stable/xpu/us/)
oneAPI 2024.2.1
GPU Driver 950.13 (rolling stable version)
Can you provide more details about your development environment? Or you can try using my verified versions :)
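For reference (not part of the original comment), a small script along these lines can dump the versions being asked about; the __version__ attribute on the oneCCL bindings module is an assumption, so it is read defensively:

import torch
import intel_extension_for_pytorch as ipex
import oneccl_bindings_for_pytorch as ccl

print("torch:", torch.__version__)
print("intel-extension-for-pytorch:", ipex.__version__)
print("oneccl-bind-pt:", getattr(ccl, "__version__", "unknown"))  # attribute name is an assumption
print("XPU available:", torch.xpu.is_available())
print("XPU device count:", torch.xpu.device_count())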
@delock, can you please help? Thanks!
Hi @tjruwase, @Liangliang-Ma will followup with this issue. Thanks!
Thanks so much for the help. I have updated my CCL version, and now I am encountering this issue:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/work2/09250/molang66/stampede3/transformers/examples/pytorch/language-modeling/run_clm.py", line 657, in <module>
[rank0]:   File "/work2/09250/molang66/stampede3/transformers/examples/pytorch/language-modeling/run_clm.py", line 605, in main
[rank0]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/transformers/trainer.py", line 2141, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/transformers/trainer.py", line 2495, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/transformers/trainer.py", line 3613, in training_step
[rank0]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/transformers/trainer.py", line 3667, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1199, in forward
[rank0]:     outputs = self.model(
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 926, in forward
[rank0]:     position_embeddings = self.rotary_emb(hidden_states, position_ids)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/work2/09250/molang66/stampede3/miniconda3/envs/intel_xpu/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 160, in forward
[rank0]:     freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
[rank0]: RuntimeError: could not create an engine
I was running on the Stampede3 cluster, and my environment is as follows:
OS: CentOS
conda python = 3.9
intel_extension_for_pytorch 2.3.110+xpu
oneccl-bind-pt 2.3.100+xpu
torch 2.3.1+cxx11.abi
oneAPI 2024.2.1
GPU driver:
[level_zero:gpu][level_zero:7] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.27642]
Describe the bug
I'm experiencing an issue when fine-tuning the Llama-2-7b model from Hugging Face with ZeRO optimization enabled. I am running on 8 Intel Max 1550 GPUs using the code from the examples provided in Intel Extension for DeepSpeed.
The model loads and runs successfully without ZeRO optimization, but when I enable ZeRO optimization (particularly stage 3), I encounter the following errors:
[rank0]: RuntimeError: could not create an engine
2024:11:05-02:39:09:(678567) |CCL_INFO| finalizing level-zero
2024:11:05-02:39:09:(678567) |CCL_INFO| finalized level-zero
0%| | 0/50 [00:00<?, ?it/s]
2024:11:05-02:39:09:(678572) |CCL_INFO| finalizing level-zero
2024:11:05-02:39:09:(678566) |CCL_INFO| finalizing level-zero
...
[2024-11-05 02:39:10,447] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 678572
System info
Model: Llama-2-7b from Hugging Face
GPUs: 8x Intel Max 1550 GPUs
Software:
• Intel Extension for PyTorch
• DeepSpeed with ZeRO optimization (stage 3)
• oneCCL for communication backend
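Since the CCL_INFO messages above show oneCCL tearing down level-zero right after the error, a minimal sketch like the following (an assumption-laden check, not from the original report; it presumes launch through the deepspeed or mpirun launcher so RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK are set in the environment) could verify that the ccl backend initializes and an all-reduce on XPU succeeds independently of ZeRO-3:

import os
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)
import torch.distributed as dist

dist.init_process_group(backend="ccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by the deepspeed launcher
torch.xpu.set_device(local_rank)

t = torch.ones(4, device="xpu")
dist.all_reduce(t)  # each element should equal the world size on every rank
print(f"rank {dist.get_rank()}: {t.tolist()}")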
Launcher context
cd transformers
deepspeed --num_gpus=8 examples/pytorch/language-modeling/run_clm.py \
  --deepspeed tests/deepspeed/ds_config_zero3.json \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --dataloader_num_workers 0 \
  --per_device_train_batch_size 1 \
  --warmup_steps 10 \
  --max_steps 50 \
  --bf16 \
  --do_train \
  --output_dir /tmp/test-clm \
  --overwrite_output_dir