intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Failed to run BigDL-LLM on Multiple ARC770 using DeepSpeed AutoTP. #9628

Open liang1wang opened 8 months ago

liang1wang commented 8 months ago

Case: BigDL/python/llm/example/GPU/Deepspeed-AutoTP
Model: Llama-2-7b-hf
ARC770: 2 cards
Env: RPL RVP, Ubuntu 22.04, kernel 6.4.1, 32 GB memory, oneAPI 23.2.0

Running result:

(llm_multi) intel@ubuntu:~/multi_gpus/BigDL/python/llm/example/GPU/Deepspeed-AutoTP$ export CCL_ATL_TRANSPORT=mpi
(llm_multi) intel@ubuntu:~/multi_gpus/BigDL/python/llm/example/GPU/Deepspeed-AutoTP$ bash run.sh 
...
My guessed rank = 1
My guessed rank = 0
[2023-12-07 14:32:09,814] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
[2023-12-07 14:32:09,814] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.23s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.23s/it]
[2023-12-07 14:32:19,151] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+78c518ed, git-hash=78c518ed, git-branch=HEAD
[2023-12-07 14:32:19,151] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+78c518ed, git-hash=78c518ed, git-branch=HEAD
[2023-12-07 14:32:19,152] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-12-07 14:32:19,152] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-12-07 14:32:19,152] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-12-07 14:32:19,153] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-12-07 14:32:19,153] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-12-07 14:32:19,153] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-12-07 14:32:19,156] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-12-07 14:32:19,156] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-07 14:32:19,156] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-12-07 14:32:19,156] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-07 14:32:19,156] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend ccl
2023-12-07 14:32:19,162 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-12-07 14:32:19,163 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-12-07 14:32:19,163 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-12-07 14:32:19,168 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 1
2023-12-07 14:32:19,172 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-12-07 14:32:19,178 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 0
2023-12-07 14:32:19,178 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
2023-12-07 14:32:19,178 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
AutoTP:  [(<class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>, ['mlp.down_proj', 'self_attn.o_proj'])]
AutoTP:  [(<class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>, ['mlp.down_proj', 'self_attn.o_proj'])]
2023:12:07-14:32:21:( 5571) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2023:12:07-14:32:21:( 5571) |CCL_WARN| sockets exchange mode is set. It may cause potential problem of 'Too many open file descriptors'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5572 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 5571) of binary: /home/intel/anaconda3/envs/llm_multi/bin/python
Traceback (most recent call last):
  File "/home/intel/anaconda3/envs/llm_multi/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
deepspeed_autotp.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-07_14:32:22
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 5571)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 5571
======================================================

result_2a_1.txt

liang1wang commented 8 months ago

Uploaded a log for the single-card case: with only one ARC770 installed, inference completes, but the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed" still appears. Thanks! result_1a.txt

plusbang commented 8 months ago

Hi @liang1wang, as we synced offline, please try using mpirun instead of torchrun; see the example script for reference.
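
For reference, a minimal sketch of what an mpirun-based launch might look like for this example. The environment variables, flags, and script arguments below are illustrative assumptions; the exact values should come from the run.sh shipped with the Deepspeed-AutoTP example.

# Hedged sketch: launching deepspeed_autotp.py with Intel MPI's mpirun instead of torchrun.
# Paths, env vars, and the --repo-id-or-model-path argument are assumptions for illustration.
source /opt/intel/oneapi/setvars.sh    # load oneAPI (Intel MPI / oneCCL) environment
export MASTER_ADDR=127.0.0.1           # assumed single-node setup
export CCL_ATL_TRANSPORT=mpi           # same transport setting used above

NUM_GPUS=2                             # two ARC770 cards
mpirun -np ${NUM_GPUS} \
    python deepspeed_autotp.py --repo-id-or-model-path meta-llama/Llama-2-7b-hf

Launching the two ranks through mpirun lets oneCCL obtain the local rank/count from MPI directly, which is the scenario the CCL_WARN messages in the log hint at when torchrun is used.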