intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Failed to run BigDL-LLM on Multiple ARC770 using DeepSpeed AutoTP. #9628

Open liang1wang opened 8 months ago

liang1wang commented 8 months ago

Case: BigDL/python/llm/example/GPU/Deepspeed-AutoTP
Model: Llama-2-7b-hf
ARC770: 2 cards
Env: RPL RVP, Ubuntu 22.04, kernel 6.4.1, 32 GB memory, oneAPI 23.2.0

Running result:

(llm_multi) intel@ubuntu:~/multi_gpus/BigDL/python/llm/example/GPU/Deepspeed-AutoTP$ export CCL_ATL_TRANSPORT=mpi
(llm_multi) intel@ubuntu:~/multi_gpus/BigDL/python/llm/example/GPU/Deepspeed-AutoTP$ bash run.sh 
...
My guessed rank = 1
My guessed rank = 0
[2023-12-07 14:32:09,814] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
[2023-12-07 14:32:09,814] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.23s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.23s/it]
[2023-12-07 14:32:19,151] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+78c518ed, git-hash=78c518ed, git-branch=HEAD
[2023-12-07 14:32:19,151] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+78c518ed, git-hash=78c518ed, git-branch=HEAD
[2023-12-07 14:32:19,152] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-12-07 14:32:19,152] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-12-07 14:32:19,152] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-12-07 14:32:19,153] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-12-07 14:32:19,153] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-12-07 14:32:19,153] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-12-07 14:32:19,156] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-12-07 14:32:19,156] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-07 14:32:19,156] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-12-07 14:32:19,156] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-07 14:32:19,156] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend ccl
2023-12-07 14:32:19,162 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-12-07 14:32:19,163 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-12-07 14:32:19,163 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-12-07 14:32:19,168 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 1
2023-12-07 14:32:19,172 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-12-07 14:32:19,178 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 0
2023-12-07 14:32:19,178 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
2023-12-07 14:32:19,178 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
AutoTP:  [(<class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>, ['mlp.down_proj', 'self_attn.o_proj'])]
AutoTP:  [(<class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>, ['mlp.down_proj', 'self_attn.o_proj'])]
2023:12:07-14:32:21:( 5571) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2023:12:07-14:32:21:( 5571) |CCL_WARN| sockets exchange mode is set. It may cause potential problem of 'Too many open file descriptors'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5572 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 5571) of binary: /home/intel/anaconda3/envs/llm_multi/bin/python
Traceback (most recent call last):
  File "/home/intel/anaconda3/envs/llm_multi/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/intel/anaconda3/envs/llm_multi/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
deepspeed_autotp.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-07_14:32:22
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 5571)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 5571
======================================================

result_2a_1.txt

liang1wang commented 8 months ago

Uploaded a log for the single-card case: with only one ARC770 installed, inference completes, but the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed" still appears. Thanks! result_1a.txt

plusbang commented 8 months ago

Hi @liang1wang, as we synced offline, please try using mpirun instead of torchrun; see the example script for reference.
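
For reference, a minimal sketch of what an mpirun-based launch might look like for this example. The environment variables, flags, and script arguments below are illustrative assumptions; the exact values should come from the run.sh shipped with the Deepspeed-AutoTP example.

# Hedged sketch: launching deepspeed_autotp.py with Intel MPI's mpirun instead of torchrun.
# Paths, env vars, and the --repo-id-or-model-path argument are assumptions for illustration.
source /opt/intel/oneapi/setvars.sh    # load oneAPI (Intel MPI / oneCCL) environment
export MASTER_ADDR=127.0.0.1           # assumed single-node setup
export CCL_ATL_TRANSPORT=mpi           # same transport setting used above

NUM_GPUS=2                             # two ARC770 cards
mpirun -np ${NUM_GPUS} \
    python deepspeed_autotp.py --repo-id-or-model-path meta-llama/Llama-2-7b-hf

Launching the two ranks through mpirun lets oneCCL obtain the local rank/count from MPI directly, which is the scenario the CCL_WARN messages in the log hint at when torchrun is used.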