biyuehuang opened this issue 1 year ago
It's currently WIP: https://github.com/intel-analytics/BigDL/pull/9230
Hi, I notice the PR has been merged. Can I try it with bigdl 20231025? Do you have an installation guide for BigDL with DeepSpeed?
bigdl-core-xe                 2.4.0b20231026
bigdl-core-xe-esimd           2.4.0b20231026
bigdl-llm                     2.4.0b20231026
intel-extension-for-pytorch   2.0.110+xpu
$ ./run.sh
found intel-openmp in /home/adc-a770/miniconda3/envs/llm-test/lib/libiomp5.so
found oneapi in /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
run.sh: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for setvars.sh arguments:
:: ccl -- latest
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::
+++++ Env Variables +++++
LD_PRELOAD = /home/adc-a770/miniconda3/envs/llm-test/lib/libiomp5.so
OMP_NUM_THREADS = 28
USE_XETLA = OFF
ENABLE_SDP_FUSION = 1
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1
+++++++++++++++++++++++++
Complete.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
[the same torchvision warning is printed by each of the four workers]
My guessed rank = 2
My guessed rank = 1
My guessed rank = 0
My guessed rank = 3
Traceback (most recent call last):
File "/home/adc-a770/llm/bigdl/deepspeed/deepspeed_autotp.py", line 20, in <module>
import deepspeed
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/__init__.py", line 21, in <module>
from . import ops
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 6, in <module>
from . import adam
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/ops/adam/__init__.py", line 6, in <module>
from .cpu_adam import DeepSpeedCPUAdam
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 8, in <module>
from deepspeed.utils import logger
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/utils/__init__.py", line 10, in <module>
from .groups import *
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/utils/groups.py", line 28, in <module>
from deepspeed import comm as dist
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/comm/__init__.py", line 7, in <module>
from .comm import *
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 34, in <module>
from deepspeed.utils import timer, get_caller_func
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/utils/timer.py", line 31, in <module>
class CudaEventTimer(object):
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/utils/timer.py", line 33, in CudaEventTimer
def __init__(self, start_event: get_accelerator().Event, end_event: get_accelerator().Event):
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/accelerator/real_accelerator.py", line 142, in get_accelerator
from .cpu_accelerator import CPU_Accelerator
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/deepspeed/accelerator/cpu_accelerator.py", line 8, in <module>
import oneccl_bindings_for_pytorch # noqa: F401 # type: ignore
ModuleNotFoundError: No module named 'oneccl_bindings_for_pytorch'
[the tracebacks for the remaining three ranks are identical, each ending in ModuleNotFoundError: No module named 'oneccl_bindings_for_pytorch']
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 58011) of binary: /home/adc-a770/miniconda3/envs/llm-test/bin/python
Traceback (most recent call last):
File "/home/adc-a770/miniconda3/envs/llm-test/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
deepspeed_autotp.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-10-27_09:34:30
host : adc-a770-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 58012)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-10-27_09:34:30
host : adc-a770-0
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 58013)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-10-27_09:34:30
host : adc-a770-0
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 58014)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-27_09:34:30
host : adc-a770-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 58011)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi, it seems that you did not install the necessary package. Would you mind checking that the package below is correctly installed?
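The missing module in the traceback, oneccl_bindings_for_pytorch, is provided by the oneccl_bind_pt wheel (the pip list later in this thread shows oneccl-bind-pt 2.0.100+gpu). A typical install for the 2.0.1xx XPU stack looks like the following; the exact version pin is an assumption here and should be matched against your intel-extension-for-pytorch build:
$ pip install oneccl_bind_pt==2.0.100 -f https://developer.intel.com/ipex-whl-stable-xpu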
@yangw1234 Hi Yang, I have installed all the packages from pull/9289, but I still get an error on 2 Arc GPUs.
$ cat run.sh
source bigdl-llm-init -t -g
export MASTER_ADDR=127.0.0.1
export CCL_ZE_IPC_EXCHANGE=sockets
NUM_GPUS=2
# Split the available OpenMP threads evenly across the GPU workers
if [[ -n $OMP_NUM_THREADS ]]; then
  export OMP_NUM_THREADS=$(($OMP_NUM_THREADS / $NUM_GPUS))
else
  export OMP_NUM_THREADS=$(($(nproc) / $NUM_GPUS))
fi
torchrun --standalone \
  --nnodes=1 \
  --nproc-per-node $NUM_GPUS \
  deepspeed_autotp.py --repo-id-or-model-path "/home/adc-a770/data/Llama-2-7b-chat-hf"
Error Log:
$ ./run.sh
found oneapi in /opt/intel/oneapi/setvars.sh
:: WARNING: setvars.sh has already been run. Skipping re-execution.
To force a re-execution of setvars.sh, use the '--force' option.
Using '--force' can result in excessive use of your environment variables.
usage: source setvars.sh [--force] [--config=file] [--help] [...]
--force Force setvars.sh to re-run, doing so may overload environment.
--config=file Customize env vars using a setvars.sh configuration file.
--help Display this help message and exit.
... Additional args are passed to individual env/vars.sh scripts
and should follow this script's arguments.
Some POSIX shells do not accept command-line options. In that case, you can pass
command-line options via the SETVARS_ARGS environment variable. For example:
$ SETVARS_ARGS="ia32 --config=config.txt" ; export SETVARS_ARGS
$ . path/to/setvars.sh
The SETVARS_ARGS environment variable is cleared on exiting setvars.sh.
+++++ Env Variables +++++
LD_PRELOAD =
OMP_NUM_THREADS =
USE_XETLA = OFF
ENABLE_SDP_FUSION = 1
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1
+++++++++++++++++++++++++
Complete.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
My guessed rank = 0
My guessed rank = 1
[2023-10-27 16:16:26,467] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2023-10-27 16:16:26,479] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:00<00:00, 18.90it/s]
[2023-10-27 16:16:27,507] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+78c518ed, git-hash=78c518ed, git-branch=HEAD
[2023-10-27 16:16:27,507] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-10-27 16:16:27,507] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-10-27 16:16:27,508] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /home/adc-a770/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Emitting ninja build file /home/adc-a770/.cache/torch_extensions/py39_cpu/deepspeed_ccl_comm/build.ninja...
Building extension module deepspeed_ccl_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module deepspeed_ccl_comm...
Time to load deepspeed_ccl_comm op: 0.08489537239074707 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:00<00:00, 18.25it/s]
[2023-10-27 16:16:27,721] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+78c518ed, git-hash=78c518ed, git-branch=HEAD
[2023-10-27 16:16:27,721] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-10-27 16:16:27,722] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-10-27 16:16:27,722] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /home/adc-a770/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Emitting ninja build file /home/adc-a770/.cache/torch_extensions/py39_cpu/deepspeed_ccl_comm/build.ninja...
Building extension module deepspeed_ccl_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module deepspeed_ccl_comm...
Time to load deepspeed_ccl_comm op: 0.1180427074432373 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
2023-10-27 16:16:28,598 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-10-27 16:16:28,606 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-27 16:16:28,606 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023:10:27-16:16:28:(126995) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2023:10:27-16:16:28:(126995) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2023-10-27 16:16:28,608 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023:10:27-16:16:28:(126995) |CCL_WARN| sockets exchange mode is set. It may cause potential problem of 'Too many open file descriptors'
2023:10:27-16:16:28:(126996) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2023:10:27-16:16:28:(126996) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2023:10:27-16:16:28:(126996) |CCL_WARN| sockets exchange mode is set. It may cause potential problem of 'Too many open file descriptors'
[2023-10-27 16:16:29,588] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-10-27 16:16:29,588] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-10-27 16:16:29,588] [INFO] [comm.py:637:init_distributed] cdb=<deepspeed.comm.ccl.CCLBackend object at 0x7fd32bafe910>
[2023-10-27 16:16:29,588] [INFO] [comm.py:637:init_distributed] cdb=<deepspeed.comm.ccl.CCLBackend object at 0x7f2bc92bf9d0>
[2023-10-27 16:16:29,589] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
2023-10-27 16:16:29,590 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 0
2023-10-27 16:16:29,590 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 1
2023-10-27 16:16:29,591 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
2023-10-27 16:16:29,591 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
AutoTP: [(<class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>, ['mlp.down_proj', 'self_attn.o_proj'])]
AutoTP: [(<class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>, ['mlp.down_proj', 'self_attn.o_proj'])]
2023-10-27 16:16:30,052 - bigdl.llm.transformers.utils - INFO - Converting the current model to sym_int4 format......
2023-10-27 16:16:30,056 - bigdl.llm.transformers.utils - INFO - Converting the current model to sym_int4 format......
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096, padding_idx=0)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): LowBitLinear(in_features=4096, out_features=2048, bias=False)
(k_proj): LowBitLinear(in_features=4096, out_features=2048, bias=False)
(v_proj): LowBitLinear(in_features=4096, out_features=2048, bias=False)
(o_proj): LowBitLinear(in_features=2048, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): LowBitLinear(in_features=4096, out_features=5504, bias=False)
(up_proj): LowBitLinear(in_features=4096, out_features=5504, bias=False)
(down_proj): LowBitLinear(in_features=5504, out_features=4096, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): LowBitLinear(in_features=4096, out_features=32000, bias=False)
)
/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
warnings.warn(
[the second worker prints an identical model structure]
/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
warnings.warn(
Traceback (most recent call last):
File "/home/adc-a770/llm/bigdl/deepspeed/deepspeed_autotp.py", line 81, in <module>
output = model.generate(input_ids,
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/generation/utils.py", line 1538, in generate
return self.greedy_search(
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
outputs = self(
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
outputs = self.model(
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
layer_outputs = decoder_layer(
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/bigdl/llm/transformers/models/llama.py", line 126, in llama_attention_forward_4_31
query_states = self.q_proj(hidden_states)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/bigdl/llm/transformers/low_bit_linear.py", line 375, in forward
result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, xpu:1 and xpu:0! (when checking argument for argument mat2 in method wrapper_XPU__mm)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126995 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 126996) of binary: /home/adc-a770/miniconda3/envs/bigdl-deepspeed/bin/python
Traceback (most recent call last):
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
deepspeed_autotp.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-27_16:16:39
host : adc-a770-0
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 126996)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 126996
========================================================
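A note on the RuntimeError above ("Expected all tensors to be on the same device ... xpu:1 and xpu:0"): under torchrun, each worker has to pin both its model and its inputs to its own device, otherwise rank 1's activations can end up on xpu:1 while a weight tensor still lives on xpu:0. A minimal sketch of per-rank device pinning, assuming IPEX's torch.xpu API and the LOCAL_RANK variable exported by torchrun (the function name is illustrative, not taken from the BigDL example itself):

import os
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the torch.xpu backend

def pin_to_local_xpu(model, input_ids):
    # torchrun exports LOCAL_RANK for every worker it spawns
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.xpu.set_device(local_rank)  # bind this process to its own Arc GPU
    device = f"xpu:{local_rank}"
    # weights and generation inputs must share one device within a rank
    return model.to(device), input_ids.to(device)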
With bigdl 20231029:
$ pip list
Package Version
----------------------------- ------------------
accelerate 0.21.0
annotated-types 0.6.0
bigdl-core-xe 2.4.0b20231029
bigdl-core-xe-esimd 2.4.0b20231029
bigdl-llm 2.4.0b20231029
certifi 2023.7.22
charset-normalizer 3.3.1
deepspeed 0.11.2+78c518ed
filelock 3.12.4
fsspec 2023.10.0
hjson 3.1.0
huggingface-hub 0.18.0
idna 3.4
intel-extension-for-deepspeed 0.9.4+ec33277
intel-extension-for-pytorch 2.0.110+xpu
Jinja2 3.1.2
MarkupSafe 2.1.3
mpi4py 3.1.5
mpmath 1.3.0
networkx 3.2
ninja 1.11.1.1
numpy 1.26.1
oneccl-bind-pt 2.0.100+gpu
packaging 23.2
Pillow 10.1.0
pip 23.3
protobuf 4.25.0rc2
psutil 5.9.6
py-cpuinfo 9.0.0
pydantic 2.4.2
pydantic_core 2.10.1
PyYAML 6.0.1
regex 2023.10.3
requests 2.31.0
safetensors 0.4.0
sentencepiece 0.1.99
setuptools 68.0.0
sympy 1.12
tabulate 0.9.0
tokenizers 0.13.3
torch 2.0.1a0+cxx11.abi
torchvision 0.15.2a0+cxx11.abi
tqdm 4.66.1
transformers 4.31.0
typing_extensions 4.8.0
urllib3 2.0.7
wheel 0.41.2
$ ./run.sh
found oneapi in /opt/intel/oneapi/setvars.sh
:: WARNING: setvars.sh has already been run. Skipping re-execution.
To force a re-execution of setvars.sh, use the '--force' option.
Using '--force' can result in excessive use of your environment variables.
usage: source setvars.sh [--force] [--config=file] [--help] [...]
--force Force setvars.sh to re-run, doing so may overload environment.
--config=file Customize env vars using a setvars.sh configuration file.
--help Display this help message and exit.
... Additional args are passed to individual env/vars.sh scripts
and should follow this script's arguments.
Some POSIX shells do not accept command-line options. In that case, you can pass
command-line options via the SETVARS_ARGS environment variable. For example:
$ SETVARS_ARGS="ia32 --config=config.txt" ; export SETVARS_ARGS
$ . path/to/setvars.sh
The SETVARS_ARGS environment variable is cleared on exiting setvars.sh.
+++++ Env Variables +++++
LD_PRELOAD =
OMP_NUM_THREADS =
USE_XETLA = OFF
ENABLE_SDP_FUSION = 1
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1
+++++++++++++++++++++++++
Complete.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
My guessed rank = 1
My guessed rank = 0
[2023-10-30 17:05:15,386] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
[2023-10-30 17:05:15,567] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:00<00:00, 17.35it/s]
[2023-10-30 17:05:16,422] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+78c518ed, git-hash=78c518ed, git-branch=HEAD
[2023-10-30 17:05:16,423] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-10-30 17:05:16,423] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-10-30 17:05:16,423] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-10-30 17:05:16,425] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-10-30 17:05:16,425] [INFO] [comm.py:637:init_distributed] cdb=None
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:00<00:00, 20.19it/s]
[2023-10-30 17:05:16,590] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+78c518ed, git-hash=78c518ed, git-branch=HEAD
[2023-10-30 17:05:16,591] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-10-30 17:05:16,591] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-10-30 17:05:16,591] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-10-30 17:05:16,592] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-10-30 17:05:16,592] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-30 17:05:16,592] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend ccl
2023-10-30 17:05:17,427 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-10-30 17:05:17,428 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-30 17:05:17,428 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-10-30 17:05:17,429 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 0
2023-10-30 17:05:17,437 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-10-30 17:05:17,438 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 1
2023-10-30 17:05:17,438 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
2023-10-30 17:05:17,440 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
AutoTP: [(<class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>, ['mlp.down_proj', 'self_attn.o_proj'])]
AutoTP: [(<class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>, ['self_attn.o_proj', 'mlp.down_proj'])]
2023:10:30-17:05:20:(1498874) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2023:10:30-17:05:20:(1498874) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2023:10:30-17:05:20:(1498874) |CCL_WARN| sockets exchange mode is set. It may cause potential problem of 'Too many open file descriptors'
2023:10:30-17:05:20:(1498875) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2023:10:30-17:05:20:(1498875) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2023:10:30-17:05:20:(1498875) |CCL_WARN| sockets exchange mode is set. It may cause potential problem of 'Too many open file descriptors'
2023-10-30 17:05:22,876 - bigdl.llm.transformers.utils - INFO - Converting the current model to sym_int4 format......
2023-10-30 17:05:23,234 - bigdl.llm.transformers.utils - INFO - Converting the current model to sym_int4 format......
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096, padding_idx=0)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): LowBitLinear(in_features=4096, out_features=2048, bias=False)
(k_proj): LowBitLinear(in_features=4096, out_features=2048, bias=False)
(v_proj): LowBitLinear(in_features=4096, out_features=2048, bias=False)
(o_proj): LowBitLinear(in_features=2048, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): LowBitLinear(in_features=4096, out_features=5504, bias=False)
(up_proj): LowBitLinear(in_features=4096, out_features=5504, bias=False)
(down_proj): LowBitLinear(in_features=5504, out_features=4096, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): LowBitLinear(in_features=4096, out_features=32000, bias=False)
)
/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
warnings.warn(
[the second worker prints an identical model structure]
/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
warnings.warn(
adc-a770-0:pid1498874.python: Reading from remote process' memory failed. Disabling CMA support
adc-a770-0:pid1498875.python: Reading from remote process' memory failed. Disabling CMA support
adc-a770-0:pid1498874: Assertion failure at psm3/ptl_am/ptl.c:195: nbytes == req->req_data.recv_msglen
adc-a770-0:pid1498875: Assertion failure at psm3/ptl_am/ptl.c:195: nbytes == req->req_data.recv_msglen
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 1498874) of binary: /home/adc-a770/miniconda3/envs/bigdl-deepspeed/bin/python
Traceback (most recent call last):
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/adc-a770/miniconda3/envs/bigdl-deepspeed/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
deepspeed_autotp.py FAILED
--------------------------------------------------------
Failures:
[1]:
time : 2023-10-30_17:05:33
host : adc-a770-0
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 1498875)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1498875
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-30_17:05:33
host : adc-a770-0
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 1498874)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1498874
========================================================
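The "Assertion failure at psm3/ptl_am/ptl.c" above suggests oneCCL's OFI transport picked the libfabric PSM3 provider for intra-node traffic. As a diagnostic (an assumption on my part, not something confirmed in this thread), forcing a plain TCP provider before launching sometimes sidesteps the PSM3 path:
$ export FI_PROVIDER=tcp   # steer libfabric/oneCCL away from the psm3 provider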
I could not reproduce the issue. Here is my configuration:
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
bigdl-core-xe 2.4.0b20231030
bigdl-core-xe-esimd 2.4.0b20231030
bigdl-llm 2.4.0b20231030
The same error occurs with both bigdl 20231029 and 20231030: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 1758856) of binary: /home/adc-a770/miniconda3/envs/bigdl-deepspeed/bin/python
Hello, can bigdl-llm support distributed inference with DeepSpeed on 2 Arc dGPUs?
$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 5420+ 3.0 [2023.16.6.0.22_223734]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.26.26690.36]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.26.26690.36]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]