intel / intel-extension-for-pytorch

A Python package that extends the official PyTorch to easily obtain extra performance on Intel platforms
Apache License 2.0

How to solve "NotImplementedError: Cannot copy out of meta tensor; no data!"? #446

Closed: gukejun1 closed this issue 10 months ago

gukejun1 commented 10 months ago

Describe the issue

When I run Intel® Extension for PyTorch on an AWS r7iz.4xlarge instance, I get the following error:

deepspeed --bind_cores_to_rank run_generation_with_deepspeed.py --benchmark -m EleutherAI/gpt-j-6b  --dtype float32  --ipex --jit  --print-memory
.....
.....
.....
My guessed rank = 0
Warning: Cannot load xpu CCL. CCL doesn't work for XPU device
[2023-10-16 19:18:04,388] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cpu/deepspeed_ccl_comm/build.ninja...
Building extension module deepspeed_ccl_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module deepspeed_ccl_comm...
Time to load deepspeed_ccl_comm op: 0.12543606758117676 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
2023:10:16-19:18:06:(56358) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2023:10:16-19:18:06:(56358) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
[2023-10-16 19:18:07,140] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2023-10-16 19:18:07,140] [INFO] [comm.py:631:init_distributed] cdb=<deepspeed.comm.ccl.CCLBackend object at 0x7fb6bbd1ff70>
[2023-10-16 19:18:07,140] [INFO] [comm.py:656:init_distributed] Distributed backend already initialized
*** Loading the model EleutherAI/gpt-j-6b
[2023-10-16 19:18:07,410] [INFO] [utils.py:803:see_memory_usage] pre-from-pretrained
[2023-10-16 19:18:07,410] [INFO] [utils.py:804:see_memory_usage] MA 0.68 GB         Max_MA 0.68 GB         CA 0.68 GB         Max_CA 1 GB 
[2023-10-16 19:18:07,410] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 3.19 GB, percent = 2.6%
[2023-10-16 19:18:07,527] [INFO] [utils.py:803:see_memory_usage] post-from-pretrained
[2023-10-16 19:18:07,528] [INFO] [utils.py:804:see_memory_usage] MA 0.69 GB         Max_MA 0.69 GB         CA 0.69 GB         Max_CA 1 GB 
[2023-10-16 19:18:07,528] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 3.19 GB, percent = 2.6%
[2023-10-16 19:18:07,648] [INFO] [utils.py:803:see_memory_usage] post-init-ds-zero-init
[2023-10-16 19:18:07,648] [INFO] [utils.py:804:see_memory_usage] MA 0.69 GB         Max_MA 0.69 GB         CA 0.69 GB         Max_CA 1 GB 
[2023-10-16 19:18:07,648] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 3.19 GB, percent = 2.6%
[2023-10-16 19:18:07,765] [INFO] [utils.py:803:see_memory_usage] pre-ds-inference-init
[2023-10-16 19:18:07,765] [INFO] [utils.py:804:see_memory_usage] MA 0.69 GB         Max_MA 0.69 GB         CA 0.69 GB         Max_CA 1 GB 
[2023-10-16 19:18:07,765] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 3.18 GB, percent = 2.6%
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 190650.18it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 43520.66it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 45282.63it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 42100.92it/s]
[2023-10-16 19:18:07,920] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2+f0ef3eaa, git-hash=f0ef3eaa, git-branch=gma/run-opt-branch
[2023-10-16 19:18:07,921] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-10-16 19:18:07,921] [INFO] [logging.py:96:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Traceback (most recent call last):
  File "/home/mm/work_dir/run_generation_with_deepspeed.py", line 265, in <module>
    model = deepspeed.init_inference(
  File "/home/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed-0.10.2+f0ef3eaa-py3.9.egg/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed-0.10.2+f0ef3eaa-py3.9.egg/deepspeed/inference/engine.py", line 154, in __init__
    self.module.to(device)
  File "/home/miniconda3/envs/py39/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1896, in to
    return super().to(*args, **kwargs)
  File "/home/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1164, in to
    return self._apply(convert)
  File "/home/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 814, in _apply
    module._apply(fn)
  File "/home/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 814, in _apply
    module._apply(fn)
  File "/home/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 837, in _apply
    param_applied = fn(param)
  File "/home/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1162, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: spr [Intel(R) Xeon(R) Gold 6455B]
Registry and code: 13 MB
Command: /home/miniconda3/envs/py39/bin/python -u run_generation_with_deepspeed.py --local_rank=0 --benchmark -m EleutherAI/gpt-j-6b --dtype float32 --ipex --jit --print-memory
Uptime: 4.164063 s
[2023-10-16 19:18:08,509] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 56358
[2023-10-16 19:18:08,510] [ERROR] [launch.py:321:sigkill_handler] ['numactl', '-m', '0', '-C', '0-7', '/home/miniconda3/envs/py39/bin/python', '-u', 'run_generation_with_deepspeed.py', '--local_rank=0', '--benchmark', '-m', 'EleutherAI/gpt-j-6b', '--dtype', 'float32', '--ipex', '--jit', '--print-memory'] exits with return code = -11
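
For context, the failure is not specific to this script: a tensor on PyTorch's meta device carries only shape and dtype metadata, with no backing storage, so any attempt to copy it to a real device raises exactly this NotImplementedError. A minimal standalone sketch, assuming PyTorch >= 2.0 and not taken from run_generation_with_deepspeed.py:

import torch

# Parameters created under a "meta" device context have metadata but no data.
with torch.device("meta"):
    linear = torch.nn.Linear(4, 4)

# Moving the module to a real device tries to copy storage that does not
# exist, reproducing the error from the traceback above.
try:
    linear.to("cpu")
except NotImplementedError as e:
    print(e)  # Cannot copy out of meta tensor; no data!

# Meta-initialized modules must instead be materialized with to_empty()
# and then have real weights loaded into them.
linear = linear.to_empty(device="cpu")

In this thread the meta tensors come from DeepSpeed's checkpoint-loading path, so the fix discussed below lies in the script and its configuration rather than in a snippet like this.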

pip environment information:

pip list
Package                     Version
--------------------------- ----------------------
accelerate                  0.23.0
aiohttp                     3.8.6
aiosignal                   1.3.1
async-timeout               4.0.3
attrs                       23.1.0
certifi                     2023.7.22
charset-normalizer          3.3.0
cmake                       3.27.7
contextlib2                 21.6.0
contourpy                   1.1.1
cpuid                       0.0.11
cpuid-native                0.0.8
cycler                      0.12.1
datasets                    2.14.5
deepspeed                   0.10.2+f0ef3eaa
Deprecated                  1.2.14
dill                        0.3.7
filelock                    3.12.4
fonttools                   4.43.1
frozenlist                  1.4.0
fsspec                      2023.6.0
hjson                       3.1.0
huggingface-hub             0.18.0
idna                        3.4
importlib-resources         6.1.0
intel-extension-for-pytorch 2.1.0.dev0+cpu.llm
Jinja2                      3.1.2
joblib                      1.3.2
kiwisolver                  1.4.5
lit                         17.0.2
MarkupSafe                  2.1.3
matplotlib                  3.8.0
mpmath                      1.3.0
multidict                   6.0.4
multiprocess                0.70.15
networkx                    3.1
neural-compressor           2.2
ninja                       1.11.1.1
numpy                       1.26.1
nvidia-cublas-cu11          11.10.3.66
nvidia-cublas-cu12          12.1.3.1
nvidia-cuda-cupti-cu11      11.7.101
nvidia-cuda-cupti-cu12      12.1.105
nvidia-cuda-nvrtc-cu11      11.7.99
nvidia-cuda-nvrtc-cu12      12.1.105
nvidia-cuda-runtime-cu11    11.7.99
nvidia-cuda-runtime-cu12    12.1.105
nvidia-cudnn-cu11           8.5.0.96
nvidia-cudnn-cu12           8.9.2.26
nvidia-cufft-cu11           10.9.0.58
nvidia-cufft-cu12           11.0.2.54
nvidia-curand-cu11          10.2.10.91
nvidia-curand-cu12          10.3.2.106
nvidia-cusolver-cu11        11.4.0.1
nvidia-cusolver-cu12        11.4.5.107
nvidia-cusparse-cu11        11.7.4.91
nvidia-cusparse-cu12        12.1.0.106
nvidia-nccl-cu11            2.14.3
nvidia-nccl-cu12            2.18.1
nvidia-nvjitlink-cu12       12.2.140
nvidia-nvtx-cu11            11.7.91
nvidia-nvtx-cu12            12.1.105
oneccl-bind-pt              2.1.0+cpu
opencv-python               4.8.1.78
opencv-python-headless      4.8.1.78
packaging                   23.2
pandas                      2.1.1
Pillow                      10.1.0
pip                         23.2.1
prettytable                 3.9.0
protobuf                    3.20.3
psutil                      5.9.6
py-cpuinfo                  9.0.0
pyarrow                     13.0.0
pycocotools                 2.0.7
pydantic                    1.10.13
pyparsing                   3.1.1
python-dateutil             2.8.2
pytz                        2023.3.post1
PyYAML                      6.0.1
regex                       2023.10.3
requests                    2.31.0
schema                      0.7.5
scikit-learn                1.3.1
scipy                       1.11.3
sentencepiece               0.1.99
setuptools                  68.0.0
six                         1.16.0
sympy                       1.12
threadpoolctl               3.2.0
tokenizers                  0.13.3
torch                       2.1.0.dev20230711+cpu
torchaudio                  2.1.0.dev20230711+cpu
torchvision                 0.16.0.dev20230711+cpu
tqdm                        4.66.1
transformers                4.28.1
triton                      2.1.0
typing_extensions           4.8.0
tzdata                      2023.3
urllib3                     2.0.6
wcwidth                     0.2.8
wheel                       0.41.2
wrapt                       1.15.0
xxhash                      3.4.1
yarl                        1.9.2
zipp                        3.17.0

Server information:

AWS instance r7iz.4xlarge
4th Generation Intel Xeon Scalable processor
128 GB memory
CPUs only, no GPUs

How can I solve this?

kta-intel commented 10 months ago

Thanks for reporting. I was able to reproduce this on an m7i.16xlarge (256 GB memory) instance. We are looking into it.

gukejun1 commented 10 months ago

Sadly, I switched to an m7i.16xlarge instance and the same error was reported, as follows:


...[logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2+f0ef3eaa, git-hash=f0ef3eaa, git-branch=gma/run-opt-branch
[2023-10-18 09:43:38,982] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-10-18 09:43:38,982] [INFO] [logging.py:96:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Traceback (most recent call last):
  File "/home/hh/workdir/run_generation_with_deepspeed.py", line 265, in <module>
    model = deepspeed.init_inference(
  File "/home/hh/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed-0.10.2+f0ef3eaa-py3.9.egg/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/hh/miniconda3/envs/py39/lib/python3.9/site-packages/deepspeed-0.10.2+f0ef3eaa-py3.9.egg/deepspeed/inference/engine.py", line 154, in __init__
    self.module.to(device)
  File "/home/hh/miniconda3/envs/py39/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1896, in to
    return super().to(*args, **kwargs)
  File "/home/hh/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1164, in to
    return self._apply(convert)
  File "/home/hh/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 814, in _apply
    module._apply(fn)
  File "/home/hh/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 814, in _apply
    module._apply(fn)
  File "/home/hh/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 837, in _apply
    param_applied = fn(param)
  File "/home/hh/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1162, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
[2023-10-18 09:43:40,026] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 9956
[2023-10-18 09:43:40,026] [ERROR] [launch.py:321:sigkill_handler] ['numactl', '-m', '0', '-C', '0-31', '/home/hh/miniconda3/envs/py39/bin/python', '-u', 'run_generation_with_deepspeed.py', '--local_rank=0', '--benchmark', '-m', 'EleutherAI/pythia-70m', '--dtype', 'float32', '--ipex', '--jit', '--print-memory'] exits with return code = 1
jingxu10 commented 10 months ago

Hi @jianan-gu , could you take a look at this issue?

kta-intel commented 10 months ago

@gukejun1 Can you try again with the latest script run_generation_with_deepspeed.py?

It seems an issue with the tensor parallelism size is causing the meta tensor error. We also observed that gpt-j does not support --jit, so it is advisable to remove that flag.

Try: deepspeed --bind_cores_to_rank run_generation_with_deepspeed.py --benchmark -m EleutherAI/gpt-j-6b --dtype float32 --ipex --print-memory
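
For reference, the deprecation warning in the log above (mp_size is deprecated, use tensor_parallel.tp_size) points at the tensor-parallel configuration. A hedged sketch of how deepspeed.init_inference can be configured with the non-deprecated key; model and world_size are placeholder names, not taken from the actual script:

import torch
import deepspeed

# Sketch: pass the tensor-parallel degree via tensor_parallel.tp_size
# instead of the deprecated top-level mp_size argument.
engine = deepspeed.init_inference(
    model,                                    # an already-loaded nn.Module
    tensor_parallel={"tp_size": world_size},  # e.g. the number of ranks
    dtype=torch.float32,
    replace_with_kernel_inject=False,
)
model = engine.module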

jingxu10 commented 10 months ago

Please check the comment above.

gukejun1 commented 10 months ago

Sadly, the same error:

....
[2023-10-25 14:56:02,328] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 26.54 GB, percent = 10.7%
Traceback (most recent call last):
  File "/home/ff/workdir/intel-extension-for-pytorch/examples/cpu/inference/python/llm/distributed/run_generation_with_deepspeed.py", line 362, in <module>
    model = ipex.optimize_transformers(
AttributeError: module 'intel_extension_for_pytorch' has no attribute 'optimize_transformers'
[2023-10-25 14:56:04,128] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 13405
[2023-10-25 14:56:04,128] [ERROR] [launch.py:321:sigkill_handler] ['numactl', '-m', '0', '-C', '0-31', '/home/ff/miniconda3/envs/py39/bin/python', '-u', 'run_generation_with_deepspeed.py', '--local_rank=0', '--benchmark', '-m', 'EleutherAI/gpt-j-6b', '--dtype', 'float32', '--ipex', '--print-memory'] exits with return code = 1

The running directory is that of the latest script run_generation_with_deepspeed.py, i.e. intel-extension-for-pytorch/examples/cpu/inference/python/llm/distributed. In addition, my intel-extension-for-pytorch was installed in the following way: python -m pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/cpu/intel_extension_for_pytorch-2.1.0.dev0%2Bcpu.llm-cp39-cp39-linux_x86_64.whl. How can I obtain the attribute optimize_transformers?

kta-intel commented 10 months ago

You should be okay to go ahead and install the latest released binary v2.1.0 instead of the dev wheel; the LLM optimizations have been released with it: https://intel.github.io/intel-extension-for-pytorch/#installation

My guess is that the dev wheel still only exposes ._optimize_transformers.
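
To illustrate the naming difference (the released 2.1.0 exposes the public optimize_transformers, while the dev wheel only had the private _optimize_transformers), a small hedged compatibility lookup; this is a sketch, not code from the example script:

import intel_extension_for_pytorch as ipex

# Resolve whichever entry point the installed wheel provides: the public
# name from the 2.1.0 release, falling back to the private dev-wheel name.
optimize_transformers = getattr(
    ipex,
    "optimize_transformers",
    getattr(ipex, "_optimize_transformers", None),
)
if optimize_transformers is None:
    raise RuntimeError(
        "installed intel_extension_for_pytorch exposes no "
        "optimize_transformers entry point"
    )

The cleaner fix, as noted above, is simply to install the released 2.1.0 binary.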

gukejun1 commented 10 months ago

It works, thanks!