intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance on Intel platforms
Apache License 2.0

XPU Installation Instructions Don't Work #364

Open coreyjadams opened 1 year ago

coreyjadams commented 1 year ago

Describe the bug

When I copy/paste the installation instructions for XPU, they fail:

❯ python -m pip install torch==1.13.0a0+git6c9b55e intel_extension_for_pytorch==1.13.120+xpu -f https://developer.intel.com/ipex-whl-stable-xpu
Looking in links: https://developer.intel.com/ipex-whl-stable-xpu
ERROR: Could not find a version that satisfies the requirement torch==1.13.0a0+git6c9b55e (from versions: 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0a0+git3d5f2d4, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0a0+gitb1dde16, 1.13.0, 1.13.1, 2.0.0, 2.0.1)
ERROR: No matching distribution found for torch==1.13.0a0+git6c9b55e
❯ python -m pip install torch==1.13.0a0 intel_extension_for_pytorch==1.13.120+xpu -f https://developer.intel.com/ipex-whl-stable-xpu
Looking in links: https://developer.intel.com/ipex-whl-stable-xpu
Collecting torch==1.13.0a0
  Using cached https://intel-optimized-pytorch.s3.cn-north-1.amazonaws.com.cn/ipex_stable/xpu/torch-1.13.0a0%2Bgitb1dde16-cp39-cp39-linux_x86_64.whl (140.5 MB)
ERROR: Could not find a version that satisfies the requirement intel_extension_for_pytorch==1.13.120+xpu (from versions: 1.10.100, 1.10.200+gpu, 1.11.0, 1.11.100, 1.11.200, 1.12.0, 1.12.100, 1.12.200, 1.12.300, 1.13.0, 1.13.10+xpu, 1.13.100, 2.0.0, 2.0.100)
ERROR: No matching distribution found for intel_extension_for_pytorch==1.13.120+xpu

As you can see, I can only get torch 1.13.0a0 if I drop the git hash that the instructions state is mandatory. Furthermore, the IPEX wheel doesn't seem to be available either. Can you confirm that the installation instructions work for XPU?

Thanks!

Versions

This is on an early-access system at Argonne National Laboratory.

gujinghui commented 1 year ago

@jingxu10 @tye1 Please confirm ASAP.

DaWe35 commented 1 year ago

Same for me

python -m pip install torch==1.13.0a0+git6c9b55e intel_extension_for_pytorch==1.13.120+xpu -f https://developer.intel.com/ipex-whl-stable-xpu

and

ERROR: Could not find a version that satisfies the requirement intel_extension_for_pytorch==1.13.120+xpu (from versions: 2.0.0, 2.0.100)
ERROR: No matching distribution found for intel_extension_for_pytorch==1.13.120+xpu

Python version: Python 3.11.3

jingxu10 commented 1 year ago

@coreyjadams please run pip install with the --no-cache-dir flag. @DaWe35 there's no Python 3.11 support yet for the 1.13 versions; 3.11 support will come later with 2.0.
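For example, the same command from the instructions with pip's cache disabled:

python -m pip install --no-cache-dir torch==1.13.0a0+git6c9b55e intel_extension_for_pytorch==1.13.120+xpu -f https://developer.intel.com/ipex-whl-stable-xpu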

DaWe35 commented 1 year ago

Thank you @jingxu10

(Off-topic: I'm trying to run https://huggingface.co/tiiuae/falcon-7b-instruct on my Arc A370M laptop, but I see the GPU is 0% utilized. I'm not sure how to do this or where to ask for help. There is sample code on the model page, but I have no idea how to make it run on the GPU instead of the CPU. Thanks)

jingxu10 commented 1 year ago

Could you change the device_map in transformers.pipeline to xpu? By the way, what OS is running on your A370M laptop?

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
DaWe35 commented 1 year ago

@jingxu10 I just installed Arch, so I have the latest of everything. I'll try, thank you

jingxu10 commented 1 year ago

Got it. Please note that Arch is not on our support list. You may want to try it in an Ubuntu 22.04 docker if anything goes wrong, as long as the KMD driver packages work on the host Arch OS.

DaWe35 commented 1 year ago

Is Python 3.10.11 supported? I used pyenv to get 3.10.11. I'm getting OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory when importing torch, and I'm not sure why. I'm also not sure how to run it in Docker, sorry.

jingxu10 commented 1 year ago

3.10 is supported. You need to pip install mkl. If you still get this error, please search for the path of this dynamic library file and add that path to LD_LIBRARY_PATH.
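A minimal sketch of those steps (the library path below is an example and will vary by environment; note that LD_LIBRARY_PATH must be set before the Python process starts, not inside the script):

python -m pip install mkl
find / -name 'libmkl_intel_lp64.so.2' 2>/dev/null   # locate the library
export LD_LIBRARY_PATH=/path/to/that/directory:$LD_LIBRARY_PATH
python my_script.py   # my_script.py is a placeholder for your own script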

DaWe35 commented 1 year ago

I added '/home/user/.pyenv/versions/3.10.11/lib/' to PATH, and I also added os.environ['LD_LIBRARY_PATH']='/home/user/.pyenv/versions/3.10.11/lib/' to the beginning of the script, but I still get the same error. I can confirm the directory contains a file called libmkl_intel_lp64.so.2

DaWe35 commented 1 year ago

This solved the LD_LIBRARY_PATH issue for me: https://stackoverflow.com/questions/480764/linux-error-while-loading-shared-libraries-cannot-open-shared-object-file-no-s
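For reference, the usual fix from that thread is either exporting LD_LIBRARY_PATH before launching Python, or registering the directory with the dynamic loader via ldconfig (a sketch; the conf file name is arbitrary):

echo '/home/user/.pyenv/versions/3.10.11/lib' | sudo tee /etc/ld.so.conf.d/local-mkl.conf
sudo ldconfig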

Now I'm getting this:

ImportError: libmkl_sycl.so.3: cannot open shared object file: No such file or directory
ghost commented 1 year ago

As you can see, I can only get torch 1.13.0a0 if I drop the git hash that the instructions state is mandatory. Furthermore, the IPEX wheel doesn't seem to be available either. Can you confirm that the installation instructions work for XPU?

Yes, the instructions work for Python 3.9 and 3.10 (in my experience; I didn't test other versions).

@jingxu10 @tye1 Please confirm ASAP.

I can confirm this. The problem occurs when an unsupported version of Python is used (e.g. 3.11, just like jingxu10 mentioned). Unfortunately, the error messages are poor, so diagnosis is not very intuitive.

@DaWe35 You should have opened a new ticket for the mkl problem, but it's clear what the problem is:

(If you still have this error even after this, please open a new ticket)

coreyjadams commented 1 year ago

This doesn't really resolve anything for me. The git hash listed in the install instructions is not coming up. In fact, if I follow the link to the extra-index-url (https://www.intel.com/content/dam/develop/external/us/en/documents/ipex/whl-stable-xpu.html) I see these packages:

[screenshot: the list of wheels available on the page]

Neither the torch wheel nor the IPEX wheel is there - is there an updated URL for these wheel files?

I also looked here: https://www.intel.com/content/dam/develop/external/us/en/documents/ipex/whl-stable-xpu-idp.html, likewise with no luck.

ghost commented 1 year ago

Then this is a duplicate of #361

I've put the direct links in a gist; you may try downloading them directly and installing from local files, but I doubt you'll be able to download the missing ones: https://gist.github.com/stacksmash76/756d8f797b38847f2196828f91e5e254

jingxu10 commented 1 year ago

This solved the LD_LIBRARY_PATH issue for me: https://stackoverflow.com/questions/480764/linux-error-while-loading-shared-libraries-cannot-open-shared-object-file-no-s

Now I'm getting this:

ImportError: libmkl_sycl.so.3: cannot open shared object file: No such file or directory

[screenshot of the Execution section] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/getting_started.html#execution

jingxu10 commented 1 year ago

This doesn't really resolve anything for me. The git hash listed in the install instructions is not coming up. In fact, if I follow the link to the extra-index-url (https://www.intel.com/content/dam/develop/external/us/en/documents/ipex/whl-stable-xpu.html) I see these packages:

[screenshot: the list of wheels available on the page]

Neither the torch wheel nor the IPEX wheel is there - is there an updated URL for these wheel files?

I also looked here: https://www.intel.com/content/dam/develop/external/us/en/documents/ipex/whl-stable-xpu-idp.html, likewise with no luck.

It is quite possible that the content you got was an old cache from your ISP. Would you try accessing that webpage from a cellphone to double-check?

DaWe35 commented 1 year ago

This solved the LD_LIBRARY_PATH issue for me: https://stackoverflow.com/questions/480764/linux-error-while-loading-shared-libraries-cannot-open-shared-object-file-no-s Now I'm getting this:

ImportError: libmkl_sycl.so.3: cannot open shared object file: No such file or directory

[screenshot of the Execution section] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/getting_started.html#execution

Hi, I installed conda so we're using the same env. source /opt/intel/oneapi/setvars.sh returns:

bash: /opt/intel/oneapi/setvars.sh: No such file or directory

I've re-run the install command from https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/getting_started.html#execution, which is different from the readme in this repo. It contains an additional torchvision install - why?

After running the second (verify) command, I got the same ImportError, so torchvision changed nothing.

jingxu10 commented 1 year ago

The getting started guide expects oneAPI to be installed in its default location (/opt/intel/oneapi). The installation path of oneAPI could differ from the default. Please find the location in your environment and activate the oneAPI environment from there. If you don't need torchvision, you don't need to install it.
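For instance (a sketch; adjust the prefix to wherever oneAPI actually landed - per-user installs default to ~/intel/oneapi):

source /opt/intel/oneapi/setvars.sh   # system-wide default location
source ~/intel/oneapi/setvars.sh      # per-user default location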

DaWe35 commented 1 year ago

Got it, thank you. I was a bit confused; I thought it was enough to run the install commands, since I wasn't looking at "More installation methods" as the readme calls them. I'll be back after a quick OS reinstall, since my root partition is only 20GB - too small for oneAPI.

jingxu10 commented 1 year ago

Sure. Thanks.

DaWe35 commented 1 year ago

Thank you guys for the help. I'll summarize what I did and what the issue is now.

At this point I no longer get ImportErrors, which is a good sign. My code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
import intel_extension_for_pytorch

model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="xpu",
)
sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The error I get now:

raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model tiiuae/falcon-7b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Some comments advised updating torch, but since you require a specific version, I can't try that. Any ideas?

jingxu10 commented 1 year ago

Could you try the following steps, one by one?

  1. remove import intel_extension_for_pytorch
  2. reinstall pytorch with 1.13.1 from pytorch.org
DaWe35 commented 1 year ago

Done, now I see RuntimeError: PyTorch is not linked with support for xpu devices

jingxu10 commented 1 year ago

Where did you invoke to('xpu') in your script? The idea is to verify whether the model-loading failure occurs in IPEX, in the Intel-released PyTorch binaries, or even in stock PyTorch 1.13.1.
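For example, a minimal isolation test (a sketch, not from this thread) is to load the model on CPU with stock PyTorch 1.13.1 and no IPEX import; if this succeeds, the failure is specific to the XPU path:

import torch  # stock 1.13.1 from pytorch.org, no intel_extension_for_pytorch
from transformers import AutoModelForCausalLM

# CPU-only load: no device_map, no .to('xpu')
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)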

DaWe35 commented 1 year ago

... device_map="xpu", ...

I tried changing it back to auto, but then I get AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'. Did you mean: '_scaled_dot_product_attention'? PyTorch is still 1.13.1

ghost commented 1 year ago

ValueError: Could not load model tiiuae/falcon-7b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

This is because the transformers package doesn't support XPU as a parameter for device_map when loading models. https://github.com/huggingface/transformers/blob/061580c82c2db1de9139528243e105953793f7a2/src/transformers/modeling_utils.py#L2784
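A possible workaround (a sketch, not verified against this transformers version) is to drop device_map entirely and move the loaded model to the XPU manually:

import torch
import intel_extension_for_pytorch  # registers the 'xpu' device with torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("xpu")  # manual device placement instead of device_map

inputs = tokenizer("Daniel: Hello, Girafatron!\nGirafatron:", return_tensors="pt").to("xpu")
output = model.generate(**inputs, max_length=200, do_sample=True, top_k=10)
print(tokenizer.decode(output[0]))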

kta-intel commented 1 year ago

@coreyjadams has your issue been resolved, or do the 1.13.120 wheels still appear to be missing for you?

@DaWe35

I tried changing it back to auto but then I get AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'. Did you mean: '_scaled_dot_product_attention'?

According to https://huggingface.co/tiiuae/falcon-40b/discussions/12, this seems to be a known issue that was resolved with torch v2.x. We don't have a public release of ipex v2.x for xpu yet, but it is in the works.

raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.") ValueError: Could not load model tiiuae/falcon-7b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

I was not able to reproduce this error on Ubuntu 22.04. Can you try running again, but with two changes:

  1. run source /opt/intel/oneapi/setvars.sh and then activate your ipex environment
  2. call import torch and import intel_extension_for_pytorch before importing any other modules. It's possible that the XPU is not being recognized, and this has been a workaround in the past while we work on a fix (see the sketch after this list)

Also, as noted, Arch is not listed as officially supported, so it may be worthwhile to try it in an Ubuntu 22.04 docker if we are unable to resolve the issue.
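A minimal sketch of the import order from step 2 (an illustration, not an officially documented requirement):

import torch
import intel_extension_for_pytorch as ipex  # import immediately after torch, before anything else
import transformers  # only import other modules afterwards

print(torch.xpu.device_count())  # should be >= 1 if the XPU is recognized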

DaWe35 commented 1 year ago

I pulled the Ubuntu docker image and installed everything. This is the message I see when trying to run the example GPU code:

root@0881bc54a8ef:~# python3 test.py 
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/xpu/lazy_init.py:73: UserWarning: DPCPP Device count is zero! (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:120.)
  _C._initExtension()
terminate called after throwing an instance of 'c10::Error'
  what():  dpcppSetDevice: device_id is out of range
Exception raised from dpcppSetDevice at /build/intel-pytorch-extension/csrc/gpu/runtime/Device.cpp:159 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7feece1a1f69 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xd5 (0x7feece16acdf in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: xpu::dpcpp::dpcppSetDevice(signed char) + 0x114 (0x7fedd2ef4224 in /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #3: xpu::dpcpp::set_device(signed char) + 0x20 (0x7fedd2eafbb0 in /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #4: xpu::dpcpp::impl::DPCPPGuardImpl::uncheckedSetDevice(c10::Device) const + 0xd (0x7fedd2eb377d in /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #5: at::AtenIpexTypeXPU::resize_impl(c10::TensorImpl*, c10::ArrayRef<long>, c10::optional<c10::ArrayRef<long> >, bool) + 0xb4a (0x7fedd2ee563a in /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #6: at::AtenIpexTypeXPU::impl::empty_strided_dpcpp(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions const&) + 0xcb (0x7feddbfbec2b in /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #7: at::AtenIpexTypeXPU::empty_strided(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0xe3 (0x7feddbfc71c3 in /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #8: <unknown function> + 0x1817a50 (0x7fedd2f6da50 in /usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so)
frame #9: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0xf8 (0x7feed00dfec8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x22064ed (0x7feed03e84ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::empty_strided::call(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0x1a6 (0x7feed01260c6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x1573730 (0x7feecf755730 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x112d (0x7feecfa6073d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x237fe4d (0x7feed0561e4d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf8 (0x7feecfe151c8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x22067e1 (0x7feed03e87e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf8 (0x7feecfe151c8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x343e17d (0x7feed162017d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x343e610 (0x7feed1620610 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #20: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1e5 (0x7feecfe97df5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #21: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x104 (0x7feecfa5a234 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x24f7c63 (0x7feed06d9c63 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #23: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x1fa (0x7feecfff4a3a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #24: <unknown function> + 0x3a3ae9 (0x7feedb04bae9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #25: <unknown function> + 0x3a3f84 (0x7feedb04bf84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #45: <unknown function> + 0x29d90 (0x7feee335ed90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: __libc_start_main + 0x80 (0x7feee335ee40 in /lib/x86_64-linux-gnu/libc.so.6)

Also, I just found https://hub.docker.com/r/intel/intel-extension-for-pytorch by accident - is this up to date? Why is this not in the readme? I find the Installation Guide very confusing, and I'm still not sure if I installed everything. There is no GPU install guide for Ubuntu 22.04, the very OS you recommend. I don't get it.

DaWe35 commented 1 year ago

I just pulled your official docker container and installed torchvision==0.11.0+cpu (src), but I couldn't even get the readme GPU example to work. I don't understand how this is supposed to work.

jingxu10 commented 1 year ago
  1. Your error message most likely shows that no GPU devices were found. You may wish to check whether the driver packages are installed correctly. The installation guide is at https://dgpu-docs.intel.com/driver/installation.html.

    terminate called after throwing an instance of 'c10::Error'
    what():  dpcppSetDevice: device_id is out of range

  2. May I learn more about your confusion regarding the installation guide? What is it that we don't explain well in the doc? There are basically the following 3 steps:

    1. Install the Intel GPU driver
    2. Install the oneAPI Base Toolkit
    3. python -m pip install torch==1.13.0a0+git6c9b55e torchvision==0.14.1a0 intel_extension_for_pytorch==1.13.120+xpu -f https://developer.intel.com/ipex-whl-stable-xpu

    Only the driver installation has OS-specific commands; the other 2 steps are the same everywhere.

  3. We will update the installation guide for docker usage. You need to run docker with the following command:

    docker run --rm -it --privileged --device=/dev/dri -v <path on your host>:<path in your container> intel/intel-extension-for-pytorch:<tag> bash

    Make sure you have the kernel-related driver packages installed on your host machine, following the driver installation guide at https://dgpu-docs.intel.com/driver/installation.html, before running this docker run command.

    Please don't install torchvision separately in the container.
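As a quick host-side sanity check before entering the container (a sketch; sycl-ls ships with the oneAPI Base Toolkit):

ls /dev/dri   # render nodes (renderD*) should exist if the KMD driver is loaded
source /opt/intel/oneapi/setvars.sh && sycl-ls   # should list the Arc GPU as a Level-Zero device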

DaWe35 commented 1 year ago

Docker test run and more

Hi, thank you. I started the docker with your command and now I see:

ImportError: /usr/local/lib/python3.9/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo

I also tried to run the script again without docker - no luck:

AttributeError: module 'torch' has no attribute 'no_grad'

A torch update should fix this, but since you require an old version, I can't update.

I also tried removing the import intel_extension_for_pytorch from my script, since some of you recommend importing it and some of you don't. Result:

RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
No module named 'torch.distributed'

The first and third errors are possibly a sign that no GPU is being found.

Since you only support Ubuntu and the install instructions don't work on Arch, you recommended I use Docker. I'm not sure why this would fix the situation, since as I understand it the host needs the drivers anyway, but I went ahead. The Arch wiki also says the driver should be plug and play, so I installed the oneAPI Base Toolkit on my host and tried to run the script. It still looks like the GPU is not recognized, so I ran some commands to check, and got mixed results:

GPU tests

lspci -v

00:02.0 VGA compatible controller: Intel Corporation Alder Lake-P Integrated Graphics Controller (rev 0c) (prog-if 00 [VGA controller])
        Subsystem: Hewlett-Packard Company Alder Lake-P Integrated Graphics Controller
        Flags: bus master, fast devsel, latency 0, IRQ 200, IOMMU group 1
        Memory at 612d000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 3000 [size=64]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
        Capabilities: [d0] Power Management version 2
        Capabilities: [100] Process Address Space ID (PASID)
        Capabilities: [200] Address Translation Service (ATS)
        Capabilities: [300] Page Request Interface (PRI)
        Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
        Kernel driver in use: i915
        Kernel modules: i915
...
03:00.0 Display controller: Intel Corporation DG2 [Arc A370M] (rev 05)
        Subsystem: Hewlett-Packard Company DG2 [Arc A370M]
        Flags: bus master, fast devsel, latency 0, IRQ 219, IOMMU group 23
        Memory at 5f000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 6000000000 (64-bit, prefetchable) [size=4G]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
        Capabilities: [d0] Power Management version 3
        Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [420] Physical Resizable BAR
        Capabilities: [400] Latency Tolerance Reporting
        Kernel driver in use: i915
        Kernel modules: i915

lsmod|grep -i vid

video                  73728  1 i915
videodev              372736  3 v4l2_async,v4l2_fwnode,hi556

More info:

$ vainfo
Trying display: wayland
vainfo: VA-API version: 1.18 (libva 2.18.2)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.1.0 ()
...
$ clinfo | head -n 5
Number of platforms                               0

hwinfo --display

06: PCI 300.0: 0380 Display controller                          
  [Created at pci.386]
  Unique ID: svHJ.b8jt1hY1Y97
  Parent ID: GA8e.mr2N3fBJq5F
  SysFS ID: /devices/pci0000:00/0000:00:06.0/0000:01:00.0/0000:02:01.0/0000:03:00.0
  SysFS BusID: 0000:03:00.0
  Hardware Class: graphics card
  Model: "Intel Display controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x5693 
  SubVendor: pci 0x103c "Hewlett-Packard Company"
  SubDevice: pci 0x891d 
  Revision: 0x05
  Driver: "i915"
  Driver Modules: "i915"
  Memory Range: 0x5f000000-0x5fffffff (rw,non-prefetchable)
  Memory Range: 0x6000000000-0x60ffffffff (ro,non-prefetchable)
  IRQ: 219 (137 events)
  Module Alias: "pci:v00008086d00005693sv0000103Csd0000891Dbc03sc80i00"
  Driver Info #0:
    Driver Status: i915 is active
    Driver Activation Cmd: "modprobe i915"
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #33 (PCI bridge)

34: PCI 02.0: 0300 VGA compatible controller (VGA)
  [Created at pci.386]
  Unique ID: _Znp.lZEhWUSNRiB
  SysFS ID: /devices/pci0000:00/0000:00:02.0
  SysFS BusID: 0000:00:02.0
  Hardware Class: graphics card
  Model: "Intel VGA compatible controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x46a6 
  SubVendor: pci 0x103c "Hewlett-Packard Company"
  SubDevice: pci 0x891d 
  Revision: 0x0c
  Driver: "i915"
  Driver Modules: "i915"
  Memory Range: 0x612d000000-0x612dffffff (rw,non-prefetchable)
  Memory Range: 0x4000000000-0x400fffffff (ro,non-prefetchable)
  I/O Ports: 0x3000-0x303f (rw)
  Memory Range: 0x000c0000-0x000dffff (rw,non-prefetchable,disabled)
  IRQ: 200 (779460 events)
  Module Alias: "pci:v00008086d000046A6sv0000103Csd0000891Dbc03sc00i00"
  Driver Info #0:
    Driver Status: i915 is active
    Driver Activation Cmd: "modprobe i915"
  Config Status: cfg=new, avail=yes, need=no, active=unknown

Primary display adapter: #34

The install guide

The most important parts are so hidden. For example, there is a table for "OS & Intel GPU Drivers" where, if I click the Stable 602 page, it is useless for me - the only thing I need is the small URL in "7.2.3.", which finally takes me to the driver install guide.

Now, I scroll down to 1.4.1.4. and have no idea whether I need to open the Client Usages URL, just go ahead with 1.4.2., or both. If I open it, there is a lot of stuff to install (I'm not sure what I need; but also, I'm still not on Ubuntu, and I still believe I already have a GPU driver automatically installed). I'm also pretty sure it's quite easy to port these packages to pacman, but I don't even know if I need them at this point.

I close all 3 opened pages and try to get the oneAPI Base Toolkit. The guide contains the install instructions, but lists them as requirements I need to install before going ahead: the "Intel® oneAPI Base Toolkit 2023.1" and also a "DPC++ Compiler hotfix". Please don't make me download a random zip I don't even know what to do with - just take me to an anchor (which doesn't exist right now), because for some reason the main thing is a small link again: Install oneAPI Base Toolkit Packages (https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html) - useless for me, since I need to install it from here: https://archlinux.org/packages/extra/x86_64/intel-oneapi-basekit/

Now the guide says I need to install the DPC++ Compiler and the Math Kernel Library. How? Aren't they part of the Base Toolkit? Then, finally, instructions for the mystery zip file (the DPC++ Compiler hotfix).

Install via wheel files: I don't know what wheels are, but sure, I installed all the pip dependencies, which failed because I had the wrong Python version. I spent an hour trying to remove the default Python 3.11, then installed pyenv, reported a bug, and installed pipenv, which totally ignored the pyenv version I had already installed, so I replaced both of them with conda and was finally able to install the pip dependencies.

The only remaining block in the guide is "Install via compiling from source", which is hopefully something I can safely ignore for now. And now I understand why everyone recommends Nvidia when someone asks about Arc drivers on Reddit - the portal I can no longer use to get answers, because every community is private now. I really want to make this work, but I don't have unlimited free time just to get my GPU working. I wanted to do some fun research in an hour and test the new AI models - and finally use my GPU for something.

jingxu10 commented 1 year ago

Hi, thanks for trying IPEX on Intel GPUs. There are many differences among Linux distributions, and it is a challenge for us to support them all. We currently selected several, including Ubuntu and RHEL, for large-scale validation and support. For distributions not on our list, there can be unexpected issues that users need to investigate a little. We can give some suggestions from a personal perspective, but we lack the capacity to provide official support at this moment.

This error:

ImportError: /usr/local/lib/python3.9/dist-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo

normally indicates that the PyTorch binary in the environment is not the one shipped with IPEX. Would you double-check that PyTorch is still the correct version? It seems the other errors are all related to PyTorch as well.

DaWe35 commented 1 year ago

@jingxu10 You're right, I messed up the version. I'm also installing the oneAPI Base Toolkit - is it possible that it is not included in the docker image?

DaWe35 commented 1 year ago

Okay, in Docker I was able to produce this:

clinfo | head -n 5
Number of platforms                               5
  Platform Name                                   Intel(R) FPGA Emulation Platform for OpenCL(TM)
  Platform Vendor                                 Intel(R) Corporation
  Platform Version                                OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
  Platform Profile                                EMBEDDED_PROFILE

The script still exits with an error:

/usr/local/lib/python3.9/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Traceback (most recent call last):
  File "/root/main.py", line 10, in <module>
    pipeline = transformers.pipeline(
  File "/usr/local/lib/python3.9/dist-packages/transformers/pipelines/__init__.py", line 788, in pipeline
    framework, model = infer_framework_load_model(
  File "/usr/local/lib/python3.9/dist-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py", line 2801, in from_pretrained
    max_memory = get_balanced_memory(
  File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/modeling.py", line 588, in get_balanced_memory
    per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
ZeroDivisionError: integer division or modulo by zero

If I remove the import intel_extension_for_pytorch, then I get a different error. I think I'll start the docker from scratch.

kta-intel commented 1 year ago

It's possible that the GPU is still not being recognized. What are you getting as output for num_devices and/or low_zero?

For a sanity check, can you try running python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
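Expanded for readability, that one-liner is equivalent to:

import torch
import intel_extension_for_pytorch as ipex

print(torch.__version__)
print(ipex.__version__)
for i in range(torch.xpu.device_count()):
    print(f"[{i}]: {torch.xpu.get_device_properties(i)}")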

rahulunair commented 1 year ago

hey @DaWe35, this might help - I have an unofficial but self-contained set of pytorch xpu tests that uses the docker images: https://github.com/rahulunair/xpu_verify

git clone https://github.com/rahulunair/xpu_verify
cd xpu_verify
./check_pytorch.sh # pulls a pytorch + ipex docker container from intel and tests it. The test takes about 10 seconds after the docker image has been downloaded.
DaWe35 commented 12 months ago

It's possible that the GPU is still not being recognized. What are you getting as output for num_devices and/or low_zero?

For a sanity check, can you try running python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

Hi, sorry for the late reply. Here is my output:

warn(f"Failed to load image Python extension: {e}")
1.13.0a0+git6c9b55e
1.13.120+xpu
[0]: _DeviceProperties(name='Intel(R) Arc(TM) A370M Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=3845MB, max_compute_units=128)
[1]: _DeviceProperties(name='Intel(R) Graphics [0x46a6]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=25435MB, max_compute_units=96)
DaWe35 commented 12 months ago

hey @DaWe35, this might help - I have an unofficial but self-contained set of pytorch xpu tests that uses the docker images: https://github.com/rahulunair/xpu_verify

git clone https://github.com/rahulunair/xpu_verify
cd xpu_verify
./check_pytorch.sh # pulls a pytorch + ipex docker container from intel and tests it. The test takes about 10 seconds after the docker image has been downloaded.

This is cool, thank you!

Results:

Intel XPU device is available, Device name: Intel(R) Arc(TM) A370M Graphics
Warning: Native FP64 type not supported on this platform
.....
Skipping direct FP64 multiplication tests, as the device doesn't support it.
PyTorch XPU tests successful!
cheperuiz commented 11 months ago

hey @DaWe35, this might help - I have an unofficial but self-contained set of pytorch xpu tests that uses the docker images: https://github.com/rahulunair/xpu_verify

git clone https://github.com/rahulunair/xpu_verify
cd xpu_verify
./check_pytorch.sh # pulls a pytorch + ipex docker container from intel and tests it. The test takes about 10 seconds after the docker image has been downloaded.

Wow! This repo is a lifesaver! Thanks for sharing! :)

PurnaChandraPanda commented 9 months ago

@rahulunair I still face the same problem today. I read in earlier comments that a fix has already been released. May I know how to get the fix?

I tried the install earlier like this:

pip install torch==2.0.1a0 torchvision==0.15.2a0 intel_extension_for_pytorch==2.0.110+xpu -f https://developer.intel.com/ipex-whl-stable-xpu
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/udev/.local/lib/python3.8/site-packages/torch/__init__.py", line 228, in <module>
    _load_global_deps()
  File "/home/udev/.local/lib/python3.8/site-packages/torch/__init__.py", line 187, in _load_global_deps
    raise err
  File "/home/udev/.local/lib/python3.8/site-packages/torch/__init__.py", line 168, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory
StEvUgnIn commented 1 month ago

You are using the Chinese repository. Have you tried the American one?

conda install pkg-config libuv
python -m pip install torch==2.1.0.post2 torchvision==0.16.0.post2 torchaudio==2.1.0.post2 intel-extension-for-pytorch==2.1.30 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
StEvUgnIn commented 1 month ago

hey @DaWe35, this might help - I have an unofficial but self-contained set of pytorch xpu tests that uses the docker images: https://github.com/rahulunair/xpu_verify

git clone https://github.com/rahulunair/xpu_verify
cd xpu_verify
./check_pytorch.sh # pulls a pytorch + ipex docker container from intel and tests it. The test takes about 10 seconds after the docker image has been downloaded.
torch version: 2.0.1a0+cxx11.abi
Warning: Intel XPU device is not available
An error occurred during the test: {e}

What should I do now? Does that mean that my device is not hardware compatible?


Edit:

I locally updated a line in ./pytorch/xpu_test.py; it then displayed the following:

torch version: 2.0.1a0+cxx11.abi
Warning: Intel XPU device is not available
An error occurred during the test: Intel XPU device not detected
jingxu10 commented 1 week ago

Could you try the latest version? What is your device, an iGPU or an Arc?