Closed uniartisan closed 3 weeks ago
>>> torch.xpu.empty_cache()
>>> x = torch.rand(46000, 40000, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 6.85 GiB (GPU 0; 15.56 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)
>>> x = torch.rand(46000, 10000, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: XPU out of memory. Tried to allocate 1.71 GiB (GPU 0; 15.56 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)
>>> x = torch.rand(46000, 1000, dtype=torch.float32, device='xpu')
>>> x = torch.rand(46000, 10000, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: XPU out of memory. Tried to allocate 1.71 GiB (GPU 0; 15.56 GiB total capacity; 176.00 MiB already allocated; 176.00 MiB reserved in total by PyTorch)
>>>
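For reference, the allocation sizes in the errors above match the requested tensor shapes exactly; a quick sanity check of the arithmetic (plain Python, no XPU required):

```python
# Each float32 element occupies 4 bytes.
bytes_46000x40000 = 46000 * 40000 * 4
bytes_46000x10000 = 46000 * 10000 * 4

# 7,360,000,000 bytes ~= 6.85 GiB, as reported in the first error
print(round(bytes_46000x40000 / 2**30, 2))   # 6.85

# 1,840,000,000 bytes ~= 1.71 GiB, as reported in the second error
print(round(bytes_46000x10000 / 2**30, 2))   # 1.71
```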
I learned from the linked issue that the message about exceeding 4 GB is expected, but I have noticed that under WSL2 even allocations larger than roughly 1 GB are also rejected. My A770 desktop only has Windows 11 installed, so I am unable to observe the behavior under Ubuntu 22.04 for now.
Update: tested under Ubuntu 22.04; it appears to work fine there.
I am not sure whether this is related to the device driver, but it does not seem right that users can only allocate blocks smaller than about 1 GB here.
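To narrow the limit down, a binary search over single-allocation sizes can measure the largest block that succeeds on a given device. This is only a diagnostic sketch (the function name `largest_alloc_mib` and the default upper bound are mine, not part of PyTorch or IPEX); on an affected WSL2 setup I would expect it to report a value just under 1024 for device='xpu':

```python
import torch

def largest_alloc_mib(device: str, hi_mib: int = 8192) -> int:
    """Binary-search the largest single float32 allocation (in MiB)
    that succeeds on `device`. Diagnostic sketch only."""
    lo, hi = 0, hi_mib
    while lo < hi:
        mid = (lo + hi + 1) // 2
        try:
            # mid MiB of float32 = mid * 1024 * 1024 / 4 elements
            t = torch.empty(mid * 1024 * 1024 // 4,
                            dtype=torch.float32, device=device)
            del t
            lo = mid  # allocation succeeded, try larger
        except RuntimeError:
            hi = mid - 1  # allocation failed, try smaller
    return lo
```

On Ubuntu the same probe should be bounded only by free device memory (or by the 4 GB per-block platform limit mentioned above).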
Can you run xpu-smi (https://github.com/intel/xpumanager/releases) to check "Memory physical size" and "Max Mem Alloc Size"?
$ xpu-smi discovery -d 0
@jgong5 @feng-intel I apologize, but since I have already switched the physical system to Ubuntu, I am temporarily unable to reproduce the issue. The strange thing is that my 5 GB model can be loaded onto the GPU (though perhaps not into a single memory heap), yet the code above still fails.
I can't reproduce the issue on my Arc A770. If you still have this problem, please reopen this issue and provide steps to reproduce it. Thanks.
Describe the bug
Minimal code
Additional Details:
When moving the tensor of size (80, 1584, 2048) to the XPU device for the first time, the operation succeeds, and it shows that the tensor occupies approximately 990 MB of memory. However, when attempting to move a slightly larger tensor (size (83, 1584, 2048)) to the XPU device, the "XPU out of memory" error is thrown, even though the XPU device has a total capacity of 15.56 GB.
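The two shapes straddle the 1 GiB boundary almost exactly, which is consistent with a roughly 1 GiB per-allocation cap under WSL2 (my reading of the numbers, not a confirmed driver limit):

```python
# float32 = 4 bytes per element
small = 80 * 1584 * 2048 * 4   # 1,038,090,240 bytes
large = 83 * 1584 * 2048 * 4   # 1,077,018,624 bytes

print(small / 2**20)   # 990.0 MiB -- exactly the ~990 MB that succeeds
print(large / 2**30)   # ~1.003 GiB -- just over 1 GiB, and it fails
```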
Steps to Reproduce:
Install the required dependencies (PyTorch and IPEX) in a WSL2 environment. Run the provided minimal reproducible code in a Python interactive session or script. Observe the "XPU out of memory" error when attempting to move the larger tensor to the XPU device.
Expected Behavior: The larger tensor should be moved to the XPU device successfully, since the device has sufficient total capacity.
Actual Behavior: An "XPU out of memory" error is thrown when attempting to move the larger tensor, despite the device having enough total capacity.
I would appreciate any assistance or guidance in resolving this issue. Please let me know if you need any additional information or clarification.
Versions
Environment Information:
Operating System: Windows Subsystem for Linux 2 (WSL2), Ubuntu 22.04
Python Version: 3.10.14
Driver Version: Intel® Graphics Driver 31.0.101.5522 (WHQL Certified)