Closed uniartisan closed 3 weeks ago
>>> torch.xpu.empty_cache()
>>> x = torch.rand(46000, 40000, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 6.85 GiB (GPU 0; 15.56 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)
>>> x = torch.rand(46000, 10000, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: XPU out of memory. Tried to allocate 1.71 GiB (GPU 0; 15.56 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)
>>> x = torch.rand(46000, 1000, dtype=torch.float32, device='xpu')
>>> x = torch.rand(46000, 10000, dtype=torch.float32, device='xpu')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: XPU out of memory. Tried to allocate 1.71 GiB (GPU 0; 15.56 GiB total capacity; 176.00 MiB already allocated; 176.00 MiB reserved in total by PyTorch)
>>>
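For reference, the allocation sizes in the errors above match the requested tensor shapes exactly; a quick sanity check of the arithmetic (plain Python, no XPU required):

```python
# Each float32 element occupies 4 bytes.
bytes_46000x40000 = 46000 * 40000 * 4
bytes_46000x10000 = 46000 * 10000 * 4

# 7,360,000,000 bytes ~= 6.85 GiB, as reported in the first error
print(round(bytes_46000x40000 / 2**30, 2))   # 6.85

# 1,840,000,000 bytes ~= 1.71 GiB, as reported in the second error
print(round(bytes_46000x10000 / 2**30, 2))   # 1.71
```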
I learned from the linked issue that the message about exceeding 4 GB is expected, but I have noticed that under WSL2 even allocations larger than roughly 1 GB are also rejected. My A770 desktop only has Windows 11 installed, so I am unable to observe the behavior under Ubuntu 22.04 for now.
Update: tested under Ubuntu 22.04; it appears to work fine there.
I am not sure whether this is related to the device driver, but it does not seem right that users can only allocate blocks smaller than about 1 GB here.
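To narrow the limit down, a binary search over single-allocation sizes can measure the largest block that succeeds on a given device. This is only a diagnostic sketch (the function name `largest_alloc_mib` and the default upper bound are mine, not part of PyTorch or IPEX); on an affected WSL2 setup I would expect it to report a value just under 1024 for device='xpu':

```python
import torch

def largest_alloc_mib(device: str, hi_mib: int = 8192) -> int:
    """Binary-search the largest single float32 allocation (in MiB)
    that succeeds on `device`. Diagnostic sketch only."""
    lo, hi = 0, hi_mib
    while lo < hi:
        mid = (lo + hi + 1) // 2
        try:
            # mid MiB of float32 = mid * 1024 * 1024 / 4 elements
            t = torch.empty(mid * 1024 * 1024 // 4,
                            dtype=torch.float32, device=device)
            del t
            lo = mid  # allocation succeeded, try larger
        except RuntimeError:
            hi = mid - 1  # allocation failed, try smaller
    return lo
```

On Ubuntu the same probe should be bounded only by free device memory (or by the 4 GB per-block platform limit mentioned above).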
Can you run xpu-smi (https://github.com/intel/xpumanager/releases) to check "Memory physical size" and "Max Mem Alloc Size"?
$ xpu-smi discovery -d 0
@jgong5 @feng-intel I apologize, but since I have already switched the physical system to Ubuntu, I am temporarily unable to reproduce the issue. The strange thing is that my 5 GB model can be loaded onto the GPU (though perhaps not into a single memory heap), yet the code above still fails.
I can't reproduce the issue on my Arc A770. If you still have this problem, please reopen this issue and provide steps to reproduce it. Thanks.
Describe the bug
Minimal code
Additional Details:
When moving the tensor of size (80, 1584, 2048) to the XPU device for the first time, the operation succeeds, and it shows that the tensor occupies approximately 990 MB of memory. However, when attempting to move a slightly larger tensor (size (83, 1584, 2048)) to the XPU device, the "XPU out of memory" error is thrown, even though the XPU device has a total capacity of 15.56 GB.
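The two shapes straddle the 1 GiB boundary almost exactly, which is consistent with a roughly 1 GiB per-allocation cap under WSL2 (my reading of the numbers, not a confirmed driver limit):

```python
# float32 = 4 bytes per element
small = 80 * 1584 * 2048 * 4   # 1,038,090,240 bytes
large = 83 * 1584 * 2048 * 4   # 1,077,018,624 bytes

print(small / 2**20)   # 990.0 MiB -- exactly the ~990 MB that succeeds
print(large / 2**30)   # ~1.003 GiB -- just over 1 GiB, and it fails
```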
Steps to Reproduce:
Install the required dependencies (PyTorch and IPEX) in a WSL2 environment. Run the provided minimal reproducible code in a Python interactive session or script. Observe the "XPU out of memory" error when attempting to move the larger tensor to the XPU device.
Expected Behavior: The larger tensor should be moved to the XPU device successfully, since the device has sufficient total capacity.
Actual Behavior: An "XPU out of memory" error is thrown when attempting to move the larger tensor, despite the device having enough total capacity.
I would appreciate any assistance or guidance in resolving this issue. Please let me know if you need any additional information or clarification.
Versions
Environment Information:
Operating System: Windows Subsystem for Linux 2 (WSL2), Ubuntu 22.04
Python Version: 3.10.14
Driver Version: Intel® Graphics Driver 31.0.101.5522 (WHQL Certified)