I tried torch.linalg.svd on a Max Series GPU using the Intel Devcloud and packages from the intel conda channel, and while I cannot reproduce the segfault, SVD performance on XPU is lower than on CPU:
>>> import dpctl
>>> dpctl.get_devices()
[<dpctl.SyclDevice [backend_type.opencl, device_type.cpu, Intel(R) Xeon(R) Platinum 8480+] at 0x15040cdd6bf0>,
<dpctl.SyclDevice [backend_type.opencl, device_type.accelerator, Intel(R) FPGA Emulation Device] at 0x15040cdd6db0>,
<dpctl.SyclDevice [backend_type.level_zero, device_type.gpu, Intel(R) Data Center GPU Max 1100] at 0x15040cdd6df0>]
>>> import intel_extension_for_pytorch
>>> import torch
>>> intel_extension_for_pytorch.__version__
'2.0.110+xpu'
>>> torch.__version__
'2.0.1a0+cxx11.abi'
>>> data_cpu = torch.randn(4096, 2046)
>>> %time _ = torch.linalg.svd(data_cpu, full_matrices=False)
CPU times: user 1min 1s, sys: 2.3 s, total: 1min 3s
Wall time: 783 ms
>>> data_xpu = data_cpu.to("xpu")
>>> %time _ = torch.linalg.svd(data_xpu, full_matrices=False)
CPU times: user 3min 52s, sys: 1.14 s, total: 3min 53s
Wall time: 2.31 s
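In case it helps narrow things down, here is a minimal stand-alone timing sketch with a warm-up run and an explicit device synchronization (this assumes torch.xpu.synchronize() is exposed by the installed intel_extension_for_pytorch build), which should rule out one-time initialization or asynchronous queueing as the explanation for the measured gap:

import time
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the "xpu" device)

def bench_svd(x, sync=None, repeats=3):
    # Warm-up run so one-time kernel/library initialization is not measured.
    torch.linalg.svd(x, full_matrices=False)
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        torch.linalg.svd(x, full_matrices=False)
        if sync is not None:
            sync()  # wait for the device queue to drain before stopping the clock
        best = min(best, time.perf_counter() - t0)
    return best

data_cpu = torch.randn(4096, 2046)
data_xpu = data_cpu.to("xpu")
print(f"cpu: {bench_svd(data_cpu):.3f} s")
print(f"xpu: {bench_svd(data_xpu, sync=torch.xpu.synchronize):.3f} s")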
Also note that I checked that a large GEMM on the same machine runs approximately 4 to 5x faster on XPU than on CPU:
>>> A_cpu, B_cpu = torch.randn(4096, 4096), torch.randn(4096, 4096)
>>> A_xpu, B_xpu = A_cpu.to("xpu"), B_cpu.to("xpu")
>>> %timeit (A_cpu @ B_cpu)[0, 0].item()
29 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit (A_xpu @ B_xpu)[0, 0].item()
6.91 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
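As a rough sanity check that the XPU is actually doing the work, a back-of-the-envelope throughput estimate from the timings above (the .item() call forces a synchronization and a device-to-host copy, so these are lower bounds):

# Effective throughput from the %timeit results above (2*N^3 FLOPs for an N x N GEMM).
N = 4096
flops = 2 * N**3
print(f"CPU: {flops / 29e-3 / 1e12:.1f} TFLOP/s")    # ~4.7 TFLOP/s
print(f"XPU: {flops / 6.91e-3 / 1e12:.1f} TFLOP/s")  # ~19.9 TFLOP/s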
I would have expected a similar 4-5x speed-up for SVD on XPU, not a roughly 3x slowdown.