intel / torch-xpu-ops

Apache License 2.0

log_softmax operation fails on XPU with "Kernel is incompatible with all devices" error on arc a770 #664

Open uniartisan opened 3 months ago

uniartisan commented 3 months ago

🐛 Describe the bug

import torch
import torch.nn.functional as F

def test_log_softmax(device):
    print(f"Testing on {device}")

    input_tensor = torch.randn(3, 4, 5, device=device)
    temperature = 1.0

    try:
        log_probabilities = torch.log_softmax(input_tensor / temperature, dim=-1)
        print("Log softmax operation successful")
        print(f"Output shape: {log_probabilities.shape}")
    except RuntimeError as e:
        print(f"RuntimeError occurred: {str(e)}")

test_log_softmax('cpu')

if torch.xpu.is_available():
    test_log_softmax('xpu')
else:
    print("XPU not available on this system")

My environment is WSL2, with PyTorch 2.5 built from source, and my card is an Arc A770. The log_softmax operation fails on XPU with a "Kernel is incompatible with all devices" error.

Description: Calling log_softmax on an XPU device fails with an error indicating that the kernel is incompatible with all devices, even though recent commits purportedly added XPU support for this operation.

Error message: RuntimeError: Kernel is incompatible with all devices in devs

Steps to reproduce:

  1. Set up a PyTorch environment with XPU support
  2. Create a random tensor on the XPU device
  3. Attempt to apply torch.nn.functional.log_softmax to the tensor

Expected behavior: The log_softmax operation should execute successfully on the XPU device.

Actual behavior: The operation fails with the RuntimeError stating the kernel is incompatible with all devices.

Versions

PyTorch version: 2.5.0a0+git4073f73
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (conda-forge gcc 14.1.0-0) 14.1.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i7-13700KF
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
BogoMIPS: 6835.20
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 384 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 16 MiB (8 instances)
L3 cache: 30 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flake8==7.1.0
[pip3] numpy==1.26.4
[pip3] optree==0.12.1
[pip3] torch==2.5.0a0+gitac2e603
[pip3] torchao==0.3.1
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] optree 0.12.1 pypi_0 pypi
[conda] torch 2.5.0a0+gitac2e603 pypi_0 pypi
[conda] torchao 0.3.1 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi

uniartisan commented 3 months ago

This is similar to https://github.com/intel/torch-xpu-ops/issues/628 and to pull request https://github.com/intel/torch-xpu-ops/pull/511. The following must be set:

export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
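The two exports above can also be applied from Python, using only the standard library. This is a minimal sketch; the variable names come from the exports in this comment, and the claim that they must be in place before torch is imported (so the XPU runtime picks them up on initialization) is an assumption:

```python
import os

# Workaround for consumer GPUs without native FP64 (e.g. Arc A770):
# enable the IGC double-precision emulation path.
os.environ["OverrideDefaultFP64Settings"] = "1"
os.environ["IGC_EnableDPEmulation"] = "1"

# Assumption: import torch only AFTER setting these variables,
# so the XPU runtime sees them when it initializes.
# import torch
```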
fengyuan14 commented 3 months ago

Hi, @uniartisan. Likely you are working on a system with both an iGPU and a dGPU (ARC). The operator should be compatible with ARC, so I assume that xpu in test_log_softmax('xpu') refers to the iGPU. Please try torch.xpu.get_device_properties('xpu:0') and torch.xpu.get_device_properties('xpu:1') to check which of the devices PyTorch exposes is the dGPU on your system. BTW, by default xpu means xpu:0.
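One way to follow this suggestion is to enumerate every XPU device PyTorch exposes and print its name, so you can see which index corresponds to the A770. A minimal sketch: `list_xpu_devices` is a hypothetical helper, and it degrades to an empty list when no XPU-capable PyTorch build is present:

```python
def list_xpu_devices():
    """Return (index, name) pairs for each XPU device PyTorch exposes,
    or an empty list when torch or XPU support is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not hasattr(torch, "xpu") or not torch.xpu.is_available():
        return []
    return [(i, torch.xpu.get_device_properties(i).name)
            for i in range(torch.xpu.device_count())]

# A bare 'xpu' means 'xpu:0'; if index 0 turns out to be the iGPU,
# pass the dGPU's index explicitly, e.g. device='xpu:1'.
for idx, name in list_xpu_devices():
    print(f"xpu:{idx} -> {name}")
```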

fengyuan14 commented 3 weeks ago

@daisyden Could you help verify this case on the ARC? We assume our SYCL kernel implementation does not depend on FP64 and should work on ARC.

PenghuiCheng commented 3 weeks ago

> @daisyden Could you help verify this case on the ARC? We assume our SYCL kernel implementation does not depend on FP64 and should work on ARC.

This case passes on PyTorch master (f3c3f3a3c39a359af6f06619e44e0d6a26b58e6d) with torch-xpu-ops (94d0ee6858633f00629ad1980d84df53b761fc8a).