intel / torch-xpu-ops


[Arc] XPU backend: fp64 not supported #628

Open uniartisan opened 1 month ago

uniartisan commented 1 month ago

🐛 Describe the bug

import torch

assert torch.xpu.is_available(), "Intel XPU is not available"

batch_size = 4
vocab_size = 4
# RuntimeError: Required aspect fp64 is not supported on the device
out = torch.randn(batch_size, vocab_size, device='xpu').to(torch.bfloat16)

Description:

I've encountered an issue while using the Intel XPU backend with PyTorch:

  1. FP64 not supported on XPU device

When trying to create a random tensor and convert it to bfloat16, I receive the following error:

out = torch.randn(batch_size, vocab_size, device='xpu').to(torch.bfloat16)
# RuntimeError: Required aspect fp64 is not supported on the device

This error suggests that the XPU device does not support fp64 operations. However, the torch.randn() function seems to be attempting to use fp64 internally before converting to the desired dtype.

Proposed solution: Consider modifying the torch.randn() implementation to use fp32 as an intermediate type when fp64 is not supported on the device.
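A possible user-side sketch of that idea (hedged: whether it actually avoids the error depends on how the driver compiles the kernel source, as discussed later in the thread):

import torch

# Request the target dtype directly instead of randn(...).to(...),
# so no separate cast kernel (with its fp64 switch path) is dispatched.
out = torch.randn(4, 4, dtype=torch.bfloat16, device='xpu')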

Thank you for your attention to these matters. Let me know if you need any additional information.

Versions

Collecting environment information...
PyTorch version: 2.5.0a0+git0022dc0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (conda-forge gcc 14.1.0-0) 14.1.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          GenuineIntel
Model name:                         13th Gen Intel(R) Core(TM) i7-13700KF
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           6835.19
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          384 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           16 MiB (8 instances)
L3 cache:                           30 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] flake8==7.1.0
[pip3] numpy==1.26.4
[pip3] optree==0.12.1
[pip3] torch==2.5.0a0+git0022dc0
[pip3] torchao==0.3.1
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] optree                    0.12.1                   pypi_0    pypi
[conda] torch                     2.5.0a0+git0022dc0           dev_0    <develop>
[conda] torchao                   0.3.1                    pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
uniartisan commented 1 month ago
x = torch.randn(4, 4, dtype=torch.float32, device='xpu')
x.to(torch.float16)

This will fail:

Traceback (most recent call last):
  File "/home/lzy/workspace/demo/test4.py", line 37, in <module>
    x.to(torch.float16)
RuntimeError: Required aspect fp64 is not supported on the device

My torch is compiled with commit: a0edb44500608ff9148a4ff0331869734dc709f3

dvrogozh commented 1 month ago

Which Intel GPU are you using? According to "13th Gen Intel(R) Core(TM) i7-13700KF" you have Raptor Lake (https://www.intel.com/content/www/us/en/products/sku/230489/intel-core-i713700kf-processor-30m-cache-up-to-5-40-ghz/specifications.html). Do you have an Intel discrete GPU on the system as well?

I would suggest filing separate issues for fp64 and for the masked_select request. These are independent issues and would be better tracked/discussed separately.

For fp64: at the moment, support for the XPU backend is formally available for Intel® Data Center GPU Max Series platforms (formerly codenamed Ponte Vecchio, or PVC). See https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html. However, if you look into the sources, you will see that SYCL kernels are compiled for PVC and MTL (Meteor Lake), though MTL support is not formally announced. Other platforms are formally not supported. Assuming you are using Raptor Lake, your issue carries an implicit request to support this platform.

You still might be able to run the XPU backend on your system (assuming Raptor Lake) thanks to online kernel recompilation. Mind that this has a drawback in the time required to recompile, and I'm afraid there is no guarantee that kernels will run correctly after recompilation. This trick works for me on the ATS-M platform. It should also work for Alchemist, since the platform architecture is the same as ATS-M. I don't know about Raptor Lake. On Alder Lake the trick did not work for me - I saw runtime failures.

At the moment you can try to bypass the fp64 issue by enabling fp64 emulation. This can be done by exporting the following environment variables, one for compilation, one for runtime:

export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
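For reference, a minimal sketch of setting these from Python; they must be in the environment before the XPU runtime initializes (the variable names come from this thread, not a documented PyTorch API):

import os

# Set before importing torch (or at least before first touching the XPU
# runtime), otherwise the driver will not pick these up.
os.environ['OverrideDefaultFP64Settings'] = '1'
os.environ['IGC_EnableDPEmulation'] = '1'

import torch  # the fp64 emulation settings are now visible to the driver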
uniartisan commented 1 month ago

I'm using an Arc A770.

https://github.com/intel/torch-xpu-ops/compare/main...uniartisan:torch-xpu-ops:main

Issue Overview: There's a problem converting tensors to bf16 (BFloat16) or other dtypes on XPU devices. Specifically, calling tensor.to(bf16) doesn't work correctly.

Call Path:

  1. tensor.to(bf16) is called
  2. This invokes the to_impl function
  3. to_impl ultimately calls at::_to_copy
  4. at::_to_copy creates a new destination tensor and calls r.copy_(self, non_blocking)
  5. For XPU devices, the copy operation calls XPUNativeFunctions::copy_
  6. XPUNativeFunctions::copy_ calls native::xpu::_copy_xpu
  7. The _copy_xpu function creates a TensorIterator, then calls copy_device_to_device:

     if (dst_device.is_xpu() && src_device.is_xpu()) {
       copy_device_to_device(iter, non_blocking, p2p_enabled);
       return;
     }

  8. copy_device_to_device:

    
    bool memcpy_eligible =
        same_type && same_conj && same_neg && iter.is_contiguous();

    Device dst_device = iter.device(0);
    Device src_device = iter.device(1);

    c10::DeviceGuard device_guard(src_device);

    // We always perform the copy on the source device, using the current stream
    // on the source device, and we fully synchronize on both src and dst's
    // current streams for completion of the copy.
    XPUStream copy_stream = getCurrentXPUStream(src_device.index());
    if (src_device != dst_device) {
      // This is a cross-device copy on the src current stream and dst current
      // stream. We perform a two-way barrier between both devices' streams
      // before the copy. This ensures that any write-after-write and
      // write-after-read dependencies on the destination side are handled, so
      // that no one is operating on the dst memory when we perform the copy.
      // src waits on dst barrier (src already waits on src)
      XPUEvent dst_ready;
      device_guard.set_index(dst_device.index());
      dst_ready.record(getCurrentXPUStream(dst_device.index()));

      device_guard.set_index(src_device.index());
      dst_ready.block(copy_stream);
    }

    if (memcpy_eligible) {
      // SYCL queue.memcpy performance is worse than SYCL copy kernel
      // implementation. JIRA:
      // https://jira.devtools.intel.com/browse/CMPLRLLVM-41292
      memcpyAsync(iter, copy_stream, p2p_enabled);
    } else {
      if (same_neg) {
        if (!same_conj) {
          conj_kernel(iter);
        } else {
          copy_kernel(iter);
        }
      } else {
        if (!same_conj) {
          neg_conj_kernel(iter);
        } else {
          neg_kernel(iter);
        }
      }
    }
Because same_type is false, memcpy_eligible is false, so control finally reaches copy_kernel.
  9. Finally, the copy_kernel function runs.
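At the user level, the chain above is roughly equivalent to this sketch (semantics only, not the exact internals):

import torch

# t.to(dtype) allocates a destination tensor and dispatches a copy,
# which is where the XPU copy/cast kernel in steps 5-9 comes in.
t = torch.randn(4, 4, dtype=torch.float32)
dst = torch.empty_like(t, dtype=torch.bfloat16)
dst.copy_(t)  # same kernel family as t.to(torch.bfloat16)
assert torch.equal(dst, t.to(torch.bfloat16))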

Core Issue:
In the final _copy_xpu function, the conversion from other types to the destination dtype is not properly handled. The function mainly deals with device-to-device copying but doesn't specifically handle data type conversions.
However, when I compared this with PyTorch's implementation, there is a difference in the final copy_kernel, so I chose to modify that final function.

In PyTorch's implementation, there's a check for type mismatch between the destination and source tensors, which then calls a copy_kernel. This is a crucial point that needs to be addressed in the XPU implementation.
https://github.com/pytorch/pytorch/blob/26383a6cc0197a30fee3d2d3f0626ed342fc9a28/aten/src/ATen/native/cpu/CopyKernel.cpp#L287

In the specific case of copy operations, using `common_dtype()` may not be the best choice, which is why using `iter.dtype(0)` in the modification is more appropriate:

1. Specificity of copy operations: In copy operations, we typically want the type of the target tensor to remain unchanged, rather than changing to some "common" type.

2. Explicit target type: Using `iter.dtype(0)` explicitly specifies the target type we want to copy to, which makes more sense in copy operations.

3. Avoiding unnecessary conversions: If `common_dtype()` is used, it might lead to some unnecessary type conversions, especially when the source and target types are already different.

4. Maintaining original intent: In copy operations, users usually expect the data to be copied into a tensor of the specified type, rather than being converted to some common type.
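A CPU-side illustration of these points (a sketch, independent of the XPU backend): the result of a copy should take the destination's own dtype, which is exactly what iter.dtype(0) expresses:

import torch

src = torch.randn(4, 4, dtype=torch.float32)
dst = torch.empty(4, 4, dtype=torch.bfloat16)
# Type promotion ("common dtype") of float32 and bfloat16 would be float32,
# but a copy must land in the destination's dtype instead:
assert torch.promote_types(src.dtype, dst.dtype) == torch.float32
dst.copy_(src)
assert dst.dtype == torch.bfloat16  # iter.dtype(0), not the common dtype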

However, it still reminds me of fp64. When I set:

export OverrideDefaultFP64Settings=1

I don't even need to set:

export IGC_EnableDPEmulation=1

and the tensor conversion now works normally.

I find:

#define AT_FLOATING_TYPES c10::kDouble, c10::kFloat

This tells the compiler to generate code that supports fp64, even though I never used fp64 myself...

Update:

template <typename scalar_t, typename return_t = scalar_t, typename func_t>
void opmath_symmetric_gpu_kernel_with_scalars(
    TensorIteratorBase& iter,
    const func_t& f) {
  // Use symmetric property of the functor to reduce number of kernels,
  // requires f(a, b) == f(b, a)
  TORCH_INTERNAL_ASSERT(iter.ntensors() == 3);

  using traits = function_traits<func_t>;
  using opmath_arg_t = typename traits::template arg<0>::type;
  static_assert(
      traits::arity == 2,
      "gpu_kernel_with_scalars only supports two input arguments");
  static_assert(
      std::is_same<opmath_arg_t, typename traits::template arg<1>::type>::value,
      "f is not symmetric");

  OptionalDeviceGuard device_guard;
  opmath_arg_t scalar_val{};

  if (iter.is_cpu_scalar(1)) {
    scalar_val = iter.scalar_value<opmath_arg_t>(1);
    iter.remove_operand(1);

    device_guard.reset_device(iter.device(1));
  } else if (iter.is_cpu_scalar(2)) {
    scalar_val = iter.scalar_value<opmath_arg_t>(2);
    iter.remove_operand(2);
  }

  if (iter.ninputs() == 2) {
    gpu_kernel(iter, BinaryFunctor<scalar_t, scalar_t, return_t, func_t>(f));
  } else {
    AUnaryFunctor<scalar_t, scalar_t, return_t, func_t> unary_f(f, scalar_val);
    gpu_kernel(iter, unary_f);
  }
}



I see the cast here: torch-xpu-ops/src/ATen/native/xpu/sycl/Loops.h
dvrogozh commented 1 month ago

I think I stepped into a similar issue with print(tensor), where PyTorch tried to cast fp16/bf16 to fp32. Everything looked similar - no usage of fp64 on the user side, yet still the runtime error. The stack you describe above looks similar too. I can't say your problem is 100% the same as what I saw, since I did not debug yours, but here is the outcome of debugging my issue. It turns out that the XPU copy kernel is implemented via PyTorch CPU functions which the SYCL compiler compiles into the GPU kernel. So, in my case, the kernel is implemented through this function: https://github.com/pytorch/pytorch/blob/56bb047449b8fa8406662d447512da0b9e4c147e/c10/core/DynamicCast.h#L73:

template <typename dest_t>
C10_HOST_DEVICE inline dest_t fetch_and_cast(
    const ScalarType src_type,
    const void* ptr) {
  switch (src_type) {
    AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(FETCH_AND_CAST_CASE)
    FETCH_AND_CAST_CASE(uint16_t, UInt16)
    FETCH_AND_CAST_CASE(uint32_t, UInt32)
    FETCH_AND_CAST_CASE(uint64_t, UInt64)
    default:
      ERROR_UNSUPPORTED_CAST
  }
  return dest_t(0); // just to avoid compiler warning
}

Pay attention that this is a single kernel handling multiple cast paths via a switch. As a result, even though the operations requested by the user did not need any fp64 ops, the kernel itself is designed with an fp64 path inside. And even when that path is not triggered by the user's request, the kernel may fail to compile (via the online compiler, as I explained before), and probably even at runtime, if the stack formally checks the kernel for fp64 ops regardless of whether they will actually be executed (this is my theory of how the stack behaves).
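If that theory holds, a minimal repro needs no fp64 at all (a sketch, assuming an Arc-class GPU without native fp64):

import torch

x = torch.ones(8, device='xpu', dtype=torch.float16)
# No float64 tensor anywhere, but the dynamic-cast copy kernel behind
# .to() contains an fp64 case in its switch, which the driver may reject
# when compiling the whole kernel.
y = x.to(torch.float32)  # may raise: Required aspect fp64 is not supported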

uniartisan commented 1 month ago

I see. By the way, I have updated my debug path above.

I remember that in the previously compiled version (the one I hadn't modified), although it emulated fp64, the data types weren't correctly cast. I corrected the casting through debugging and the changes above. However, since compilation is very slow on my computer, I haven't re-verified the initial version. If it's convenient for you, could you check whether the cast after .to(bf16) is correct when the emulation environment variables are set? Regarding the issue of throwing an error and then terminating, do we have any solutions? For example, could we enable fp64 emulation by default and print warning logs when users actually use it? https://github.com/pytorch/pytorch/blob/26383a6cc0197a30fee3d2d3f0626ed342fc9a28/torch/xpu/__init__.py

We could add:

import os

def set_xpu_environment():
    # Intended to run at import time in torch/xpu/__init__.py, before the
    # XPU runtime initializes; otherwise the variables have no effect.
    incompatible_devices = [i for i in range(torch.xpu.device_count())
                            if not get_device_capability(i).get('has_fp64', True)]

    if incompatible_devices:
        os.environ['OverrideDefaultFP64Settings'] = '1'
        os.environ['IGC_EnableDPEmulation'] = '1'
        print(f"FP64-incompatible devices found: {incompatible_devices}")
        print("Set OverrideDefaultFP64Settings and IGC_EnableDPEmulation to 1; FP64 may be slower.")

set_xpu_environment()
fengyuan14 commented 1 month ago

@uniartisan Thanks for your feedback. You might be hitting two issues due to the lack of FP64 support on Arc.

  1. random. Even though your operator is FP64-irrelevant, runtime compilation fails due to the per-source build. Per-source build means that all kernels in the same source file, including the FP64 ones, are built when any one of them is called, and the driver raises a build error. AOT build (building machine code statically) should help with this.
  2. Exactly as @dvrogozh mentioned, the kernel of the data conversion operator tensor.to(dtype) uses dynamic cast, where there is a switch-case for FP64. For that case, we are evaluating how to work around it.
uniartisan commented 1 month ago

Thanks!

dvrogozh commented 1 month ago

@fengyuan14: I see #511 got merged. However, I see that kernel compilation for Arc (the build flag) was dropped from that PR. Can you please help me understand the plan for Arc? Will such a build flag be added in another PR, or will Arc be supported via online kernel compilation?

fengyuan14 commented 1 month ago

We are investigating the binary size, build time, reasonability, and compatibility of enabling the Arc AOT build. Once all of these are acceptable, the Arc AOT build will be added by default in build-from-source. BTW, you can also enable the Arc AOT build yourself by setting TORCH_XPU_ARCH_LIST= to override the default targets.

As for the binary release, I think there should be some wheel including Arc AOT. We are evaluating whether a unified wheel including AOT builds of all verified platforms is acceptable, reasonable, and compatible, or whether there should be a separate wheel for Arc AOT.

For example, Arc wheels built on the LTS driver (different IGC) may not work. So I am not sure, but my guess is:

  1. Release a wheel with PVC and MTL built on the LTS driver.
  2. Release a wheel with Arc built on the rolling driver.

You can reach River or Eikan about the release distributions. @riverliuintel @EikanWang Please correct me.

riverliuintel commented 1 month ago

For the PyTorch 2.5 release, Arc with partial FP64 emulation will be supported by source build on the rolling driver. Whether Arc support is included in the Torch 2.5 AOT binary release is an open question for now. We are pushing hard in this direction.

riverliuintel commented 3 weeks ago

Confirmed. The PyTorch 2.5 release binary will support Arc AOT with partial FP64 emulation.