Open uniartisan opened 1 month ago
x = torch.randn(4, 4, dtype=torch.float32, device='xpu')
x.to(torch.float16)
This will fail. Traceback (most recent call last):
File "/home/lzy/workspace/demo/test4.py", line 37, in
My torch is compiled with commit: a0edb44500608ff9148a4ff0331869734dc709f3
Which Intel GPU are you using? According to "13th Gen Intel(R) Core(TM) i7-13700KF" you have Raptor Lake (https://www.intel.com/content/www/us/en/products/sku/230489/intel-core-i713700kf-processor-30m-cache-up-to-5-40-ghz/specifications.html). Do you have an Intel discrete GPU on the system as well?
I would suggest filing separate issues for fp64 and for the masked_select request. These are independent issues and are better tracked and discussed separately.
For fp64: at the moment, XPU backend support is formally available for Intel® Data Center GPU Max Series platforms (formerly codenamed Ponte Vecchio, or PVC). See https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html. However, if you look into the sources, you will see that the SYCL kernels are compiled for PVC and MTL (Meteor Lake), although MTL support has not been formally announced. Other platforms are formally not supported. Assuming you are using Raptor Lake, your issue carries an implicit request to support this platform.
You might still be able to run the XPU backend on your system (assuming Raptor Lake) thanks to online kernel recompilation. Mind that the drawback is the time required to recompile, and I'm afraid there is no guarantee that the kernels will run correctly after recompilation. This trick works for me on the ATS-M platform. It should also work for Alchemist, since its architecture is the same as ATS-M. I don't know about Raptor Lake. On Alder Lake the trick did not work for me: I saw runtime failures.
At the moment you can try to bypass the fp64 issue by enabling fp64 emulation. This can be done by exporting the following environment variables, one for compilation and one for runtime:
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
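When the workload is started from a Python script rather than from a shell, the same two variables can be set in-process. The following is only a minimal sketch: the helper name enable_fp64_emulation is hypothetical (it is not part of PyTorch), and it assumes the variables have to be in the environment before torch initializes the XPU runtime.

```python
import os

# Variable names are the ones quoted above; the helper itself is hypothetical.
FP64_EMULATION_VARS = ("OverrideDefaultFP64Settings", "IGC_EnableDPEmulation")


def enable_fp64_emulation(env=os.environ):
    """Set both FP64 emulation variables to "1" and return what was set."""
    for name in FP64_EMULATION_VARS:
        env[name] = "1"
    return {name: env[name] for name in FP64_EMULATION_VARS}


# Assumption: this runs before the first `import torch`, so that the driver
# and the online compiler see the variables when the XPU runtime starts.
settings = enable_fp64_emulation()
print(settings)
```

Whether setting the variables this late is early enough for the compile-time variable has not been verified here; exporting them in the shell remains the safer route.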
I'm using an Arc A770.
https://github.com/intel/torch-xpu-ops/compare/main...uniartisan:torch-xpu-ops:main
Issue Overview: There's a problem converting tensors to bf16 (BFloat16), or to other data types, on XPU devices. Specifically, calling tensor.to(bf16) doesn't work correctly.
Call Path:
if (dst_device.is_xpu() && src_device.is_xpu()) {
  copy_device_to_device(iter, non_blocking, p2p_enabled);
  return;
}
copy_device_to_device
bool memcpy_eligible =
    same_type && same_conj && same_neg && iter.is_contiguous();
Device dst_device = iter.device(0);
Device src_device = iter.device(1);
c10::DeviceGuard device_guard(src_device);
// We always perform the copy on the source device, using the current stream
// on the source device, and we fully synchronize on both src and dst's
// current streams for completion of the copy.
XPUStream copy_stream = getCurrentXPUStream(src_device.index());
if (src_device != dst_device) {
  // This is a cross-device copy on the src current stream and dst current
  // stream. We perform a two-way barrier between both devices' streams
  // before the copy. This ensures that any write-after-write and
  // write-after-read dependencies on the destination side are handled, so
  // that no one is operating on the dst memory when we perform the copy.
  // src waits on dst barrier (src already waits on src)
  XPUEvent dst_ready;
  device_guard.set_index(dst_device.index());
  dst_ready.record(getCurrentXPUStream(dst_device.index()));
  device_guard.set_index(src_device.index());
  dst_ready.block(copy_stream);
}
if (memcpy_eligible) {
  // SYCL queue.memcpy performance is worse than SYCL copy kernel
  // implementation. JIRA:
  // https://jira.devtools.intel.com/browse/CMPLRLLVM-41292
  memcpyAsync(iter, copy_stream, p2p_enabled);
} else {
  if (same_neg) {
    if (!same_conj) {
      conj_kernel(iter);
    } else {
      copy_kernel(iter);
    }
  } else {
    if (!same_conj) {
      neg_conj_kernel(iter);
    } else {
      neg_kernel(iter);
    }
  }
}
Because same_type is false, memcpy_eligible is false, so execution finally reaches copy_kernel.
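The branch selection in copy_device_to_device can be modelled in a few lines of Python. This is only a sketch of the control flow quoted above: the function select_copy_path is hypothetical and returns the name of the branch as a string instead of launching a kernel.

```python
def select_copy_path(same_type, same_conj, same_neg, is_contiguous):
    """Return the name of the branch the quoted C++ code would take."""
    memcpy_eligible = same_type and same_conj and same_neg and is_contiguous
    if memcpy_eligible:
        return "memcpyAsync"
    if same_neg:
        return "copy_kernel" if same_conj else "conj_kernel"
    return "neg_kernel" if same_conj else "neg_conj_kernel"


# float32 -> float16 on the same device: only the dtype differs.
path = select_copy_path(
    same_type=False, same_conj=True, same_neg=True, is_contiguous=True
)
print(path)
```

With a dtype mismatch the memcpy shortcut is ruled out regardless of contiguity, which is why x.to(torch.float16) ends up in copy_kernel.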
9. finally copy_kernel function
Core Issue:
In the final _copy_xpu function, the conversion from other types to the destination dtype is not handled properly. This function mainly deals with device-to-device copying and does not specifically handle data type conversions.
However, when I compared this with PyTorch's implementation, I found a difference in the final copy_kernel, so I chose to modify that final function.
In PyTorch's implementation, there's a check for type mismatch between the destination and source tensors, which then calls a copy_kernel. This is a crucial point that needs to be addressed in the XPU implementation.
https://github.com/pytorch/pytorch/blob/26383a6cc0197a30fee3d2d3f0626ed342fc9a28/aten/src/ATen/native/cpu/CopyKernel.cpp#L287
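To illustrate why a dtype mismatch needs a per-element cast rather than a byte copy, here is a small standalone sketch that re-encodes float32 values as bfloat16 bit patterns. It is not taken from PyTorch or torch-xpu-ops; the helper names are hypothetical, and it relies only on the fact that bfloat16 keeps the upper 16 bits of the float32 encoding.

```python
import struct


def float32_to_bfloat16_bits(value):
    """Convert a float to its 16-bit bfloat16 pattern (round-to-nearest-even).

    NaN inputs are not special-cased in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Add a bias so that dropping the low 16 bits rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(bits16):
    """Widen a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


src = [1.0, -2.5, 3.140625]
dst_bits = [float32_to_bfloat16_bits(v) for v in src]
print([hex(b) for b in dst_bits])
print([bfloat16_bits_to_float(b) for b in dst_bits])
```

Each source element occupies four bytes and each destination element two, so copying the raw bytes would not produce valid bfloat16 data; every element has to pass through a conversion like the one above.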
In the specific case of copy operations, using `common_dtype()` may not be the best choice, which is why using `iter.dtype(0)` in your modification is more appropriate:
1. Specificity of copy operations: In copy operations, we typically want the type of the target tensor to remain unchanged, rather than changing to some "common" type.
2. Explicit target type: Using `iter.dtype(0)` explicitly specifies the target type we want to copy to, which makes more sense in copy operations.
3. Avoiding unnecessary conversions: If `common_dtype()` is used, it might lead to some unnecessary type conversions, especially when the source and target types are already different.
4. Maintaining original intent: In copy operations, users usually expect the data to be copied into a tensor of the specified type, rather than being converted to some common type.
However, the error still points at fp64.
When I set:
export OverrideDefaultFP64Settings=1
I don't need to set:
export IGC_EnableDPEmulation=1
and the tensor conversion now works normally.
I find:
This tells the compiler to generate code that supports fp64. However, I didn't use fp64...
Update:
template <typename scalar_t, typename return_t = scalar_t, typename func_t>
void opmath_symmetric_gpu_kernel_with_scalars(
    TensorIteratorBase& iter,
    const func_t& f) {
  // Use symmetric property of the functor to reduce number of kernels,
  // requires f(a, b) == f(b, a)
  TORCH_INTERNAL_ASSERT(iter.ntensors() == 3);
  using traits = function_traits
  OptionalDeviceGuard device_guard;
  opmath_arg_t scalar_val{};
  if (iter.is_cpu_scalar(1)) {
    scalar_val = iter.scalar_value
    device_guard.reset_device(iter.device(1));
  } else if (iter.is_cpu_scalar(2)) {
    scalar_val = iter.scalar_value
  if (iter.ninputs() == 2) {
    gpu_kernel(iter, BinaryFunctor<scalar_t, scalar_t, return_t, func_t>(f));
  } else {
    AUnaryFunctor<scalar_t, scalar_t, return_t, func_t> unary_f(f, scalar_val);
    gpu_kernel(iter, unary_f);
  }
}
I see a cast here: torch-xpu-ops/src/Aten/native/xpu/sycl/Loops.h
I think I ran into a similar issue with print(tensor), where PyTorch tried to cast fp16/bf16 to fp32. Everything looked similar: no fp64 usage on the user side, yet still the runtime error. The stack you describe above also looks similar. I can't say that your problem is 100% the same as what I saw, since I did not debug yours, but here is the outcome of debugging my issue. It turns out that the XPU copy kernel is implemented via PyTorch CPU functions, which the SYCL compiler compiles into the GPU kernel. In my case, the kernel is implemented through this function: https://github.com/pytorch/pytorch/blob/56bb047449b8fa8406662d447512da0b9e4c147e/c10/core/DynamicCast.h#L73:
template <typename dest_t>
C10_HOST_DEVICE inline dest_t fetch_and_cast(
    const ScalarType src_type,
    const void* ptr) {
  switch (src_type) {
    AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(FETCH_AND_CAST_CASE)
    FETCH_AND_CAST_CASE(uint16_t, UInt16)
    FETCH_AND_CAST_CASE(uint32_t, UInt32)
    FETCH_AND_CAST_CASE(uint64_t, UInt64)
    default:
      ERROR_UNSUPPORTED_CAST
  }
  return dest_t(0); // just to avoid compiler warning
}
Note that this is a single kernel handling multiple cast paths via a switch. As a result, although the operations requested by the user did not need any fp64 ops, the kernel itself is designed in such a way that it contains an fp64 path. Even if that path is not triggered by the user's request, the kernel may fail to compile (via the online compiler, as I explained before), and probably even fail at runtime if the stack formally checks the kernel for fp64 ops even when they won't actually be executed (this is my theory of how the stack behaves).
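The theory above can be sketched as a toy model in Python. Everything here is hypothetical: the branch table, the feature names and the helper functions are illustrative only and do not reflect how the SYCL compiler or driver actually inspects kernels. The point is just that a check over the whole kernel sees the union of all switch branches, not only the branch a given call would execute.

```python
# Hypothetical model: each branch of the cast switch lists the device
# features it would need. Only the Double branch needs fp64.
CAST_BRANCHES = {
    "Float": set(),
    "Half": {"fp16"},
    "BFloat16": set(),
    "Double": {"fp64"},
}


def kernel_requirements(branches):
    """Features needed to build the whole kernel: union over all branches."""
    required = set()
    for features in branches.values():
        required |= features
    return required


def can_compile(branches, device_features):
    """A whole-kernel check fails if any branch needs a missing feature."""
    return kernel_requirements(branches) <= device_features


def branch_can_run(branches, src_type, device_features):
    """A per-branch check only looks at the path actually taken."""
    return branches[src_type] <= device_features


arc_like = {"fp16"}  # assumed feature set: fp16 available, no native fp64

print(branch_can_run(CAST_BRANCHES, "BFloat16", arc_like))
print(can_compile(CAST_BRANCHES, arc_like))
```

In this model the bf16 path on its own is fine, while the kernel as a whole is rejected because of the fp64 branch it carries.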
I see. By the way, I have updated my debug path above.
I remember that with the previously compiled version (the one I hadn't modified), although fp64 was emulated, I noticed that the data types weren't cast correctly. I fixed the casting through debugging and the changes described above. However, since compilation is very slow on my computer, I haven't re-verified the initial version. If it's convenient for you, could you check whether the cast is correct after .to(bf16) when the emulation environment variable is set? Regarding the issue of throwing a warning and then terminating, do we have any solutions? For example, could we enable fp64 emulation by default and print warning logs when users actually use it? https://github.com/pytorch/pytorch/blob/26383a6cc0197a30fee3d2d3f0626ed342fc9a28/torch/xpu/__init__.py
We can add:
import os
# Intended for torch/xpu/__init__.py, where torch and
# get_device_capability are already in scope.
def set_xpu_environment():
    # Indices of XPU devices that report no native FP64 support.
    incompatible_devices = [i for i in range(torch.xpu.device_count())
                            if not get_device_capability(i).get('has_fp64', True)]
    if incompatible_devices:
        os.environ['OverrideDefaultFP64Settings'] = '1'
        os.environ['IGC_EnableDPEmulation'] = '1'
        print(f"FP64 incompatible devices found: {incompatible_devices}")
        print("Set OverrideDefaultFP64Settings and IGC_EnableDPEmulation to 1, FP64 may be slower.")
set_xpu_environment()
@uniartisan Thanks for your feedback. You might hit two issues due to the lack of FP64 support on Arc.
tensor.to(dtype) uses dynamic cast, which contains a switch-case branch for FP64. For this case, we are evaluating how to work around it. Thanks!
@fengyuan14: I see #511 got merged. However, I see that kernel compilation for Arc (the build flag) was dropped from this PR. Can you please help me understand what the plan for Arc is? Will such a build flag be added in another PR, or will Arc be supported via online kernel compilation?
We are investigating the binary size, build time, reasonableness and compatibility of enabling the Arc AOT build. Once all of these are acceptable, the Arc AOT build will be enabled by default when building from source.
BTW, you can also enable the Arc AOT build by setting TORCH_XPU_ARCH_LIST= to override the default targets.
As for the binary release, I think there should be some wheel that includes Arc AOT. We are evaluating whether a unified wheel including AOT builds for all verified platforms is acceptable, reasonable and compatible, or whether there should be a separate wheel for Arc AOT.
For example, Arc wheels built on LTS driver (different IGC) may not work. So I am not sure of it, but guess,
You can reach out to River or Eikan about the release distributions. @riverliuintel @EikanWang Please correct me.
For the PyTorch 2.5 release, ARC with partial FP64 emulation will be supported by source build on the rolling driver. Whether to include ARC support in the PyTorch 2.5 AOT binary release is an open question for now. We are actively pushing in this direction.
Confirmed. The PyTorch 2.5 release binary will support ARC AOT with partial FP64 emulation.
🐛 Describe the bug
Description:
I've encountered an issue while using the Intel XPU backend with PyTorch:
When trying to create a random tensor and convert it to bfloat16, I receive the following error:
This error suggests that the XPU device does not support fp64 operations. However, the torch.randn() function seems to be attempting to use fp64 internally before converting to the desired dtype.
Proposed solution: Consider modifying the torch.randn() implementation to use fp32 as an intermediate type when fp64 is not supported on the device.
Thank you for your attention to these matters. Let me know if you need any additional information.
Versions