Potential regression: torch.autograd.backward now requires FP64 on XPU - invalid_kernel("uses-fp64-math")

substanc3-dev commented 1 year ago

Describe the bug

A change between v1.10.200+gpu and 1.13.10+xpu appears to have broken previously working code that fine-tunes a Flan-T5 model: the backward pass now requires FP64, which is not supported on Flex. The code worked on v1.10.200+gpu (inside the intel/intel-extension-for-pytorch:gpu Docker container that ships this version), but it fails inside the latest image on the intel/intel-extension-for-pytorch:xpu-flex tag, which ships 1.13.10+xpu. One potential culprit could be the added CPU support causing the application to fall back to FP64, which those platforms support. I wasn't able to investigate very deeply, though, so that's just a guess. Any workarounds or fixes would be appreciated.

Reproducible example: https://gist.github.com/substanc3-dev/1f497b2a308b7dc84fa5fc3f32fab759

The container is being run inside Docker Desktop on Windows 11 (22H2 retail non-insider 22621.1265) with the 31.0.101.4146 driver installed.

The full error:

RuntimeError                              Traceback (most recent call last)
Cell In[13], line 1
----> 1 training_function(model)

Cell In[12], line 24, in training_function(model)
     23 loss = outputs['loss']
---> 24 loss.backward()
     25 optimizer.step()
     26 lr_scheduler.step()

File /usr/local/lib/python3.10/dist-packages/torch/_tensor.py:487, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    477 if has_torch_function_unary(self):
    478     return handle_torch_function(
    479         Tensor.backward,
    480         (self,),
   (...)
    485         inputs=inputs,
    486     )
--> 487 torch.autograd.backward(
    488     self, gradient, retain_graph, create_graph, inputs=inputs
    489 )

File /usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:197, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    192     retain_graph = create_graph
    194 # The reason we repeat same the comment below is that
    195 # some Python versions print out the first line of a multi-line function
    196 # calls in the traceback and some print out the last line
--> 197 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    198     tensors, grad_tensors_, retain_graph, create_graph, inputs,
    199     allow_unreachable=True, accumulate_grad=True)

RuntimeError: Native API failed. Native API returns: -996 (Function exists but address is not available)
invalid_kernel("uses-fp64-math")
 -996 (Function exists but address is not available)
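
When triaging, it can help to tell this failure apart from other runtime errors raised during training. The following is a minimal sketch, assuming the error text always contains the `uses-fp64-math` marker shown above. The names `is_fp64_kernel_error` and `run_step` are hypothetical helpers for illustration, not part of PyTorch or IPEX.

```python
FP64_MARKER = "uses-fp64-math"  # substring copied from the error text above


def is_fp64_kernel_error(exc: BaseException) -> bool:
    """Return True if the exception looks like the FP64 kernel failure."""
    return isinstance(exc, RuntimeError) and FP64_MARKER in str(exc)


def run_step(step):
    """Run one training step (any zero-argument callable, hypothetical).

    FP64 kernel failures are re-raised with a clearer hint; every other
    error propagates unchanged.
    """
    try:
        return step()
    except RuntimeError as exc:
        if is_fp64_kernel_error(exc):
            raise RuntimeError(
                "A kernel in this step needs FP64, which the device does not support"
            ) from exc
        raise
```

Wrapping `loss.backward()` in a callable passed to `run_step` would keep the original traceback attached as the cause while making the FP64 requirement explicit.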

Versions

[W OperatorEntry.cpp:150] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: torchvision::nms
    no debug info
  dispatch key: CPU
  previous kernel: registered at /build/intel-pytorch-extension/csrc/cpu/aten/TorchVisionNms.cpp:47
       new kernel: registered at /opt/workspace/vision/torchvision/csrc/ops/cpu/nms_kernel.cpp:112 (function registerKernel)
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
Collecting environment information...
PyTorch version: 1.13.0a0+gitb1dde16
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.10+xpu
IPEX commit: 7d85b0e92
Build type: Release

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Graphics [0x56a0]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=13004MB, max_compute_units=512)
Intel OpenCL ICD version: 22.43.24595.35+i538~22.04
Level Zero version: 1.3.24595.35+i538~22.04

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          24
On-line CPU(s) list:             0-23
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 9 5900X 12-Core Processor
CPU family:                      25
Model:                           33
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
Stepping:                        0
BogoMIPS:                        7400.05
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization:                  AMD-V
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       384 KiB (12 instances)
L1i cache:                       384 KiB (12 instances)
L2 cache:                        6 MiB (12 instances)
L3 cache:                        32 MiB (1 instance)
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.10+xpu
[pip3] numpy==1.24.1
[pip3] torch==1.13.0a0+gitb1dde16
[pip3] torchvision==0.14.1a0+0504df5
[conda] N/A
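
The environment report above already shows the relevant fact: the device line contains `support_fp64=0`. As a rough sketch, that field can be pulled out of a saved report with a regular expression. This assumes the report keeps the `support_fp64=<digit>` format shown above; the function name `fp64_support_from_report` is hypothetical.

```python
import re
from typing import List

# Matches the "support_fp64=0" / "support_fp64=1" field in a device line.
_FP64_FIELD = re.compile(r"support_fp64=(\d)")


def fp64_support_from_report(report: str) -> List[bool]:
    """Return one bool per device line found in the report text."""
    return [match.group(1) != "0" for match in _FP64_FIELD.finditer(report)]


if __name__ == "__main__":
    line = (
        "[0] _DeviceProperties(name='Intel(R) Graphics [0x56a0]', "
        "platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, "
        "total_memory=13004MB, max_compute_units=512)"
    )
    print(fp64_support_from_report(line))
```

An empty list means no device line was found, which is different from a device that reports no FP64 support.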

jingxu10 commented 1 year ago

Thanks for reporting this issue. We will look into it.

jingxu10 commented 1 year ago

The issue seems to come from the tanhBackward kernel within NewGeluActivation. We will look into it further.
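
For context, the tanh-based GELU approximation that `NewGELUActivation` in Hugging Face Transformers implements, and its derivative, can be sketched in plain Python. This only illustrates where a tanh backward term arises in the gradient; it is not the IPEX kernel, and the function names are hypothetical. Python floats are double precision, so the sketch is for reading the math, not for reproducing the device behavior.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)
COEFF = 0.044715  # cubic coefficient of the tanh approximation


def new_gelu(x: float) -> float:
    """Tanh approximation of GELU: 0.5 * x * (1 + tanh(u(x)))."""
    inner = SQRT_2_OVER_PI * (x + COEFF * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def new_gelu_grad(x: float) -> float:
    """Analytic derivative of new_gelu with respect to x."""
    inner = SQRT_2_OVER_PI * (x + COEFF * x ** 3)
    t = math.tanh(inner)
    d_inner = SQRT_2_OVER_PI * (1.0 + 3.0 * COEFF * x ** 2)
    # d/dx tanh(u) = (1 - tanh(u)^2) * du/dx: this is the tanh backward term.
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * d_inner
```

The `(1 - t * t)` factor is the part a tanh backward kernel computes; if an implementation promotes that computation to double precision, it would need FP64 support on the device.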

vishnumadhu365 commented 1 year ago

The issue will be fixed in the next release of IPEX XPU.

vishnumadhu365 commented 1 year ago

@substanc3-dev in the meantime, if it helps, you could try the latest version by building from source. The references below should help:

deepglugs commented 1 year ago

> The issue will be fixed in the next release of IPEX XPU

When is the next release for XPU? The last release was almost 4 months ago.

jingxu10 commented 1 year ago

It will take some time before the next release. For now, you can compile from source with https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/scripts/compile_bundle.sh. Please try it with oneAPI Base Toolkit 2023.1 and the latest driver.

intel / intel-extension-for-pytorch, issue #307
