intel / torch-xpu-ops

Apache License 2.0

Logcumsumexp has different results between CPU and XPU on BF16/Complex64/Complex128 #1012

Closed LuFinch closed 1 week ago

LuFinch commented 1 month ago

🐛 Describe the bug

- BF16

Mismatched elements: 2 / 125 (1.6%)
Greatest absolute difference: 0.03125 at index (1, 4, 2) (up to 0.001 allowed)
Greatest relative difference: 0.006072998046875 at index (2, 3, 1) (up to 0.001 allowed)

cpu output at (1, 4, 2): tensor(6.1875, dtype=torch.bfloat16)
xpu output at (1, 4, 2): tensor(6.1562, device='xpu:0', dtype=torch.bfloat16)


- Complex128

PYTORCH_TEST_WITH_SLOW=1 python test/xpu/extended/test_ops_xpu.py TestCommonXPU.test_compare_cpu_logcumsumexp_xpu_complex128

Mismatched elements: 2 / 125 (1.6%)
Greatest absolute difference: 12.566370614359174 at index (3, 3, 0) (up to 0.001 allowed)
Greatest relative difference: 1.5103243157406059 at index (3, 4, 0) (up to 0.001 allowed)

cpu output at (3, 3, 0): tensor(7.4356+3.7336j, dtype=torch.complex128)
xpu output at (3, 3, 0): tensor(7.4356-8.8328j, device='xpu:0', dtype=torch.complex128)


- Complex64

test_reductions_xpu.py::TestReductionsXPU::test_logcumsumexp_complex_xpu_complex64

Mismatched elements: 1 / 3 (33.3%)
Greatest absolute difference: nan at index (2,) (up to 1e-05 allowed)
Greatest relative difference: nan at index (2,) (up to 1.3e-06 allowed)

input:       [1e3 + 0j, 1e-18 + 1e4j, 1e2 + 1e-8j]
cpu_output:  [1000.+0.j, 1000.+0.j, 1000.+0.j]
cuda_output: [1000.+0.j, 1000.+0.j, 1000.+0.j]
xpu_output:  [1000.+0.j, 1000.+0.j, nan + nanj]



For complex64, I found that the nan is caused by accumulation order: in this case our XPU scan kernel first combines input[1] with input[2], and only then combines input[0] into that partial result. However, even the CPU kernel outputs nan+nanj when directly computing logcumsumexp over just [input[1], input[2]].
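The kind of instability described above can be illustrated outside PyTorch. Below is a minimal NumPy sketch (a hypothetical `logaddexp_complex` helper, not the actual CPU/XPU kernel) combining input[1] and input[2] from the failing test in complex64. The naive form `log(exp(z1) + exp(z2))` evaluates `exp(100)`, which overflows float32 and yields a non-finite result, while shifting by the larger real part before exponentiating stays finite:

```python
import numpy as np

def logaddexp_complex(z1, z2):
    # Stable complex log-add-exp: shift by the operand with the larger real
    # part so the remaining exponential cannot overflow.
    # (Illustrative helper, not PyTorch's combine function.)
    if z1.real >= z2.real:
        big, small = z1, z2
    else:
        big, small = z2, z1
    return big + np.log1p(np.exp(small - big))

z1 = np.complex64(1e-18 + 1e4j)  # input[1] from the failing test
z2 = np.complex64(1e2 + 1e-8j)   # input[2] from the failing test

with np.errstate(over='ignore', invalid='ignore'):
    naive = np.log(np.exp(z1) + np.exp(z2))  # exp(100) overflows float32

stable = logaddexp_complex(z1, z2)

print(np.isfinite(naive))   # naive result is non-finite in complex64
print(stable)               # finite, real part close to 100
```

The same shift trick is why the float64 (complex128) path tolerates these inputs: `exp(100)` is representable there, so even a less careful evaluation order happens not to overflow.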

### Versions

Related PR: https://github.com/intel/torch-xpu-ops/pull/931