intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[torchbench][accuracy] functorch_dp_cifar10 accuracy check failed #460

Open alexbaden opened 9 months ago

alexbaden commented 9 months ago
» benchmarks/dynamo/torchbench.py --float32 -dxpu -n10 --no-skip --dashboard --training --inductor --accuracy --output /tmp/torchbench.csv --filter functorch_dp_cifar10

loading model: 0it [00:01, ?it/s]
xpu  train functorch_dp_cifar10               
/localdisk/abaden/Projects/envs/triton-benchmark-env/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
/localdisk/abaden/Projects/envs/triton-benchmark-env/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
skipping cudagraphs for unknown reason
[2024-02-05 21:45:17,997] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.00122, (ref-fp64): 0.00000 and shape=torch.Size([64])
[2024-02-05 21:45:17,997] torch._dynamo.utils: [ERROR] Accuracy failed for key name bn1.bias.grad
fail_accuracy
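For context, the accuracy mode compares each tensor from the compiled run against an fp64 eager reference and reports an RMSE per tensor, as in the bn1.bias.grad line above. A minimal sketch of that kind of check (the tolerance here is an illustrative assumption, not the exact torchbench threshold):

import torch

# Illustrative RMSE-based accuracy check; tol is an assumed value.
def rmse_ok(res, ref_fp64, tol=1e-3):
    rmse = torch.sqrt(torch.mean((res.double() - ref_fp64.double()) ** 2))
    return rmse.item() <= tol

# A shape-[64] gradient tensor, as in the failing bn1.bias.grad comparison.
ref = torch.zeros(64, dtype=torch.float64)
res = torch.full((64,), 1.22e-3)   # RMSE ~0.00122, like the reported value
print(rmse_ok(res, ref))           # False -> reported as fail_accuracy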
whitneywhtsang commented 9 months ago

Also fails with v2.1.

ienkovich commented 8 months ago

I was able to reduce the original benchmark to a simple model with four layers:

import torch
import torch.nn as nn

class TestModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, groups=1, bias=False, dilation=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)
        self.norm = nn.GroupNorm(32, 128)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.norm(x)
        x = self.relu(x)
        return x

example_inputs = torch.randn(4, 64, 4, 4)
...
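A minimal sketch of how eager and inductor training can be compared on this reduced model (illustrative only: the device placement, the sum() loss, and the tolerance are assumptions, not necessarily the harness used here):

import copy
import torch

torch.manual_seed(0)
model_eager = TestModel().to('xpu')
model_compiled = torch.compile(copy.deepcopy(model_eager))  # inductor is the default backend

x = torch.randn(4, 64, 4, 4, device='xpu')
model_eager(x).sum().backward()
model_compiled(x).sum().backward()

# Compare parameter gradients between the eager and compiled runs.
for i, (p_e, p_c) in enumerate(zip(model_eager.parameters(), model_compiled.parameters())):
    if not torch.allclose(p_e.grad, p_c.grad, atol=1e-4):
        print(f"gradient mismatch in parameter {i}")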

For this model, training results on XPU differ between eager and inductor modes (while XPU eager matches CPU eager). Going through the code generated by TorchInductor, I found that the difference appears in the backward convolution, which is a torch.ops.aten call, so Triton is not involved. With this knowledge, I was able to write a simple test showing that this operation behaves differently on CPU and XPU devices (all tensor sizes and convolution parameters match the reproducer above):

import torch
import intel_extension_for_pytorch
from torch._dynamo.testing import rand_strided, same

torch.manual_seed(1337)
# grad_output, input, and weight tensors; sizes and strides match the reduced model's
# second convolution (64 -> 128 channels, 1x1 kernel, stride 2).
arg1 = rand_strided((4, 128, 1, 1), (128, 1, 1, 1), device='cpu', dtype=torch.float32)
arg2 = rand_strided((4, 64, 2, 2), (256, 1, 128, 64), device='cpu', dtype=torch.float32)
arg3 = rand_strided((128, 64, 1, 1), (64, 1, 1, 1), device='cpu', dtype=torch.float32)

def run_conv_bwd(arg1, arg2, arg3, device):
    arg1_dev = arg1.to(device)
    arg2_dev = arg2.to(device)
    arg3_dev = arg3.to(device)
    # convolution_backward(grad_output, input, weight, bias_sizes, stride, padding,
    #                      dilation, transposed, output_padding, groups, output_mask)
    res = torch.ops.aten.convolution_backward(arg1_dev, arg2_dev, arg3_dev, [0], [2, 2], [0, 0], [1, 1], False, [0, 0], 1, [True, True, False])
    # Move results back to CPU so they can be compared across devices.
    res = tuple(v.to('cpu') if v is not None else v for v in res)
    return res

cpu_res = run_conv_bwd(arg1, arg2, arg3, 'cpu')
xpu_res = run_conv_bwd(arg1, arg2, arg3, 'xpu')

print(f"CPU result:\n{cpu_res}")
print(f"XPU result:\n{xpu_res}")

assert same(cpu_res, xpu_res)
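On an affected XPU setup, the two printed results are expected to differ and the final same(...) assertion to fail, mirroring the accuracy error reported by the original benchmark.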
vlad-penkin commented 5 months ago

The issue is still reproducible.

Env: