intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

[torchbench][accuracy] demucs accuracy check failed #459

Open alexbaden opened 9 months ago

alexbaden commented 9 months ago
» benchmarks/dynamo/torchbench.py --float32 -dxpu -n10 --no-skip --dashboard --training --inductor --accuracy --output /tmp/torchbench.csv --filter demucs

loading model: 0it [00:05, ?it/s]
xpu  train demucs                             
WARNING:common:fp64 golden ref were not generated for demucs. Setting accuracy check to cosine
/localdisk/abaden/Projects/envs/triton-benchmark-env/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
/localdisk/abaden/Projects/envs/triton-benchmark-env/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
[2024-02-05 21:43:53,107] torch._dynamo.utils: [WARNING] Similarity score=0.00015774701023474336
fail_accuracy
whitneywhtsang commented 9 months ago

Also fails with v2.1.

ienkovich commented 9 months ago

demucs fails only in training, and the failure is caused by random number usage during training. Training passes on CPU but fails on XPU in both eager and Inductor modes. It looks like this happens because the RNG state is not reset properly for XPU between model runs.

There are at least two places where torch.manual_seed is replaced with an implementation that is not XPU-enabled:
https://github.com/weishi-deng/benchmark/blob/main/torchbenchmark/util/env_check.py#L133
https://github.com/weishi-deng/benchmark/blob/main/userbenchmark/dynamo/dynamobench/common.py#L329
Fixing both lets demucs pass on XPU in eager mode.

Here is the patch I use:

diff --git a/torchbenchmark/util/env_check.py b/torchbenchmark/util/env_check.py
index 956fdb4f..bef4924b 100644
--- a/torchbenchmark/util/env_check.py
+++ b/torchbenchmark/util/env_check.py
@@ -125,6 +125,8 @@ def set_random_seed():

         if not torch.cuda._is_in_bad_fork():
             torch.cuda.manual_seed_all(seed)
+        if hasattr(torch, 'xpu') and not torch.xpu._is_in_bad_fork():
+            torch.xpu.manual_seed_all(seed)
         return default_generator.manual_seed(seed)

     torch.manual_seed(MAIN_RANDOM_SEED)
diff --git a/userbenchmark/dynamo/dynamobench/common.py b/userbenchmark/dynamo/dynamobench/common.py
index 831dfe06..1bf14e96 100644
--- a/userbenchmark/dynamo/dynamobench/common.py
+++ b/userbenchmark/dynamo/dynamobench/common.py
@@ -320,10 +320,17 @@ def patch_torch_manual_seed():
         from torch._C import default_generator

         seed = 1337
-        import torch.cuda

-        if not torch.cuda._is_in_bad_fork():
-            torch.cuda.manual_seed_all(seed)
+        try:
+            import intel_extension_for_pytorch
+
+            if torch.xpu.is_available() and not torch.xpu._is_in_bad_fork():
+                torch.xpu.manual_seed_all(seed)
+        except:
+            import torch.cuda
+
+            if torch.cuda.is_available() and not torch.cuda._is_in_bad_fork():
+                torch.cuda.manual_seed_all(seed)
         return default_generator.manual_seed(seed)

     torch.manual_seed = deterministic_torch_manual_seed
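
As a quick sanity check that the XPU RNG state is actually reset by this seeding path, here is a minimal sketch; it assumes torch.xpu is exposed as in the patch above (e.g. via intel_extension_for_pytorch) and an XPU device is present.

import torch

try:
    import intel_extension_for_pytorch  # noqa: F401  # exposes torch.xpu on this stack
except ImportError:
    raise SystemExit("intel_extension_for_pytorch not installed; nothing to check")

if not torch.xpu.is_available():
    raise SystemExit("no XPU device available")

# Reseed, draw, reseed with the same value, draw again: the two draws must match
# if manual_seed_all really resets the device RNG state.
torch.xpu.manual_seed_all(1337)
a = torch.rand(4, device="xpu")

torch.xpu.manual_seed_all(1337)
b = torch.rand(4, device="xpu")

assert torch.equal(a, b), "XPU RNG state was not reset by manual_seed_all"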
ienkovich commented 9 months ago

When we run the benchmark with Inductor, random number generation goes through a Triton kernel, and the seed used by that kernel is itself generated with the aten.randint operation. So I believe eager mode and Inductor will always produce different random number sequences, because they use different seeds and different generation algorithms. It doesn't look like different backends are required to agree on random number generation, so their results are simply incomparable.

Inductor wrapper code where the seed is generated and stored in buf0, which is later passed to the Triton kernel:

def call(args):
    buf0 = empty_strided((1, ), (1, ), device='xpu', dtype=torch.int64)
    # Source Nodes: [], Original ATen: []
    aten.randint.low_out(-9223372036854775808, 9223372036854775807, [1], out=buf0)
    buf1 = empty_strided((4, 4, 1, 1), (4, 1, 1, 1), device='xpu', dtype=torch.int64)
    # Source Nodes: [], Original ATen: []
    stream0 = get_xpu_stream(0)
    triton_poi_fused_0.run(buf0, buf1, 0, 16, grid=grid(16), stream=stream0)
    return (buf1, )

Triton kernel where the seed (tmp0) is used for randint64:

def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + load_seed_offset)
    tmp1 = x0
    tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 44100)
    tl.store(out_ptr0 + (x0), tmp2, xmask)
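
To illustrate why the two streams can never match element for element, here is a minimal sketch (it only reproduces the seed-generation step; triton_helpers.randint64 itself stays inside the kernel, and the CPU fallback is an assumption for convenience): with the same torch.manual_seed, eager mode samples directly from the framework generator, while Inductor first draws one 64-bit seed and hands it to a separate generator in the Triton kernel.

import torch

# Use XPU when available (as in the failing runs above), otherwise fall back to CPU.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

# Eager path: values come straight from the framework RNG stream.
torch.manual_seed(1337)
eager_vals = torch.randint(0, 44100, (16,), device=device)

# Inductor path (simplified): the generated wrapper draws a single 64-bit seed
# (buf0 above) and passes it to triton_helpers.randint64 inside the Triton kernel.
torch.manual_seed(1337)
buf0 = torch.empty(1, dtype=torch.int64, device=device)
torch.ops.aten.randint.low_out(-9223372036854775808, 9223372036854775807, [1], out=buf0)

print(eager_vals)  # eager RNG stream
print(buf0)        # only this seed reaches the Triton kernel, so the streams differ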
vlad-penkin commented 5 months ago

This issue is still reproducible.

Env: