intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

[torchbench][accuracy] demucs accuracy check failed #459

Open alexbaden opened 9 months ago

alexbaden commented 9 months ago
» benchmarks/dynamo/torchbench.py --float32 -dxpu -n10 --no-skip --dashboard --training --inductor --accuracy --output /tmp/torchbench.csv --filter demucs

loading model: 0it [00:05, ?it/s]
xpu  train demucs                             
WARNING:common:fp64 golden ref were not generated for demucs. Setting accuracy check to cosine
/localdisk/abaden/Projects/envs/triton-benchmark-env/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
/localdisk/abaden/Projects/envs/triton-benchmark-env/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
[2024-02-05 21:43:53,107] torch._dynamo.utils: [WARNING] Similarity score=0.00015774701023474336
fail_accuracy
whitneywhtsang commented 9 months ago

Also fails with v2.1.

ienkovich commented 9 months ago

demucs fails only in training, and the failure is caused by random number usage during training. Training passes on CPU but fails on XPU in both eager and Inductor modes. It looks like this happens because the RNG state is not reset properly for XPU between model runs.

There are at least two places where torch.manual_seed is replaced with an implementation that is not XPU-enabled:
https://github.com/weishi-deng/benchmark/blob/main/torchbenchmark/util/env_check.py#L133
https://github.com/weishi-deng/benchmark/blob/main/userbenchmark/dynamo/dynamobench/common.py#L329
Fixing both lets demucs pass on XPU in eager mode.

Here is the patch I use:

diff --git a/torchbenchmark/util/env_check.py b/torchbenchmark/util/env_check.py
index 956fdb4f..bef4924b 100644
--- a/torchbenchmark/util/env_check.py
+++ b/torchbenchmark/util/env_check.py
@@ -125,6 +125,8 @@ def set_random_seed():

         if not torch.cuda._is_in_bad_fork():
             torch.cuda.manual_seed_all(seed)
+        if hasattr(torch, 'xpu') and not torch.xpu._is_in_bad_fork():
+            torch.xpu.manual_seed_all(seed)
         return default_generator.manual_seed(seed)

     torch.manual_seed(MAIN_RANDOM_SEED)
diff --git a/userbenchmark/dynamo/dynamobench/common.py b/userbenchmark/dynamo/dynamobench/common.py
index 831dfe06..1bf14e96 100644
--- a/userbenchmark/dynamo/dynamobench/common.py
+++ b/userbenchmark/dynamo/dynamobench/common.py
@@ -320,10 +320,17 @@ def patch_torch_manual_seed():
         from torch._C import default_generator

         seed = 1337
-        import torch.cuda

-        if not torch.cuda._is_in_bad_fork():
-            torch.cuda.manual_seed_all(seed)
+        try:
+            import intel_extension_for_pytorch
+
+            if torch.xpu.is_available() and not torch.xpu._is_in_bad_fork():
+                torch.xpu.manual_seed_all(seed)
+        except:
+            import torch.cuda
+
+            if torch.cuda.is_available() and not torch.cuda._is_in_bad_fork():
+                torch.cuda.manual_seed_all(seed)
         return default_generator.manual_seed(seed)

     torch.manual_seed = deterministic_torch_manual_seed
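
As a quick sanity check that the XPU RNG state is actually reset by this seeding path, here is a minimal sketch; it assumes torch.xpu is exposed as in the patch above (e.g. via intel_extension_for_pytorch) and an XPU device is present.

import torch

try:
    import intel_extension_for_pytorch  # noqa: F401  # exposes torch.xpu on this stack
except ImportError:
    raise SystemExit("intel_extension_for_pytorch not installed; nothing to check")

if not torch.xpu.is_available():
    raise SystemExit("no XPU device available")

# Reseed, draw, reseed with the same value, draw again: the two draws must match
# if manual_seed_all really resets the device RNG state.
torch.xpu.manual_seed_all(1337)
a = torch.rand(4, device="xpu")

torch.xpu.manual_seed_all(1337)
b = torch.rand(4, device="xpu")

assert torch.equal(a, b), "XPU RNG state was not reset by manual_seed_all"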
ienkovich commented 9 months ago

When we run the benchmark with Inductor, random number generation goes through a Triton kernel, and the seed used by that kernel is itself generated with the aten.randint operation. So I believe eager mode and Inductor will always produce different random number sequences, because they use different seeds and different generation algorithms. It doesn't look like different backends are required to agree on random number generation, so their results are simply incomparable.

Inductor wrapper code where the seed is generated and stored in buf0, which is later passed to the Triton kernel:

def call(args):
    buf0 = empty_strided((1, ), (1, ), device='xpu', dtype=torch.int64)
    # Source Nodes: [], Original ATen: []
    aten.randint.low_out(-9223372036854775808, 9223372036854775807, [1], out=buf0)
    buf1 = empty_strided((4, 4, 1, 1), (4, 1, 1, 1), device='xpu', dtype=torch.int64)
    # Source Nodes: [], Original ATen: []
    stream0 = get_xpu_stream(0)
    triton_poi_fused_0.run(buf0, buf1, 0, 16, grid=grid(16), stream=stream0)
    return (buf1, )

Triton kernel where the seed (tmp0) is used for randint64:

def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + load_seed_offset)
    tmp1 = x0
    tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 44100)
    tl.store(out_ptr0 + (x0), tmp2, xmask)
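
To illustrate why the two streams can never match element for element, here is a minimal sketch (it only reproduces the seed-generation step; triton_helpers.randint64 itself stays inside the kernel, and the CPU fallback is an assumption for convenience): with the same torch.manual_seed, eager mode samples directly from the framework generator, while Inductor first draws one 64-bit seed and hands it to a separate generator in the Triton kernel.

import torch

# Use XPU when available (as in the failing runs above), otherwise fall back to CPU.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

# Eager path: values come straight from the framework RNG stream.
torch.manual_seed(1337)
eager_vals = torch.randint(0, 44100, (16,), device=device)

# Inductor path (simplified): the generated wrapper draws a single 64-bit seed
# (buf0 above) and passes it to triton_helpers.randint64 inside the Triton kernel.
torch.manual_seed(1337)
buf0 = torch.empty(1, dtype=torch.int64, device=device)
torch.ops.aten.randint.low_out(-9223372036854775808, 9223372036854775807, [1], out=buf0)

print(eager_vals)  # eager RNG stream
print(buf0)        # only this seed reaches the Triton kernel, so the streams differ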
vlad-penkin commented 5 months ago

This issue is still reproducible.

Env: