Also fails with v2.1.
`demucs` fails only in training, and the failure comes from the use of random numbers during training. Training passes on CPU but fails on XPU in both eager and inductor modes. This appears to happen because the RNG state is not reset properly for XPU between model runs.
There are at least two places where `torch.manual_seed` is replaced with an implementation that is not XPU-enabled:
https://github.com/weishi-deng/benchmark/blob/main/torchbenchmark/util/env_check.py#L133
https://github.com/weishi-deng/benchmark/blob/main/userbenchmark/dynamo/dynamobench/common.py#L329
Fixing both lets `demucs` pass on XPU in eager mode.
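For context, here is a minimal sketch of the symptom (my assumption: an XPU-enabled PyTorch build where `torch.rand` works on `device="xpu"`). If the harness's patched `torch.manual_seed` only reseeds CUDA, two nominally seeded runs can still produce different random tensors on XPU:

```python
import torch

def run_once():
    # The benchmark harness replaces torch.manual_seed; if the replacement
    # only calls torch.cuda.manual_seed_all, the XPU generator is never reseeded.
    torch.manual_seed(1337)
    return torch.rand(4, device="xpu")

a = run_once()
b = run_once()
# True when the RNG state is reset properly between runs; False matches the
# mismatch seen with demucs on XPU.
print(torch.equal(a.cpu(), b.cpu()))
```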
Here is the patch I use:
diff --git a/torchbenchmark/util/env_check.py b/torchbenchmark/util/env_check.py
index 956fdb4f..bef4924b 100644
--- a/torchbenchmark/util/env_check.py
+++ b/torchbenchmark/util/env_check.py
@@ -125,6 +125,8 @@ def set_random_seed():
         if not torch.cuda._is_in_bad_fork():
             torch.cuda.manual_seed_all(seed)
+        if hasattr(torch, 'xpu') and not torch.xpu._is_in_bad_fork():
+            torch.xpu.manual_seed_all(seed)
         return default_generator.manual_seed(seed)
     torch.manual_seed(MAIN_RANDOM_SEED)
diff --git a/userbenchmark/dynamo/dynamobench/common.py b/userbenchmark/dynamo/dynamobench/common.py
index 831dfe06..1bf14e96 100644
--- a/userbenchmark/dynamo/dynamobench/common.py
+++ b/userbenchmark/dynamo/dynamobench/common.py
@@ -320,10 +320,17 @@ def patch_torch_manual_seed():
         from torch._C import default_generator
         seed = 1337
-        import torch.cuda
-        if not torch.cuda._is_in_bad_fork():
-            torch.cuda.manual_seed_all(seed)
+        try:
+            import intel_extension_for_pytorch
+
+            if torch.xpu.is_available() and not torch.xpu._is_in_bad_fork():
+                torch.xpu.manual_seed_all(seed)
+        except:
+            import torch.cuda
+
+            if torch.cuda.is_available() and not torch.cuda._is_in_bad_fork():
+                torch.cuda.manual_seed_all(seed)
         return default_generator.manual_seed(seed)
     torch.manual_seed = deterministic_torch_manual_seed
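For comparison, here is a variant of the same fix that seeds every available backend rather than choosing one via try/except (just a sketch; it assumes `torch.xpu` exposes `is_available`, `_is_in_bad_fork`, and `manual_seed_all` exactly as used in the patch above):

```python
import torch
from torch._C import default_generator

def deterministic_torch_manual_seed(*args, **kwargs):
    seed = 1337
    # Seed the CUDA generators when CUDA is present.
    if torch.cuda.is_available() and not torch.cuda._is_in_bad_fork():
        torch.cuda.manual_seed_all(seed)
    # Seed the XPU generators too, if an XPU backend is present.
    if hasattr(torch, "xpu") and torch.xpu.is_available() and not torch.xpu._is_in_bad_fork():
        torch.xpu.manual_seed_all(seed)
    return default_generator.manual_seed(seed)

torch.manual_seed = deterministic_torch_manual_seed
```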
When we run the benchmark with Inductor, random number generation goes through a Triton kernel, and the seed for that kernel is generated with an `aten.randint` operation. So I believe eager mode and Inductor will always produce different random number sequences, because they use different seeds and different generation algorithms. It does not look like different backends are required to be aligned in random number generation, so their results are simply not comparable.
Inductor-generated wrapper code where the seed is generated and stored in `buf0`, which is later passed to the Triton kernel:
def call(args):
    buf0 = empty_strided((1, ), (1, ), device='xpu', dtype=torch.int64)
    # Source Nodes: [], Original ATen: []
    aten.randint.low_out(-9223372036854775808, 9223372036854775807, [1], out=buf0)
    buf1 = empty_strided((4, 4, 1, 1), (4, 1, 1, 1), device='xpu', dtype=torch.int64)
    # Source Nodes: [], Original ATen: []
    stream0 = get_xpu_stream(0)
    triton_poi_fused_0.run(buf0, buf1, 0, 16, grid=grid(16), stream=stream0)
    return (buf1, )
Triton kernel where the seed (`tmp0`) is used for `randint64`:
def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + load_seed_offset)
    tmp1 = x0
    tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 44100)
    tl.store(out_ptr0 + (x0), tmp2, xmask)
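To make the incomparability concrete, here is a small sketch (not taken from the benchmark; it assumes `torch.compile` handles this op on XPU): the same seed does not make eager and Inductor outputs match, because Inductor seeds its Triton RNG through the `aten.randint` call above rather than through the eager generator.

```python
import torch

def sample():
    # Same range and shape as the generated kernel above: randint in [0, 44100).
    return torch.randint(0, 44100, (4, 4, 1, 1), device="xpu")

torch.manual_seed(1337)
eager_out = sample()

torch.manual_seed(1337)
compiled_out = torch.compile(sample)()  # Inductor routes RNG through the Triton kernel

# Expected to differ: the two paths use different seeds and generation algorithms.
print(torch.equal(eager_out, compiled_out))
```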
This issue is still reproducible.
Env:
9a8ab778d34bd24c5caceb340837483decc4c311
fe93a00ffe438e9ba8c8392c0b051b1662c810de
d54ca9f80ead108c8797441681e219becaf963d8
1980f8af5bcd0bb2ce51965cf79d8d4c25dad8a0
10239873229e527f8b7e7b3340c40ee38bb1cfc4