Open chenyuxyz opened 7 months ago
I don't see the crash on M3, but I do see the low accuracy.
weird, rebooted and it does not crash. gpu was likely in a bad state.
update the issue for low accuracy
I've also seen lower accuracy with IMAGE=1.
WINO and IMAGE are both affected by LAZYCACHE with JIT, and the eval step is not properly captured by the JIT. As a workaround, either commenting out the TinyJit of get_test_acc or adding LAZYCACHE=0 works.
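A minimal sketch of the first workaround, written so that the JIT on the eval step can be switched off without editing the decorator line each time. The helper name `maybe_jit` and the `EVAL_JIT` environment variable are made up for this illustration and are not part of tinygrad; the tinygrad usage is shown only in comments and assumes the script defines `get_test_acc` as discussed above.

```python
import os
from typing import Callable, Optional, TypeVar

F = TypeVar("F", bound=Callable)

def maybe_jit(jit: Callable[[F], F], enabled: Optional[bool] = None) -> Callable[[F], F]:
    """Return `jit` when enabled, otherwise an identity decorator.

    EVAL_JIT is a hypothetical variable invented for this sketch;
    it is not a tinygrad flag.
    """
    if enabled is None:
        enabled = os.environ.get("EVAL_JIT", "0") == "1"
    # Identity decorator: leaves the function un-jitted.
    return jit if enabled else (lambda fn: fn)

# Intended use in the training script (assumes tinygrad is installed):
#
#   from tinygrad import TinyJit
#
#   @maybe_jit(TinyJit)        # un-jitted unless EVAL_JIT=1 is set
#   def get_test_acc(): ...    # body unchanged from the original script
```

With the default (JIT disabled), this is equivalent to commenting out the TinyJit on get_test_acc; the train step can keep its own TinyJit.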
Can confirm this happens on M1 Max, and it keeps happening after a reboot:
loss: 0.43 test_accuracy: nan%: 13%|███████████████ | 9/70 [00:02<00:15, 4.00it/s]
Traceback (most recent call last):
File "/Users/evilsocket/lab/tinygrad_mnist.py", line 43, in <module>
if i%10 == 9: test_acc = get_test_acc().item()
^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/jit.py", line 153, in __call__
if len(params:=get_parameters(self.ret)): Tensor.realize(params[0], *params[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/tensor.py", line 201, in realize
run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 192, in run_schedule
ei.run(var_vals, do_update_stats=do_update_stats)
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 153, in run
et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 116, in __call__
self.copy(dest, src)
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 111, in copy
dest.copyin(src.as_buffer(allow_zero_copy=True)) # may allocate a CPU buffer depending on allow_zero_copy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/device.py", line 117, in copyin
self.allocator.copyin(self._buf, mv)
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 88, in copyin
def copyin(self, dest:Any, src:memoryview): self.as_buffer(dest)[:] = src
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 86, in as_buffer
self.device.synchronize()
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 102, in synchronize
for cbuf in self.mtl_buffers_in_flight: wait_check(cbuf)
^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 12, in wait_check
raise RuntimeError(error)
RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)" UserInfo={NSLocalizedDescription=Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim), NSUnderlyingError=0x6000014c1470 {Error Domain=IOGPUCommandQueueErrorDomain Code=5 "(null)"}}
Even with LAZYCACHE=0:
LAZYCACHE=0 python3.11 tinygrad_mnist.py
loss: 1.24 test_accuracy: nan%: 4%|█████ | 3/70 [00:01<00:38, 1.72it/s]
Traceback (most recent call last):
File "/Users/evilsocket/lab/tinygrad_mnist.py", line 44, in <module>
t.set_description(f"loss: {loss.item():6.2f} test_accuracy: {test_acc:5.2f}%")
^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/tensor.py", line 271, in item
return self._data().cast(self.dtype.fmt)[0]
^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/tensor.py", line 242, in _data
cpu = self.cast(self.dtype.scalar()).contiguous().to("CLANG").realize()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/tensor.py", line 201, in realize
run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 192, in run_schedule
ei.run(var_vals, do_update_stats=do_update_stats)
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 153, in run
et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 116, in __call__
self.copy(dest, src)
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 111, in copy
dest.copyin(src.as_buffer(allow_zero_copy=True)) # may allocate a CPU buffer depending on allow_zero_copy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/device.py", line 110, in as_buffer
if (force_zero_copy or allow_zero_copy) and hasattr(self.allocator, 'as_buffer'): return self.allocator.as_buffer(self._buf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 86, in as_buffer
self.device.synchronize()
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 102, in synchronize
for cbuf in self.mtl_buffers_in_flight: wait_check(cbuf)
^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 12, in wait_check
raise RuntimeError(error)
RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)" UserInfo={NSLocalizedDescription=Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim), NSUnderlyingError=0x600003a4a4f0 {Error Domain=IOGPUCommandQueueErrorDomain Code=5 "(null)"}}
JIT=2 works tho:
JIT=2 python3.11 tinygrad_mnist.py
loss: 0.06 test_accuracy: 98.33%: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [00:06<00:00, 10.65it/s]
For completeness I'm attaching a DEBUG=5 complete log of when this crash is happening:
Traceback (most recent call last):
File "/Users/evilsocket/lab/tinygrad_mnist.py", line 42, in <module>
loss = train_step()
^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/jit.py", line 192, in __call__
for ei in self.jit_cache: ei.run(var_vals, jit=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 153, in run
et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/graph/metal.py", line 72, in __call__
wait_check(command_buffer)
File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 12, in wait_check
raise RuntimeError(error)
RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)" UserInfo={NSLocalizedDescription=Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim), NSUnderlyingError=0x6000012978a0 {Error Domain=IOGPUCommandQueueErrorDomain Code=5 "(null)"}}
This is from a 2021 MacBook Pro, M1 MAX 64GB, macOS 14.5
I can hit RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery)" on M1 Max pretty consistently with WINO=1 PYTHONPATH=. python examples/beautiful_mnist.py. Seems fine with JIT=2. Also, with WINO=1, when it does run it only trains to 95%.
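To compare the settings mentioned in this thread side by side, something like the following could be used. It is only a sketch: the helper names (`build_env`, `run_config`, `compare`) are made up here, and the command to run is whatever script reproduces the problem on your machine. It only reports whether the process exited cleanly, so the low-accuracy case still has to be read off the script's own output.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional

# Environment settings mentioned in this thread; "baseline" sets nothing extra.
CONFIGS: Dict[str, Dict[str, str]] = {
    "baseline": {},
    "LAZYCACHE=0": {"LAZYCACHE": "0"},
    "JIT=2": {"JIT": "2"},
    "WINO=1": {"WINO": "1"},
}

def build_env(overrides: Mapping[str, str],
              base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the base environment (os.environ by default) and apply overrides."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env

def run_config(cmd: List[str], overrides: Mapping[str, str]) -> bool:
    """Run cmd with the given overrides; True means exit code 0 (no crash)."""
    result = subprocess.run(cmd, env=build_env(overrides),
                            capture_output=True, text=True)
    return result.returncode == 0

def compare(cmd: List[str]) -> Dict[str, bool]:
    """Run cmd once per entry in CONFIGS and collect pass/fail per setting."""
    return {name: run_config(cmd, overrides) for name, overrides in CONFIGS.items()}
```

For example, `compare(["python3.11", "tinygrad_mnist.py"])` would run the script four times and return a dict mapping each setting to whether it finished without raising.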