tinygrad / tinygrad

You like pytorch? You like micrograd? You love tinygrad! ❤️
MIT License

WINO=1 python examples/beautiful_mnist.py has lower test_accuracy #3857

Open chenyuxyz opened 7 months ago

chenyuxyz commented 7 months ago

I can hit RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery)" on M1 Max pretty consistently with WINO=1 PYTHONPATH=. python examples/beautiful_mnist.py. It seems fine with JIT=2. Also, when it does run with WINO=1, it only trains to 95%.

geohot commented 7 months ago

I don't see the crash on M3, but I do see the low accuracy.

chenyuxyz commented 7 months ago

Weird, I rebooted and it no longer crashes. The GPU was likely in a bad state.

Updated the issue to track the low accuracy.

michbogos commented 7 months ago

I've also seen lower accuracy with IMAGE=1.

chenyuxyz commented 6 months ago

WINO and IMAGE are both affected by LAZYCACHE with JIT, and the eval step is not properly captured by the JIT.

As a workaround, either commenting out TinyJit on get_test_acc or setting LAZYCACHE=0 works.
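For anyone who wants to script the comparison, here is a minimal sketch of launching the example with the flags mentioned in this thread (WINO=1, LAZYCACHE=0, optionally JIT=2). The helper names mnist_env and run_mnist are hypothetical and not part of tinygrad; the sketch only sets environment variables and assumes it is run from the root of a tinygrad checkout.

```python
import os
import subprocess
import sys


def mnist_env(base=None, *, wino=True, lazycache=False, jit=None):
    """Return a copy of the environment with the flags discussed in this thread.

    Hypothetical helper: it only builds a dict and does not touch tinygrad.
    """
    env = dict(os.environ if base is None else base)
    env["PYTHONPATH"] = "."          # same as the command in the first comment
    if wino:
        env["WINO"] = "1"            # flag the issue is about
    if not lazycache:
        env["LAZYCACHE"] = "0"       # workaround suggested above
    if jit is not None:
        env["JIT"] = str(jit)        # e.g. JIT=2, reported as fine in the first comment
    return env


def run_mnist(script="examples/beautiful_mnist.py", **flags):
    """Run the example script with the chosen flags and return its exit code."""
    result = subprocess.run([sys.executable, script], env=mnist_env(**flags), check=False)
    return result.returncode
```

Calling run_mnist() from the repository root would correspond to WINO=1 LAZYCACHE=0 PYTHONPATH=. python examples/beautiful_mnist.py, and run_mnist(lazycache=True) would give the original WINO=1 run to compare against.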

evilsocket commented 3 months ago

Can confirm this happens on M1 Max, and it keeps happening after a reboot:

loss:   0.43 test_accuracy:   nan%:  13%|███████████████                                                                                                       | 9/70 [00:02<00:15,  4.00it/s]
Traceback (most recent call last):
  File "/Users/evilsocket/lab/tinygrad_mnist.py", line 43, in <module>
    if i%10 == 9: test_acc = get_test_acc().item()
                             ^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/jit.py", line 153, in __call__
    if len(params:=get_parameters(self.ret)): Tensor.realize(params[0], *params[1:])
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/tensor.py", line 201, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 192, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 153, in run
    et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 116, in __call__
    self.copy(dest, src)
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 111, in copy
    dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/device.py", line 117, in copyin
    self.allocator.copyin(self._buf, mv)
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 88, in copyin
    def copyin(self, dest:Any, src:memoryview): self.as_buffer(dest)[:] = src
                                                ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 86, in as_buffer
    self.device.synchronize()
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 102, in synchronize
    for cbuf in self.mtl_buffers_in_flight: wait_check(cbuf)
                                            ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 12, in wait_check
    raise RuntimeError(error)
RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)" UserInfo={NSLocalizedDescription=Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim), NSUnderlyingError=0x6000014c1470 {Error Domain=IOGPUCommandQueueErrorDomain Code=5 "(null)"}}

Even with LAZYCACHE=0:

LAZYCACHE=0 python3.11 tinygrad_mnist.py
loss:   1.24 test_accuracy:   nan%:   4%|█████                                                                                                                 | 3/70 [00:01<00:38,  1.72it/s]
Traceback (most recent call last):
  File "/Users/evilsocket/lab/tinygrad_mnist.py", line 44, in <module>
    t.set_description(f"loss: {loss.item():6.2f} test_accuracy: {test_acc:5.2f}%")
                               ^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/tensor.py", line 271, in item
    return self._data().cast(self.dtype.fmt)[0]
           ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/tensor.py", line 242, in _data
    cpu = self.cast(self.dtype.scalar()).contiguous().to("CLANG").realize()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/tensor.py", line 201, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 192, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 153, in run
    et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 116, in __call__
    self.copy(dest, src)
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 111, in copy
    dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/device.py", line 110, in as_buffer
    if (force_zero_copy or allow_zero_copy) and hasattr(self.allocator, 'as_buffer'): return self.allocator.as_buffer(self._buf)
                                                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 86, in as_buffer
    self.device.synchronize()
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 102, in synchronize
    for cbuf in self.mtl_buffers_in_flight: wait_check(cbuf)
                                            ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 12, in wait_check
    raise RuntimeError(error)
RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)" UserInfo={NSLocalizedDescription=Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim), NSUnderlyingError=0x600003a4a4f0 {Error Domain=IOGPUCommandQueueErrorDomain Code=5 "(null)"}}

JIT=2 works tho:


JIT=2 python3.11 tinygrad_mnist.py
loss:   0.06 test_accuracy: 98.33%: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [00:06<00:00, 10.65it/s]

evilsocket commented 3 months ago

For completeness, I'm attaching a complete DEBUG=5 log of the crash:

Traceback (most recent call last):
  File "/Users/evilsocket/lab/tinygrad_mnist.py", line 42, in <module>
    loss = train_step()
           ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/jit.py", line 192, in __call__
    for ei in self.jit_cache: ei.run(var_vals, jit=True)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/engine/realize.py", line 153, in run
    et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/graph/metal.py", line 72, in __call__
    wait_check(command_buffer)
  File "/opt/homebrew/lib/python3.11/site-packages/tinygrad/runtime/ops_metal.py", line 12, in wait_check
    raise RuntimeError(error)
RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)" UserInfo={NSLocalizedDescription=Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim), NSUnderlyingError=0x6000012978a0 {Error Domain=IOGPUCommandQueueErrorDomain Code=5 "(null)"}}

This is from a 2021 MacBook Pro, M1 Max, 64GB, macOS 14.5.

debug.log