ml-explore / mlx-examples

Examples in the MLX framework
MIT License

The stable diffusion example encountered an error after being upgraded to version 0.4. #492

Status: Closed (haoliplus closed this issue 8 months ago)

haoliplus commented 8 months ago

Command (mlx-examples commit 47dd6bd17f3cc7ef95672ea16e443e58ce5eb1bf):

python txt2image.py "A photo of an astronaut riding a horse on Mars." --n_images 1 --n_rows 1 --steps 1

Error (full output):

$ python txt2image.py "A photo of an astronaut riding a horse on Mars." --n_images 1 --n_rows 1 --steps 1

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:09<00:00,  9.25s/it]
  0%|                                                                                                                                                                                                                                                   | 0/1 [00:00<?, ?it/s]
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Invalid Resource (00000009:kIOGPUCommandBufferCallbackErrorInvalidResource)
[1]    7474 abort      python txt2image.py "A photo of an astronaut riding a horse on Mars."  1  1
/Users/xxx/.pyenv/versions/3.10.13/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

My environment:

  1. Mac mini, M2 Pro
  2. macOS: 14.3.1
  3. python: 3.10.13
  4. mlx: 0.4.0
  5. clang:
    Apple clang version 15.0.0 (clang-1500.1.0.2.5)
    Target: arm64-apple-darwin23.3.0
    Thread model: posix
    InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
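
A report like the one above can be collected with a short script. This is only a sketch: the helper names (total_ram_gb, package_version, collect_environment) are made up for illustration and are not part of mlx or mlx-examples. It also includes total RAM, which is useful when an out-of-memory condition is suspected.

```python
import os
import platform
from importlib import metadata


def total_ram_gb():
    """Return physical RAM in GiB, or None if the platform does not expose it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are unavailable on this platform.
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return round(pages * page_size / 2**30, 1)


def package_version(name):
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_environment():
    """Gather the details usually requested in a bug report."""
    return {
        "machine": platform.machine(),
        "os": platform.platform(),
        "python": platform.python_version(),
        "mlx": package_version("mlx"),
        "ram_gb": total_ram_gb(),
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into an issue gives the hardware, OS, Python and mlx versions in one place.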

I tried to locate the error by adding log statements and traced it to vae.py:Decoder:__call__. However, when I added another log there to investigate further (a print(x) at the failing line), the error disappeared. Could some variable be getting released too early?
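
One reason a print can change behavior in MLX is that arrays are evaluated lazily: operations only build a graph, and printing an array (or calling mx.eval on it) forces the computation at that point. The sketch below shows that effect only. It is not taken from vae.py, the function name lazy_eval_demo is hypothetical, and whether lazy evaluation has anything to do with this particular crash is an assumption, not something established in this thread.

```python
def lazy_eval_demo():
    """Show where MLX actually computes a value; returns None without mlx."""
    try:
        import mlx.core as mx
    except ImportError:
        # mlx is not installed here, so there is nothing to demonstrate.
        return None

    x = mx.array([1.0, 2.0, 3.0])
    y = x * 2       # only records the operation; nothing is computed yet
    mx.eval(y)      # forces the computation now, as print(y) would
    return y.tolist()


if __name__ == "__main__":
    print(lazy_eval_demo())
```

Inserting an explicit mx.eval at a suspect line has the same forcing effect as a print, without the output, which can help narrow down where a failure depends on evaluation order.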

awni commented 8 months ago

I'm attempting to reproduce the issue. It did not fail with the same command on an M1 Max. How much RAM does your M2 pro have?

awni commented 8 months ago

It did fail on an M1 mini with 8 GB... Presumably it's an OOM issue, but the error message is not helpful.

haoliplus commented 8 months ago

> I'm attempting to reproduce the issue. It did not fail with the same command on an M1 Max. How much RAM does your M2 pro have?

My M2 Pro has 16 GB of memory. This error does not occur with mlx 0.3, so I don't think memory is the main cause. I also have two devices (a 16 GB M2 Air and a 16 GB M2 Pro mini), and the error occurs on both.

awni commented 8 months ago

Hmm, actually I was able to repro the bug in both 0.3 and 0.4. I believe the fix is in https://github.com/ml-explore/mlx/pull/752