invoke-ai / InvokeAI

Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry-leading WebUI and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

M1 failing with `is not currently supported on the MPS backend...` #262

Closed · smblee closed this issue 2 years ago

smblee commented 2 years ago

Followed the M1 instructions on macOS 12.5 with Python 3.10.4.

.../stable-diffusion/ldm/modules/embedding_manager.py:152: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at  /Users/runner/miniforge3/conda-bld/pytorch-recipe_1660136240338/work/aten/src/ATen/mps/MPSFallback.mm:11.)

and

.../stable-diffusion/ldm/modules/embedding_manager.py", line 155, in forward
    embedded_text[placeholder_idx] = placeholder_embedding
NotImplementedError: The operator 'aten::_index_put_impl_' is not current implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

I could try `PYTORCH_ENABLE_MPS_FALLBACK`, but is that how people are getting around this issue?
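
For what it's worth, a minimal sketch of what trying that would look like, assuming a PyTorch build with MPS support; this is just the env-var approach from the error message, not project code:

```python
import os

# Minimal sketch (not project code): the variable generally has to be set before
# torch is imported, either in the shell (e.g. PYTORCH_ENABLE_MPS_FALLBACK=1 python
# scripts/dream.py ...) or at the very top of the entry script, as below.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch  # imported only after the variable is set

print(torch.backends.mps.is_available())  # sanity check that the MPS backend is visible
```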

tiems90 commented 2 years ago

PYTORCH_ENABLE_MPS_FALLBACK is already set within the environment in Anaconda, and does not seem to cover the operator 'aten::nonzero'.
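
If it helps anyone debugging this, a quick way to check whether the variable is actually visible to the process that runs the scripts (a small hypothetical snippet, not project code):

```python
import os
import torch

# Print what the running process actually sees; if the first line shows None,
# the value set in the conda environment never reached this process.
print("PYTORCH_ENABLE_MPS_FALLBACK =", os.environ.get("PYTORCH_ENABLE_MPS_FALLBACK"))
print("MPS built:    ", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
```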

owen109 commented 2 years ago

> Followed the M1 instructions on macOS 12.5 with Python 3.10.4.
>
> .../stable-diffusion/ldm/modules/embedding_manager.py:152: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/miniforge3/conda-bld/pytorch-recipe_1660136240338/work/aten/src/ATen/mps/MPSFallback.mm:11.)
>
> and
>
> .../stable-diffusion/ldm/modules/embedding_manager.py", line 155, in forward
>     embedded_text[placeholder_idx] = placeholder_embedding
> NotImplementedError: The operator 'aten::_index_put_impl_' is not current implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
>
> I could try `PYTORCH_ENABLE_MPS_FALLBACK`, but is that how people are getting around this issue?

I'm having the same issue; it falls back to using the CPU. Please update if you find a fix.

paul-pro commented 2 years ago

After bumping into the same issue (and finding this thread), I updated PyTorch to the nightly build and it worked.

@lstein @magnusviri, would it make sense to list pytorch-nightly in the environment-mac.yml?

rlaabs commented 2 years ago

Same issue with torch 1.13.0.dev20220901

adelsz commented 2 years ago

I can confirm the issue persists with the latest PyTorch nightly, 1.13.0.dev20220901. It also looks like the aten::nonzero op hasn't been implemented for the MPS backend in PyTorch yet.

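For anyone who wants to check a given nightly quickly, a minimal repro sketch (assuming an MPS-capable build and machine); it should emit the same UserWarning for as long as the op is unimplemented:

```python
import torch

# Tiny standalone repro: nonzero() on an MPS tensor emits the same
# "will fall back to run on the CPU" UserWarning for as long as
# aten::nonzero is missing from the MPS backend.
x = torch.tensor([0, 1, 0, 2], device="mps")
print(x.nonzero())
```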
dustinlacewell commented 2 years ago

@adelsz does this mean that for SD on M1 with MPS it's just a matter of time, or is there a workaround?

thomasaarholt commented 2 years ago

SD on M1 works fine. Use environment-mac.yaml when creating your Python environment with conda/mamba. I am running it right now on my M1 MacBook Pro.

The warning containing aten::nonzero is still present, but the image generation works fine.

0rvar commented 2 years ago

It works, but the warning implies the inference is run on the CPU rather than the GPU.

thomasaarholt commented 2 years ago

Yes, at least for whatever part of the code uses nonzero. My Mac's GPU seems to be under 100% load during calls to SD, however. (See Activity Monitor -> Window -> GPU History.)
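
If anyone wants a crude sanity check that MPS is actually being exercised on their machine, independent of SD itself, a plain matmul timing like the one below should show a clear CPU-vs-MPS gap on an M1. This is just an illustrative sketch; the sizes and iteration counts are arbitrary:

```python
import time
import torch

def bench(device, n=2048, iters=10):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    (a @ b).sum().item()               # warm-up; .item() drains the async MPS queue
    t0 = time.time()
    for _ in range(iters):
        (a @ b).sum().item()           # sync every iteration so the timing is honest
    return (time.time() - t0) / iters

print("cpu  s/matmul:", bench("cpu"))
if torch.backends.mps.is_available():
    print("mps  s/matmul:", bench("mps"))
```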

Namor-Votilav commented 2 years ago

> SD on M1 works fine. Use environment-mac.yaml when creating your Python environment with conda/mamba. I am running it right now on my M1 MacBook Pro.
>
> The warning containing aten::nonzero is still present, but the image generation works fine.

That's not the case for me: I get the warning and it does fall back to the CPU, so generation takes a very long time; I've never seen it get past 20% with 1 iteration and 5 steps. How can I keep it from falling back to the CPU?


M1 MBP 2020

gianpaj commented 2 years ago

Same here. Any tips on how to debug this?

$ git log

commit 751283a2de81bee4bb571fbabe4adb19f1d85b97 (HEAD -> main, origin/main, origin/HEAD)
Author: Kevin Gibbons <bakkot@gmail.com>
Date:   Sat Sep 3 23:34:20 2022 -0700
$ conda info

     active environment : ldm
    active env location : /Users/u/miniconda3/envs/ldm
            shell level : 1
       user config file : /Users/u/.condarc
 populated config files : /Users/u/.condarc
          conda version : 4.12.0
    conda-build version : not installed
         python version : 3.9.12.final.0
       virtual packages : __osx=12.5.1=0
                          __unix=0=0
                          __archspec=1=arm64
       base environment : /Users/u/miniconda3  (writable)
      conda av data dir : /Users/u/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/osx-arm64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-arm64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/u/miniconda3/pkgs
                          /Users/u/.conda/pkgs
       envs directories : /Users/u/miniconda3/envs
                          /Users/u/.conda/envs
               platform : osx-arm64
             user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.12 Darwin/21.6.0 OSX/12.5.1
                UID:GID : 501:20
             netrc file : None
           offline mode : False
$ conda list | grep torch
pytorch                   1.13.0.dev20220903         py3.9_0    pytorch-nightly
pytorch-lightning         1.6.5              pyhd8ed1ab_0    conda-forge
torch-fidelity            0.3.0                    pypi_0    pypi
torchdiffeq               0.2.3                    pypi_0    pypi
torchmetrics              0.9.3              pyhd8ed1ab_0    conda-forge
torchvision               0.14.0.dev20220903        py39_cpu    pytorch-nightly
hardware

```
system_profiler SPSoftwareDataType SPHardwareDataType
Software:
    System Software Overview:
      System Version: macOS 12.5.1 (21G83)
      Kernel Version: Darwin 21.6.0
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Secure Virtual Memory: Enabled
      System Integrity Protection: Enabled
Hardware:
    Hardware Overview:
      Model Name: MacBook Pro
      Model Identifier: MacBookPro18,3
      Chip: Apple M1 Pro
      Total Number of Cores: 10 (8 performance and 2 efficiency)
      Memory: 16 GB
      Activation Lock Status: Enabled
```

Any tips on how to debug this?

thomasaarholt commented 2 years ago

I can't see anything obviously wrong with your log.

I installed using mamba (only because it's faster, but I guess theoretically this could impact it).

I've just run git pull and tried the whole installation process again, starting with conda env create -f environment-mac.yaml.

If you want to try an alternative, I've exported my environment file here. Copy to a file called thomasaarholt_env.yml.

Create a new environment with: conda env create -f thomasaarholt_env.yml (or mamba env ...)

Then I linked (or copied) the model downloaded from huggingface, and ran: python scripts/preload_models.py and

❯ python scripts/dream.py --full_precision # I just tested, and the --full_precision argument doesn't appear necessary
* Initializing, be patient...

>> cuda not available, using device mps
>> Loading model from models/ldm/stable-diffusion-v1/model.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Using slower but more accurate full-precision math (--full_precision)
>> Setting Sampler to k_lms
>> model loaded in 9.38s

* Initialization done! Awaiting your command (-h for help, 'q' to quit)
dream> A monkey hacking into the NSA
underlow commented 2 years ago

@thomasaarholt I have the same problem as the OP. I just tried your thomasaarholt_env.yml

and got:

The following specifications were found to be incompatible with your system:

  - feature:/osx-arm64::__osx==12.4=0
  - feature:/osx-arm64::__unix==0=0
  - feature:|@/osx-arm64::__osx==12.4=0
  - feature:|@/osx-arm64::__unix==0=0
  - ipykernel==6.15.2=pyh736e0ef_0 -> __osx
  - ipykernel==6.15.2=pyh736e0ef_0 -> ipython[version='>=7.23.1'] -> __linux
  - ipykernel==6.15.2=pyh736e0ef_0 -> ipython[version='>=7.23.1'] -> __win
  - ipython==8.4.0=pyhd1c38e8_1 -> __osx
  - ipywidgets==8.0.2=pyhd8ed1ab_0 -> ipykernel[version='>=4.5.1'] -> __linux
  - ipywidgets==8.0.2=pyhd8ed1ab_0 -> ipykernel[version='>=4.5.1'] -> __osx
  - ipywidgets==8.0.2=pyhd8ed1ab_0 -> ipykernel[version='>=4.5.1'] -> __win
  - kornia==0.6.7=pyhd8ed1ab_0 -> pytorch[version='>=1.10'] -> __osx[version='>=11.0']
  - pydeck==0.7.1=pyh6c4a22f_0 -> ipykernel -> __linux
  - pydeck==0.7.1=pyh6c4a22f_0 -> ipykernel -> __osx
  - pydeck==0.7.1=pyh6c4a22f_0 -> ipykernel -> __win
  - pysocks==1.7.1=pyha2e5f31_6 -> __unix
  - pytorch-lightning==1.6.5=pyhd8ed1ab_0 -> pytorch[version='>=1.8'] -> __osx[version='>=11.0']
  - torchmetrics==0.9.3=pyhd8ed1ab_0 -> pytorch[version='>=1.3.1'] -> __osx[version='>=11.0']
  - urllib3==1.26.11=pyhd8ed1ab_0 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __unix
  - urllib3==1.26.11=pyhd8ed1ab_0 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __win

Any idea what I should do with it?

Edit: maybe it's somehow related to the Python version. I have 3.9 and the requirements file specifies 3.10.

I still have the same issue as the OP and have no idea what to do about it.

thomasaarholt commented 2 years ago

The env file should create a python environment with python 3.10. Whatever version you are using before creating the environment shouldn’t matter.

I can recommend trying to use mamba instead of conda. I have experienced different dependency resolution with it before. Try installing mamba in your conda environment according to the instructions, and then try creating the environment using mamba instead of conda.

https://mamba.readthedocs.io/en/latest/installation.html

underlow commented 2 years ago

No luck with mamba

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement k-diffusion==0.0.1 (from versions: none)
ERROR: No matching distribution found for k-diffusion==0.0.1

Edit: I've managed to install all dependencies with mamba. Now it fails with ModuleNotFoundError: No module named 'ldm'

OK, I've won this fight with the modules. I'm still getting that error, though, and generation is extremely slow.

Namor-Votilav commented 2 years ago

Could this performance issue be because I have only 8 GB of RAM? I've seen in multiple discussions that problems occur mostly on M1 machines with less RAM. If so, what can I tweak to improve performance?

underlow commented 2 years ago

I've got 16 GB and generation still takes 5-10 minutes. There are a lot of users reporting 15-30 seconds (at least that's what they say). I was thinking the slowness was somehow related to the fallback to the CPU.

funguy-tech commented 2 years ago

@underlow What version of macOS are you on?

I am on the Ventura beta, and when I updated to the latest build (22A5331f), I saw an immediate 5x performance boost. I am at ~1 s/it on an M1 Pro with 16 GB RAM for standard 512x512 photos (so 30 seconds for 30 steps, etc.). I haven't been able to find any official documentation for why this speed boost would happen, but it's worth exploring.

For comparison, my M1 with 8GB on Monterey is in the 20s/it range.

underlow commented 2 years ago

Latest, but not the beta. I've retried a clean setup several times and it works better now: 3 minutes instead of 15.

MaximilianGaedig commented 2 years ago

> I've got 16 GB and generation still takes 5-10 minutes. There are a lot of users reporting 15-30 seconds (at least that's what they say). I was thinking the slowness was somehow related to the fallback to the CPU.

On an M1? My generations take 3 minutes as well (M1 Pro). With one iteration I get 15 seconds, but the result is nothing useful at all.

underlow commented 2 years ago

> On an M1? My generations take 3 minutes as well (M1 Pro). With one iteration I get 15 seconds, but the result is nothing useful at all.

I've got 3-5 minutes now, but it used to average 15 and sometimes go up to 30. I thought it was somehow related to this error. Looks like it isn't.

system1system2 commented 2 years ago

After following the installation instructions, I have the same warning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1659484612588/work/aten/src/ATen/mps/MPSFallback.mm:11.)

Despite that, generating a 512x512 image with 50 steps and a CFG scale of 7.5 (so all defaults) takes approximately 60 seconds, and the GPU load approaches 85% for the python3.10 process in Activity Monitor during the whole run.

The system is a 16" MBP (2021) with an M1 Pro and 32 GB RAM running macOS Monterey 12.5.1.

I'm not sure how to make the error go away, but at least it works.

Riezebos commented 2 years ago

The source of the warning is ldm/modules/embedding_manager.py, so I think this means that turning the prompt into embeddings is falling back to the CPU. Generating the image could still be happening on the GPU. We can see whether there is a difference once aten::nonzero is checked off in https://github.com/pytorch/pytorch/issues/77764, as mentioned above.

On my MacBook Air M1 16 GB (2020), it takes 3 minutes to generate an image with the default settings of dream.py. That feels too fast to be generating the image on the CPU.

It might be possible to change embedding_manager to use something that has already been implemented for MPS instead of aten::nonzero, but I have no idea how.
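
For what it's worth, here is a rough, untested sketch of that idea: build a boolean mask and use torch.where instead of integer indices, so neither aten::nonzero nor aten::_index_put_impl_ is needed. The function and argument names are hypothetical, this is not the actual embedding_manager.py code, and whether every op involved was already implemented for MPS in the nightlies above would need testing:

```python
import torch

def replace_placeholder(embedded_text, tokenized_text, placeholder_token, placeholder_embedding):
    # Hypothetical helper, not the real embedding_manager.py API.
    # embedded_text: (batch, seq_len, dim), tokenized_text: (batch, seq_len)
    # Build a boolean mask instead of integer indices, so no aten::nonzero /
    # aten::_index_put_impl_ call is needed.
    mask = (tokenized_text == placeholder_token).unsqueeze(-1)  # (batch, seq_len, 1)
    # Broadcast the single placeholder embedding over every masked position.
    return torch.where(mask, placeholder_embedding.to(embedded_text.dtype), embedded_text)
```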

dperlman commented 2 years ago

I am getting that same warning, but images are generating in well under a minute, so I think it is using the GPU. Just one more data point...

ComputingVictor commented 1 week ago

Same warning for M3