lshqqytiger / stable-diffusion-webui-amdgpu

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Slow/No acceleration on Snapdragon 8gen3 CPUs / SNPE #186

Closed TsengSR closed 7 months ago

TsengSR commented 1 year ago

Is there an existing issue for this?

What happened?

I tried this on the Windows Dev Kit 2023 (aka Project Volterra), which is an ARM device with a Snapdragon 8gen3 CPU that also supports NPU acceleration.

Qualcomm demonstrated Stable Diffusion running on a Snapdragon 8gen2 (the previous-generation SoC), generating 512x512 images with 20 steps in 15 seconds (source: https://www.qualcomm.com/news/onq/2023/02/worlds-first-on-device-demonstration-of-stable-diffusion-on-android).

But running webui-directml takes 6 minutes for a simple 512x512 image with 20 steps, far from the 15 seconds possible on previous-generation hardware (I'd expect at least 10-12 seconds on this hardware). So the hardware is definitely capable of running Stable Diffusion at an acceptable speed.

ONNX also offers SNPE (Snapdragon Neural Processing Engine) and QNN (Qualcomm Neural Network) execution providers.
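As a rough sketch of what using those providers could look like: ONNX Runtime exposes `QNNExecutionProvider` and `SNPEExecutionProvider`, and sessions accept an ordered provider list. The preference order and the fallback logic below are illustrative assumptions, not verified behavior on Windows-on-ARM.

```python
# Hypothetical provider selection for ONNX Runtime on a Snapdragon device.
# Provider names follow ONNX Runtime's naming; whether the QNN/SNPE builds
# are actually available on this device is an assumption.
PREFERRED = [
    "QNNExecutionProvider",   # Qualcomm NPU via QNN
    "SNPEExecutionProvider",  # older Snapdragon NPU path
    "DmlExecutionProvider",   # DirectML (GPU)
    "CPUExecutionProvider",   # always-present fallback
]

def pick_providers(available):
    """Return the preferred providers, in order, that are actually available."""
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]

# With onnxruntime installed this would be used roughly like:
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "unet.onnx", providers=pick_providers(ort.get_available_providers()))
print(pick_providers(["CPUExecutionProvider", "QNNExecutionProvider"]))
```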

Steps to reproduce the problem

  1. Clean checkout of stable-diffusion-webui-directml on a device with a Snapdragon 8gen3 (i.e. the Windows Dev Kit 2023, aka Project Volterra)
  2. Run webui-user.bat and wait for the install to complete
  3. Run any prompt

What should have happened?

Expected the prompt to execute within 10-20 seconds.

Since ONNX also supports SNPE/QNN, I expect this to work here too, as the webui has ONNX support.

Version or Commit where the problem happens

265d626471eacd617321bdb51e50e4b87a7ca82e

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Windows

What device are you running WebUI on?

No response

Cross attention optimization

Automatic

What browsers do you use to access the UI ?

Microsoft Edge

Command Line Arguments

Same results with default (no arguments) and `--lowvram --no-half --no-half-vae --opt-sub-quad-attention --opt-split-attention --opt-split-attention-v1 --disable-nan-check`.

List of extensions

Default ones

Console logs

fatal: No names found, cannot describe anything.
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: ## 1.4.0
Commit hash: 265d626471eacd617321bdb51e50e4b87a7ca82e
Installing requirements
Launching Web UI with arguments: --lowvram --no-half --no-half-vae --opt-sub-quad-attention --opt-split-attention --opt-split-attention-v1 --disable-nan-check
No module 'xformers'. Proceeding without it.
Warning: caught exception 'Torch not compiled with CUDA enabled', memory monitor disabled
Loading weights [6ce0161689] from C:\stable-diffusion-webui-directml\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors
preload_extensions_git_metadata for 7 extensions took 0.00s
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 12.2s (import torch: 3.5s, import gradio: 2.4s, import ldm: 1.1s, other imports: 1.8s, setup codeformer: 0.1s, load scripts: 1.9s, create ui: 0.9s, gradio launch: 0.3s).
Creating model from config: C:\stable-diffusion-webui-directml\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Applying attention optimization: Doggettx... done.
Textual inversion embeddings loaded(0):
Model loaded in 10.1s (load weights from disk: 1.5s, create model: 6.2s, apply weights to model: 1.5s, calculate empty prompt: 0.8s).
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [06:06<00:00, 18.35s/it]
Total progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [05:59<00:00, 17.99s/it]
Applying attention optimization: sdp... done.██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [05:59<00:00, 17.98s/it]

Additional information

No response

lshqqytiger commented 1 year ago

At this time, upstream mainly targets x86, and so does this fork. But I'm interested in running SD on mobile devices. I doubt torch is using the GPU at all, because torch-directml does not have an ARM build.
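The point above can be checked at launch time: a minimal sketch, assuming (as the comment says) that torch-directml only ships x86-64 Windows wheels, so an ARM64 machine silently falls back to CPU. The helper name is hypothetical.

```python
import platform

def directml_supported(machine=None):
    """Hypothetical check: torch-directml is assumed to ship x86-64 wheels
    only, so flag ARM64 machines as unsupported before trying to import it."""
    machine = (machine or platform.machine()).lower()
    return machine in ("amd64", "x86_64")

if not directml_supported():
    print("warning: torch-directml has no ARM build; torch will fall back to CPU")
```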

ClashSAN commented 1 year ago

Lol! Cool experiment!

@TsengSR many of the arguments you are currently using overlap each other.

In the UI settings, disable automatic cross attention optimization. That is what produces "Applying attention optimization: sdp..." in your log.

Please do not use --lowvram and --no-half (full precision); these are too slow.

--opt-sub-quad-attention --opt-split-attention --opt-split-attention-v1

Pick one individually during your testing.
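The testing matrix suggested above can be sketched as one run per flag, never stacked. The `webui-user.bat` invocation is illustrative; only the flag names come from the thread.

```python
# One benchmark run per attention flag, none combined (flag names from
# the issue; the launcher name is just an example).
ATTENTION_FLAGS = [
    "--opt-sub-quad-attention",
    "--opt-split-attention",
    "--opt-split-attention-v1",
]

def benchmark_commands(base="webui-user.bat"):
    """Build one command line per flag so timings can be compared in isolation."""
    return [f"{base} {flag}" for flag in ATTENTION_FLAGS]

for cmd in benchmark_commands():
    print(cmd)
```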

This is not running ONNX.

ClashSAN commented 1 year ago

@lshqqytiger

But I'm interested in running sd on mobile devices.

https://pixlab.io/tiny-dream maybe try playing with this for CPU inference on Android whenever it is released. iOS already has good support.

spcharc commented 8 months ago

Your chip is 8cx gen 3. Its announcement date is Dec 02, 2021.

8 gen 2's announcement date: Nov 16, 2022.

8 gen 3's announcement date: Oct 24, 2023.

TsengSR commented 8 months ago

> Your chip is 8cx gen 3. Its announcement date is Dec 02, 2021.
>
> 8 gen 2's announcement date: Nov 16, 2022.
>
> 8 gen 3's announcement date: Oct 24, 2023.

Seems like an oversight on my end. Still, the SoC has 15 TOPS of NPU acceleration (vs. 30 for the 8gen3) and, of course, also an integrated GPU. In the test above it ran at ~20 sec/it, way off the ~1.3 it/s (20 steps in 15 sec) of the 8gen2 demo. It's clearly not utilizing the SoC and its NPU fully, or at all.
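The gap is easy to quantify from the numbers already in the thread: Qualcomm's demo did 20 steps in 15 seconds, while the console log above shows ~18.35 s/it.

```python
# Worked numbers from the thread: demo = 20 steps in 15 s,
# measured = 18.35 s/it from the tqdm line in the console log.
demo_s_per_it = 15 / 20        # 0.75 s per step, i.e. ~1.33 it/s
measured_s_per_it = 18.35

slowdown = measured_s_per_it / demo_s_per_it
print(f"demo: {1 / demo_s_per_it:.2f} it/s, measured: {1 / measured_s_per_it:.3f} it/s")
print(f"slowdown: ~{slowdown:.0f}x")  # roughly 24x slower than the demo
```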