TsengSR closed this issue 7 months ago.
At this time, upstream mainly targets x86, and so does this fork.
But I'm interested in running SD on mobile devices.
I wonder whether torch actually uses the GPU, because torch-directml does not have an ARM build.
Lol! Cool experiment!
@TsengSR many of the arguments you are currently using overlap each other.
In the UI settings, disable automatic cross-attention optimization. It is the setting behind the log line "Applying attention optimization: sdp..."
Please do not use --lowvram or --no-half (full precision). These are too slow.
--opt-sub-quad-attention --opt-split-attention --opt-split-attention-v1
Pick one of these at a time during your testing.
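For example, a minimal launcher sketch that tries exactly one attention flag per run (shown with the Linux launcher for brevity; on Windows, webui-user.bat takes the same COMMANDLINE_ARGS variable):

```shell
# Try exactly one attention flag per run and compare it/s between runs.
# Shown: --opt-sub-quad-attention; swap in --opt-split-attention or
# --opt-split-attention-v1 on the next run. No --lowvram, no --no-half.
export COMMANDLINE_ARGS="--opt-sub-quad-attention"
./webui.sh
```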
This is not running ONNX.
@lshqqytiger
But I'm interested in running SD on mobile devices.
https://pixlab.io/tiny-dream: maybe try playing with this for CPU on Android whenever it is released. iOS already has good support.
Your chip is 8cx gen 3. Its announcement date is Dec 02, 2021.
8 gen 2's announcement date: Nov 16, 2022.
8 gen 3's announcement date: Oct 24, 2023.
Seems like an oversight on my end. Still, the SoC has 15 TOPS of NPU acceleration (vs. 30 TOPS on the 8 Gen 3), plus a GPU of course. Yet in the test above it ran at roughly 20 s/it, way off the 1.3 it/s (20 steps in 15 sec) of the 8 Gen 2 demo. It's clearly not utilizing the SoC and its NPU fully, or at all.
Is there an existing issue for this?
What happened?
I tried this on the Windows Dev Kit 2023 (aka Project Volterra), an ARM device with a Snapdragon 8gen3 CPU that also supports NPU acceleration.
Qualcomm demonstrated Stable Diffusion running on a Snapdragon 8 Gen 2 (previous-generation SoC), generating 512x512 images with 20 steps in 15 seconds (source: https://www.qualcomm.com/news/onq/2023/02/worlds-first-on-device-demonstration-of-stable-diffusion-on-android).
But running webui-directml takes 6 minutes for a simple 512x512 image with 20 steps, way off from the 15 seconds possible on previous-generation hardware (I'd expect at least 10-12 seconds on this hardware). So the hardware is definitely capable of running Stable Diffusion at an acceptable speed.
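A quick back-of-the-envelope check of those numbers (pure arithmetic, no assumptions beyond the figures quoted above):

```python
steps = 20

# Qualcomm's demo: 20 steps in 15 seconds
demo_it_per_s = steps / 15            # ~1.33 it/s

# Observed on the dev kit: about 6 minutes for the same 20 steps
observed_s_per_it = (6 * 60) / steps  # 18 s/it

# Rough slowdown factor versus the demo
slowdown = observed_s_per_it * demo_it_per_s  # ~24x
print(demo_it_per_s, observed_s_per_it, slowdown)
```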
ONNX Runtime also offers SNPE (Snapdragon Neural Processing Engine) and QNN (Qualcomm Neural Network) execution providers.
Steps to reproduce the problem
What should have happened?
Expected the prompt to execute within 10-20 seconds.
Since ONNX Runtime also supports SNPE/QNN, I expect it would work with this fork as well, since it has ONNX support too.
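As a sketch of what that could look like: the provider names below are real ONNX Runtime identifiers, but pick_provider and the commented-out session call are hypothetical illustrations, assuming an onnxruntime build that ships the Qualcomm providers.

```python
# Hypothetical helper: choose the best ONNX Runtime execution provider
# from a preference order, falling back to CPU if nothing better exists.
PREFERRED = [
    "QNNExecutionProvider",   # Qualcomm NPU via QNN
    "SNPEExecutionProvider",  # Qualcomm NPU via SNPE
    "DmlExecutionProvider",   # DirectML (GPU)
    "CPUExecutionProvider",   # last resort
]

def pick_provider(available):
    """Return the first preferred provider that is actually available."""
    for name in PREFERRED:
        if name in available:
            return name
    raise RuntimeError("no usable execution provider")

# With onnxruntime installed, the session would be created like:
# import onnxruntime as ort
# sess = ort.InferenceSession(
#     "unet.onnx",  # hypothetical exported model file
#     providers=[pick_provider(ort.get_available_providers())])
print(pick_provider(["QNNExecutionProvider", "CPUExecutionProvider"]))
```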
Version or Commit where the problem happens
265d626471eacd617321bdb51e50e4b87a7ca82e
What Python version are you running on ?
Python 3.10.x
What platforms do you use to access the UI ?
Windows
What device are you running WebUI on?
No response
Cross attention optimization
Automatic
What browsers do you use to access the UI ?
Microsoft Edge
Command Line Arguments
List of extensions
Default ones
Console logs
Additional information
No response