First, let me thank you, @lllyasviel and the whole dev team, for this great product you have made.
I have tested and compared nearly all the open-source Stable Diffusion products out there for 2 months now. I ran them all on an Intel CPU only, so no GPU testing. Here are my results and the issues I encountered while running the tests:
Stable Diffusion Webui: slow rendering, 8 minutes for a 1024x1024 image in 20 steps, and very high virtual memory allocation.
Automatic: a fork based on SD-webui, with the same issues.
Invoke AI: very hard to set up and get running. 2 minutes for a 1024x1024 image in 20 steps, with low memory consumption and low virtual memory allocation. The user interface is not so nice. Outpainting objects takes a lot of trial and error to get decent results, and its ControlNet face swap is not good at all.
Fooocus (v2.1.824): the initial test was slow, 8 minutes for a 1024x1024 image in 20 steps. I had to change a few lines of code to get Fooocus to run on 16 threads. I have more threads available, but 16 seems to be the sweet spot somehow:
++++ (model-management.py)
if args.always_cpu:
    # force 16 threads
    torch.set_num_threads(16)
    cpu_state = CPUState.CPU
++++
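For machines with a different core count, here is a minimal sketch of how the thread count could be picked instead of hard-coded. The helper names `pick_thread_count` and `apply_to_torch` and the default cap of 16 are made up for illustration and are not part of Fooocus; `torch.set_num_threads` is the same PyTorch call used in the patch above.

```python
import os


def pick_thread_count(cap=16):
    """Return the number of threads to use: usable CPUs, limited to `cap`."""
    try:
        # Linux: honours the CPU affinity mask of the current process
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: use the logical CPU count
        usable = os.cpu_count() or 1
    return max(1, min(cap, usable))


def apply_to_torch(n):
    """Tell PyTorch to use n intra-op threads (requires torch to be installed)."""
    import torch  # imported lazily so the helper above works without torch

    torch.set_num_threads(n)
```

Calling `apply_to_torch(pick_thread_count())` in place of the fixed `torch.set_num_threads(16)` would keep the 16-thread cap on big machines and scale down on smaller ones. Whether 16 is the sweet spot on other CPUs is something you would have to measure.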
I also had to install and set up extra memory management tools (MALLOC and ACCELERATE) at system level:
++++ the system config commands are:
# allocator tuning (MALLOC_CONF is the variable jemalloc reads)
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"
# OpenMP thread count, scheduling and pinning to cores 0-15
export OMP_PROC_BIND=CLOSE
export OMP_SCHEDULE=STATIC
export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_NUM_THREADS=16
export GOMP_CPU_AFFINITY="0-15"
export ONEDNN_PRIMITIVE_CACHE_CAPACITY=200
++++
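If you prefer not to export these system-wide, the same settings can be built per launch. This is only a sketch: `build_cpu_env` is a hypothetical helper, it uses the spellings `MALLOC_CONF` and `OMP_NUM_THREADS` that jemalloc and OpenMP actually read, and it assumes the cores you want are numbered 0 to threads-1.

```python
import os


def build_cpu_env(threads=16, base=None):
    """Return a copy of the environment with the CPU tuning variables set."""
    env = dict(os.environ if base is None else base)
    env.update({
        # jemalloc tuning, same values as the export above
        "MALLOC_CONF": (
            "oversize_threshold:1,background_thread:true,metadata_thp:auto,"
            "dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"
        ),
        # OpenMP thread count and pinning, derived from `threads`
        "OMP_NUM_THREADS": str(threads),
        "GOMP_CPU_AFFINITY": f"0-{threads - 1}",  # assumes cores 0..threads-1
        "OMP_PROC_BIND": "CLOSE",
        "OMP_SCHEDULE": "STATIC",
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
        "ONEDNN_PRIMITIVE_CACHE_CAPACITY": "200",
    })
    return env
```

The returned dict can be passed as `env=` to `subprocess.run` when starting the launcher, so the rest of the system keeps its normal settings.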
Then start Fooocus with the following command line:
++++
numactl --all accelerate launch --num_cpu_threads_per_process=16 launch.py --cpu --fp32-vae --fp32-text-enc --disable-xformers --dont-print-server --disable-metadata --use-pytorch-cross-attention
++++
Now Fooocus runs like greased lightning with CPU support only (oh, and some multithreading of course ;-))):
2 minutes for a 1024x1024 image in 20 steps with Juggernaut
under 1 minute for a 1024x1024 image in 8 steps with LCM
1.20 minutes for a 1024x1024 image in 8 steps with TurboVision
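To put the 20-step numbers side by side, here is the per-step arithmetic for the untuned run (8 minutes) and the tuned Juggernaut run (2 minutes). The function name is just for illustration.

```python
def seconds_per_step(total_minutes, steps):
    """Average wall-clock seconds spent on one sampling step."""
    return total_minutes * 60 / steps


before = seconds_per_step(8, 20)  # untuned Fooocus, 20 steps
after = seconds_per_step(2, 20)   # tuned Fooocus with Juggernaut, 20 steps
speedup = before / after

print(f"{before:.0f} s/step -> {after:.0f} s/step ({speedup:.0f}x faster)")
# prints "24 s/step -> 6 s/step (4x faster)"
```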
I've recently installed the latest Fooocus version and had to make the same code changes because it did not use the 16 threads by default. Start this version (2.1.864) with the following command line (assuming you already have MALLOC and ACCELERATE installed at system level):
numactl --all accelerate launch --num_cpu_threads_per_process=16 launch.py --always-cpu --all-in-fp32 --disable-xformers --attention-pytorch --disable-server-log
No issues at all: it runs butter smooth, with the same render times as before. And its outpainting/faceswap is the best I've seen so far. I have to admit I've not tested ComfyUI yet, because I don't like doodling with noodles very much.
Hope this helps anyone who owns a CPU-only machine with a fast, recent CPU (preferably 8 or more threads) and a lot of memory.
You can run Fooocus with render times of 2 minutes or less on a beefy Intel CPU machine.
I can't help Apple or AMD users here, but maybe the solution I've described above will work for them too.
Lastly, here is another piece of advice for all you happy Fooocus users out there. Please be very conservative with installing extra models (checkpoints/LoRAs), because they are very large (several GB each) and will overload your scarce memory. Just sticking with the basic Juggernaut checkpoint, 1 default LoRA and the default models Fooocus loads at startup already takes around 14 GB!
So if you have something like 8 GB of RAM (or VRAM on a GPU), then about 6 GB will go to virtual memory, which is disk-backed and slow to read and write (this gives bad render times). Any extra models you install will drive the memory allocation up sharply; I have monitored it with htop under Linux. Even if you only select 1 checkpoint model for rendering and have, say, 4 extra models in the checkpoint folder that you don't use any more, they all still get loaded into memory at startup. I don't know if this is a bug; maybe the dev team can give an answer here. I've seen memory consumption go through the roof in that situation, 40 GB and sometimes even more with more large models installed. So keep your checkpoint folder tidy and move models you don't use to another folder outside the model directory of Fooocus.
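A rough way to see what is sitting in the checkpoint folder, and how much of a given footprint would not fit in RAM, is sketched below. Both helpers and the list of file suffixes are assumptions for illustration; they measure file sizes on disk, not what Fooocus really allocates at runtime.

```python
from pathlib import Path

GIB = 1024 ** 3


def folder_size_gib(folder, suffixes=(".safetensors", ".ckpt")):
    """Sum the sizes of model files directly inside `folder`, in GiB."""
    total = 0
    for path in Path(folder).expanduser().iterdir():
        if path.is_file() and path.suffix in suffixes:
            # stat() follows symlinks, so linked models count at full size
            total += path.stat().st_size
    return total / GIB


def swap_spill_gb(footprint_gb, ram_gb):
    """Memory that does not fit in RAM and would end up in virtual memory."""
    return max(0, footprint_gb - ram_gb)
```

With the numbers from this post, `swap_spill_gb(14, 8)` gives the 6 GB that ends up on disk, and something like `folder_size_gib("~/Fooocus/models/checkpoints")` shows how much is waiting in the folder before you even start.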
If you want to share models with other Stable Diffusion apps, use the symbolic link command (Linux). The following command creates a symbolic link to the TurboVision checkpoint file so you can select it in Fooocus:
ln -s ~/Fooocus/models/saved/Turbo. ~/Fooocus/models/checkpoints
When you're done rendering and want another model, you can remove the symbolic link with the following command (a plain rm is enough for a link and leaves the original file alone):
rm ~/Fooocus/models/checkpoints/Turbo.
Or just drag and drop it into the trash bin in your file manager.
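The same link/unlink routine can be scripted. This is a sketch with made-up helper names (`link_checkpoint`, `unlink_checkpoint`) and a placeholder file name; the demo runs in a temporary directory so it does not touch a real model folder.

```python
import os
import tempfile


def link_checkpoint(model_file, checkpoints_dir):
    """Create a symlink to model_file inside checkpoints_dir; return its path."""
    link = os.path.join(checkpoints_dir, os.path.basename(model_file))
    os.symlink(os.path.abspath(model_file), link)
    return link


def unlink_checkpoint(link):
    """Remove the link only; refuse to delete anything that is a real file."""
    if not os.path.islink(link):
        raise ValueError(f"{link} is not a symbolic link")
    os.unlink(link)


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as root:
    saved = os.path.join(root, "saved")
    checkpoints = os.path.join(root, "checkpoints")
    os.makedirs(saved)
    os.makedirs(checkpoints)

    model = os.path.join(saved, "example.safetensors")  # placeholder name
    with open(model, "wb") as fh:
        fh.write(b"not a real model")

    link = link_checkpoint(model, checkpoints)
    link_created = os.path.islink(link)

    unlink_checkpoint(link)
    link_gone = not os.path.lexists(link)
    original_survives = os.path.exists(model)
```

The `islink` check is the point of the sketch: it makes it impossible to delete the real multi-GB model by accident when cleaning up the checkpoint folder.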
Thank you for the kind words, much appreciated! I'd like to post the link to this issue in a discussion to keep it there, but will close the issue as there is no actual issue. Thanks again!
Thanks and best regards to you all.