lllyasviel / Fooocus


Black images OR very poor performance on M1 / Silicon MacBook Pro – "invalid value encountered in cast" #2171

Closed daniel-hmc closed 9 months ago

daniel-hmc commented 9 months ago

Read Troubleshoot

[x] I admit that I have read the Troubleshoot before making this issue.

Describe the problem

Fooocus 2.1.864
Python 3.10.13
macOS 12.6.1 <-- CULPRIT! After upgrading to Sonoma (macOS 14.3) I can now generate images at 4 s/it using the GPU (without --always-cpu). :)
Hardware: MacBook Pro 14" 2021, M1 Max CPU, 64 GB RAM

By default (no command line parameters), when I start generating images in Fooocus, the first iterations appear normal, then suddenly the preview turns black and the final output is also a black image. This is 99% reproducible; in very rare cases the image is generated correctly. Performance is good at about 5 s/it.

Workaround

With --always-cpu this black image issue can be avoided and images are generated 100% reliably. However, performance degrades heavily to about 30 s/it.

Debugging attempts

So I tried to debug the black image issue. Here are my findings. I am neither an AI expert nor a Python expert, so I got stuck fairly quickly.

The error message in the log at the moment the first image turns black is:

/path/to/fooocus/core.py:260: RuntimeWarning: invalid value encountered in cast
  x_sample = x_sample.cpu().numpy().clip(0, 255).astype(np.uint8)
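For context: this is the warning NumPy (at least in recent versions) emits when NaNs reach an integer cast, and the NaNs then end up as black pixels. A generic example I put together, not Fooocus code:

import numpy as np

x = np.array([np.nan, 128.0])
print(x.clip(0, 255).astype(np.uint8))
# RuntimeWarning: invalid value encountered in cast
# The NaN maps to an undefined integer value (typically 0), which renders as black.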

So I inserted debug output for x0 (which is the source of x_sample) in core.py before line 257, i.e. before any further processing of the tensor x0:

255     def preview_function(x0, step, total_steps):
256         with torch.no_grad():
new             print(x0)
257             x_sample = x0.to(VAE_approx_model.current_type)
258             x_sample = VAE_approx_model(x_sample) * 127.5 + 127.5
259             x_sample = einops.rearrange(x_sample, 'b c h w -> b h w c')[0]
260             x_sample = x_sample.cpu().numpy().clip(0, 255).astype(np.uint8)
261             return x_sample
263     return preview_function

It turned out that at the moment the first preview image turns black, x0 starts to contain only NaN values:

Normal output immediately before error:


        [[ 6.6036e-01,  8.3003e-01,  8.0386e-01,  ...,  1.9428e-01,
            2.4849e-01,  2.1970e-01],
          [ 8.4505e-01,  9.7208e-01,  7.5401e-01,  ...,  2.8811e-02,
            1.0288e-01, -6.4597e-02],
          [ 8.7645e-01,  1.0089e+00,  8.8829e-01,  ...,  7.1714e-02,
            1.8642e-01,  2.4322e-02],
          ...,
          [ 1.3249e+00,  1.4565e+00,  1.5513e+00,  ...,  4.8976e-01,
            5.7445e-01,  4.0540e-01],
          [ 1.4676e+00,  1.5275e+00,  1.4628e+00,  ...,  7.2510e-01,
            7.5743e-01,  2.6538e-01],
          [ 1.0259e+00,  1.0389e+00,  1.1225e+00,  ...,  5.9361e-01,
            6.1589e-01,  2.6520e-01]]]], device='mps:0')
 13%|███████████████████████▍                                                                                                                                                        | 8/60 [00:32<03:35,  4.14s/it]

First output after the error occurred:


tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

When I tried to find the source of these NaN values, I could not find any place in the whole Fooocus tree where this function preview_function is called. (I'm not a developer... sorry.)

Since the x0 tensor is processed in that function with VAE-related code (I don't know what that is, though), I tried various VAE-related command line parameters:

--vae-in-fp16
--vae-in-fp32
--vae-in-bf16
--vae-in-cpu

but none of them solved the issue. Same result as before.

This is where I got stuck. Maybe someone can pick up here and debug further to find the root cause? I am happy to run any further tests if you tell me which ones.
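In case it helps whoever picks this up: instead of printing the whole tensor, a small check like the following (just a sketch from my side, reusing x0 from the snippet above; debug_check is a name I made up) could log exactly when the NaNs first appear:

import torch

def debug_check(x0, step):
    # count non-finite entries in the latent that preview_function receives
    bad = (~torch.isfinite(x0)).sum().item()
    if bad:
        print(f"step {step}: x0 contains {bad} non-finite values "
              f"(dtype={x0.dtype}, device={x0.device})")

Calling it right at the top of preview_function (next to my print(x0)) should show the first step at which the latent goes bad.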

Further info

...that might or might not be relevant:

From the log output:

Device: mps
vae.dtype: torch.float32

I tried various other command line parameters that did not seem to make any difference. However, I did not try every possible combination, of course:

--disable-offload-from-vram
--always-normal-vram
--all-in-fp32
--all-in-fp16 (this one leads to "RuntimeError: "upsample_nearest2d_channels_last" not implemented for 'Half'")

Full Console Log

Paste full console log here. You will make our job easier if you give a full log.

Okay :)

(base) daniel@MBP6 ~ % conda activate fooocus
(fooocus) daniel@MBP6 ~ % python entry_with_update.py --disable-offload-from-vram --output-path /Users/daniel/Pictures/Lightroom/LR_LOCAL/00_Autoimport_Quelle/KI-generiert
python: can't open file '/Users/daniel/entry_with_update.py': [Errno 2] No such file or directory
(fooocus) daniel@MBP6 ~ % cd fooocus/Fooocus 
(fooocus) daniel@MBP6 Fooocus % python entry_with_update.py --disable-offload-from-vram --output-path /Users/daniel/Pictures/Lightroom/LR_LOCAL/00_Autoimport_Quelle/KI-generiert
Already up-to-date
Update succeeded.
[System ARGV] ['entry_with_update.py', '--disable-offload-from-vram', '--output-path', '/Users/daniel/Pictures/Lightroom/LR_LOCAL/00_Autoimport_Quelle/KI-generiert']
Python 3.10.13 (main, Sep 11 2023, 08:16:02) [Clang 14.0.6 ]
Fooocus version: 2.1.864
Running on local URL:  http://127.0.0.1:7865

To create a public link, set `share=True` in `launch()`.
Total VRAM 65536 MB, total RAM 65536 MB
Set vram state to: SHARED
Device: mps
VAE dtype: torch.float32
Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --attention-split
Refiner unloaded.
model_type EPS
UNet ADM Dimension 2816
Using split attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using split attention in VAE
extra {'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids', 'cond_stage_model.clip_l.logit_scale', 'cond_stage_model.clip_l.text_projection'}
Base model loaded: /Users/daniel/fooocus/Fooocus/models/checkpoints/juggernautXL_v8Rundiffusion.safetensors
Request to load LoRAs [['sd_xl_offset_example-lora_1.0.safetensors', 0.1], ['None', 1.0], ['None', 1.0], ['None', 1.0], ['None', 1.0]] for model [/Users/daniel/fooocus/Fooocus/models/checkpoints/juggernautXL_v8Rundiffusion.safetensors].
Loaded LoRA [/Users/daniel/fooocus/Fooocus/models/loras/sd_xl_offset_example-lora_1.0.safetensors] for UNet [/Users/daniel/fooocus/Fooocus/models/checkpoints/juggernautXL_v8Rundiffusion.safetensors] with 788 keys at weight 0.1.
Fooocus V2 Expansion: Vocab with 642 words.
Fooocus Expansion engine loaded for cpu, use_fp16 = False.
Requested to load SDXLClipModel
Requested to load GPT2LMHeadModel
Loading 2 new models
App started successful. Use the app with http://127.0.0.1:7865/ or 127.0.0.1:7865
[Parameters] Adaptive CFG = 7
[Parameters] Sharpness = 2
[Parameters] ADM Scale = 1.5 : 0.8 : 0.3
[Parameters] CFG = 4.0
[Parameters] Seed = 6814032858962482513
[Parameters] Sampler = dpmpp_2m_sde_gpu - karras
[Parameters] Steps = 60 - 30
[Fooocus] Initializing ...
[Fooocus] Loading models ...
Refiner unloaded.
[Fooocus] Processing prompts ...
[Fooocus] Preparing Fooocus text #1 ...
[Prompt Expansion] duck, intricate, elegant, highly detailed, wonderful colors, sweet, glowing, sharp focus, beautiful, symmetry, thought, iconic, fine, epic, cinematic, colorful, background, illuminated, professional, winning, fair, true, full, composed, innocent, light, atmosphere, great composition, dynamic, lively, detail, set, ambient, vivid, luxurious
[Fooocus] Preparing Fooocus text #2 ...
[Prompt Expansion] duck, intricate, elegant, highly detailed, wonderful colors, sweet, fiery, sharp focus, cute, symmetry, fine, elite, polished, complex, enhanced, loving, caring, generous, pretty, friendly, attractive, colorful, background composed, new, relaxed, beautiful, creative, cool, color, illuminated, dramatic, lovely, unique, focused, extremely
[Fooocus] Encoding positive #1 ...
[Fooocus] Encoding positive #2 ...
[Fooocus] Encoding negative #1 ...
[Fooocus] Encoding negative #2 ...
[Parameters] Denoising Strength = 1.0
[Parameters] Initial Latent shape: Image Space (896, 1152)
Preparation time: 11.37 seconds
[Sampler] refiner_swap_method = joint
[Sampler] sigma_min = 0.0291671771556139, sigma_max = 14.614643096923828
Requested to load SDXL
Loading 1 new model
[Fooocus Model Management] Moving model(s) has taken 2.39 seconds
/Users/daniel/fooocus/Fooocus/ldm_patched/k_diffusion/sampling.py:699: UserWarning: MPS: nonzero op is supported natively starting from macOS 13.0. Falling back on CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Indexing.mm:283.)
  sigma_min, sigma_max = sigmas[sigmas > 0].min(), sigmas.max()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  0%|                                                                                                                                                                            | 0/60 [00:00<?, ?it/s]/Users/daniel/miniconda3/envs/fooocus/lib/python3.10/site-packages/torch/nn/functional.py:4001: UserWarning: MPS: 'nearest' mode upsampling is supported natively starting from macOS 13.0. Falling back on CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/UpSample.mm:255.)
  return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
/Users/daniel/fooocus/Fooocus/modules/anisotropic.py:132: UserWarning: The operator 'aten::std_mean.correction' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
  s, m = torch.std_mean(g, dim=(1, 2, 3), keepdim=True)
  7%|██████████▉                                                                                                                                                         | 4/60 [00:15<03:40,  3.94s/it]/Users/daniel/fooocus/Fooocus/modules/core.py:260: RuntimeWarning: invalid value encountered in cast
  x_sample = x_sample.cpu().numpy().clip(0, 255).astype(np.uint8)
 10%|████████████████▍                                                                                                                                                   | 6/60 [00:23<03:29,  3.89s/it]^CKeyboard interruption in main thread... closing server.
 17%|███████████████████████████▏                                                                                                                                       | 10/60 [00:38<03:10,  3.81s/it]^CTraceback (most recent call last):
  File "/Users/daniel/miniconda3/envs/fooocus/lib/python3.10/site-packages/gradio/blocks.py", line 2199, in block_thread
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/daniel/fooocus/Fooocus/entry_with_update.py", line 46, in <module>
    from launch import *
  File "/Users/daniel/fooocus/Fooocus/launch.py", line 126, in <module>
    from webui import *
  File "/Users/daniel/fooocus/Fooocus/webui.py", line 616, in <module>
    shared.gradio_root.launch(
  File "/Users/daniel/miniconda3/envs/fooocus/lib/python3.10/site-packages/gradio/blocks.py", line 2115, in launch
    self.block_thread()
  File "/Users/daniel/miniconda3/envs/fooocus/lib/python3.10/site-packages/gradio/blocks.py", line 2203, in block_thread
    self.server.close()
  File "/Users/daniel/miniconda3/envs/fooocus/lib/python3.10/site-packages/gradio/networking.py", line 49, in close
    self.thread.join()
  File "/Users/daniel/miniconda3/envs/fooocus/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/Users/daniel/miniconda3/envs/fooocus/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt

(fooocus) daniel@MBP6 Fooocus % 
mashb1t commented 9 months ago

Black images normally hint at fp16/fp32 issues when SDXL produces NaNs during generation / VAE. I don't have the hardware you're using available, only an M1 MacBook Pro 16" 2021 with 32 GB, where image generation is not an issue. I'm fairly certain this is not a general Fooocus issue but an individual one. --vae-in-fp32 should work here, but it seems you've already tried this. Can you please check if you're able to use any other model successfully, e.g. --preset sai (with and without refiner)?
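For illustration (generic PyTorch, nothing Fooocus-specific): fp16 can only represent values up to about 65504, and once an activation overflows to inf, subsequent operations quickly turn it into NaN, which then renders as black:

import torch

x = torch.tensor([70000.0], dtype=torch.float16)
print(x)      # tensor([inf], dtype=torch.float16) -- above the fp16 maximum (~65504)
print(x - x)  # tensor([nan], dtype=torch.float16) -- inf - inf is NaN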

daniel-hmc commented 9 months ago

--preset sai --vae-in-fp32 --> same black image issue.

--preset sai --> same black image issue.

--preset realistic --> same black image issue.

Please enlighten me on the term "refiner". What is it and how can I enable/disable it?

If this is an individual issue, where (in general) would I need to look for the root cause, if not in the Fooocus code? Would the solution be a workaround in the Fooocus code, or how would one deal with such an issue?

mashb1t commented 9 months ago

Refiner: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0 + https://github.com/lllyasviel/Fooocus/discussions/830#discussioncomment-8363727

Basically, it switches the model mid-rendering. This is enabled in the sai preset but not mandatory, just something to keep an eye on.
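A rough conceptual sketch of what that means (hypothetical names, not the actual Fooocus sampling code): the base model denoises the first steps and the refiner takes over for the remaining ones:

def sample_with_refiner(latent, base_step, refiner_step, total_steps, switch_at):
    # base_step / refiner_step stand in for one denoising iteration of each model
    for step in range(total_steps):
        step_fn = base_step if step < switch_at else refiner_step
        latent = step_fn(latent, step)
    return latent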

Please check if --all-in-fp32 helps.

daniel-hmc commented 9 months ago

--all-in-fp32 --> same black image issue

I now tried the above parameter with and without a refiner set in the model tab. Without: juggernautXL. With: model sd-xl-base, refiner sd-xl-refiner. Both variants show the same black image issue.

mashb1t commented 9 months ago

Aw snap, I hoped that would fix it... 😑 There are also related issues, but as far as I know no real solutions, sorry.

cybernet commented 9 months ago

Same here.

Does anyone know how to use https://github.com/apple/ml-stable-diffusion with this project?

daniel-hmc commented 9 months ago

Guess what? Today I upgraded from macOS Monterey 12.6.1 to Sonoma 14.3, and now image generation seems to be reliable on the GPU at high performance, about 4 s/it, just with: python entry_with_update.py --disable-offload-from-vram

So this ticket can be closed I guess.

mashb1t commented 9 months ago

@daniel-hmc great to hear, thank you for your feedback!