
High RAM usage on DirectML #1724

Closed JarekDerp closed 7 months ago

JarekDerp commented 7 months ago

@mashb1t as requested yesterday, I'm pasting the content of the console and the RAM and VRAM usage. I thought it was normal, but you informed me that something might be wrong.

When the generation starts, the CPU usage goes up, RAM gets filled, and then the VRAM slowly fills up as well. When the GPU starts working on the image, the CPU usage goes down, so it looks like some tasks are done by the CPU. (screenshot attached)

After generation is done, the VRAM is still full, but the RAM goes down to about 16-20 GB out of 32 GB.

Here's the console log. I ran it with the "--debug" parameter in case it makes a difference:

[System ARGV] ['C:\\StabilityMatrix-win-x64\\Data\\Packages\\Fooocus\\launch.py', '--preset', 'realistic', '--directml', '--disable-xformers', '--debug']
Loaded preset: C:\StabilityMatrix-win-x64\Data\Packages\Fooocus\presets\realistic.json
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Fooocus version: 2.1.859
Running on local URL:  http://127.0.0.1:7865

To create a public link, set `share=True` in `launch()`.
Using directml with device: 
Total VRAM 1024 MB, total RAM 32637 MB
Set vram state to: NORMAL_VRAM
Always offload VRAM
Device: privateuseone
VAE dtype: torch.float32
Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --attention-split
Refiner unloaded.
model_type EPS
UNet ADM Dimension 2816
Using split attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using split attention in VAE
extra {'cond_stage_model.clip_l.text_projection', 'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids', 'cond_stage_model.clip_l.logit_scale'}
Base model loaded: C:\StabilityMatrix-win-x64\Data\Models\StableDiffusion\realisticStockPhoto_v10.safetensors
Request to load LoRAs [['SDXL_FILM_PHOTOGRAPHY_STYLE_BetaV0.4.safetensors', 0.25], ['None', 1.0], ['None', 1.0], ['None', 1.0], ['None', 1.0]] for model [C:\StabilityMatrix-win-x64\Data\Models\StableDiffusion\realisticStockPhoto_v10.safetensors].
Loaded LoRA [C:\StabilityMatrix-win-x64\Data\Packages\Fooocus\models\loras\SDXL_FILM_PHOTOGRAPHY_STYLE_BetaV0.4.safetensors] for UNet [C:\StabilityMatrix-win-x64\Data\Models\StableDiffusion\realisticStockPhoto_v10.safetensors] with 788 keys at weight 0.25.
Loaded LoRA [C:\StabilityMatrix-win-x64\Data\Packages\Fooocus\models\loras\SDXL_FILM_PHOTOGRAPHY_STYLE_BetaV0.4.safetensors] for CLIP [C:\StabilityMatrix-win-x64\Data\Models\StableDiffusion\realisticStockPhoto_v10.safetensors] with 264 keys at weight 0.25.
Fooocus V2 Expansion: Vocab with 642 words.
Fooocus Expansion engine loaded for cpu, use_fp16 = False.
Requested to load SDXLClipModel
Requested to load GPT2LMHeadModel
Loading 2 new models
[Fooocus Model Management] Moving model(s) has taken 2.31 seconds
App started successful. Use the app with http://127.0.0.1:7865/ or 127.0.0.1:7865
Enter LCM mode.
[Fooocus] Downloading LCM components ...
[Parameters] Adaptive CFG = 1.0
[Parameters] Sharpness = 0.0
[Parameters] ADM Scale = 1.0 : 1.0 : 0.0
[Parameters] CFG = 1.0
[Parameters] Seed = 3904760946643745264
[Parameters] Sampler = lcm - lcm
[Parameters] Steps = 8 - 8
[Fooocus] Initializing ...
[Fooocus] Loading models ...
Refiner unloaded.
Request to load LoRAs [['SDXL_FILM_PHOTOGRAPHY_STYLE_BetaV0.4.safetensors', 0.25], ['None', 1.0], ['None', 1.0], ['None', 1.0], ['None', 1.0], ('sdxl_lcm_lora.safetensors', 1.0)] for model [C:\StabilityMatrix-win-x64\Data\Models\StableDiffusion\realisticStockPhoto_v10.safetensors].
Loaded LoRA [C:\StabilityMatrix-win-x64\Data\Packages\Fooocus\models\loras\SDXL_FILM_PHOTOGRAPHY_STYLE_BetaV0.4.safetensors] for UNet [C:\StabilityMatrix-win-x64\Data\Models\StableDiffusion\realisticStockPhoto_v10.safetensors] with 788 keys at weight 0.25.
Loaded LoRA [C:\StabilityMatrix-win-x64\Data\Packages\Fooocus\models\loras\SDXL_FILM_PHOTOGRAPHY_STYLE_BetaV0.4.safetensors] for CLIP [C:\StabilityMatrix-win-x64\Data\Models\StableDiffusion\realisticStockPhoto_v10.safetensors] with 264 keys at weight 0.25.
Loaded LoRA [C:\StabilityMatrix-win-x64\Data\Packages\Fooocus\models\loras\sdxl_lcm_lora.safetensors] for UNet [C:\StabilityMatrix-win-x64\Data\Models\StableDiffusion\realisticStockPhoto_v10.safetensors] with 788 keys at weight 1.0.
Requested to load SDXLClipModel
Loading 1 new model
unload clone 1
[Fooocus Model Management] Moving model(s) has taken 1.79 seconds
[Fooocus] Processing prompts ...
[Fooocus] Encoding positive #1 ...
[Parameters] Denoising Strength = 1.0
[Parameters] Initial Latent shape: Image Space (1280, 768)
Preparation time: 5.12 seconds
Using lcm scheduler.
[Sampler] refiner_swap_method = joint
[Sampler] sigma_min = 0.39970144629478455, sigma_max = 14.614640235900879
Requested to load SDXL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 14.78 seconds
100%|██████████| 8/8 [00:35<00:00,  4.43s/it]
Requested to load AutoencoderKL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 1.10 seconds
Image generated with private log at: C:\StabilityMatrix-win-x64\Data\Packages\Fooocus\outputs\2024-01-03\log.html
Generating and saving time: 54.74 seconds
Total time: 81.82 seconds

Oh yes, I forgot to mention: relates to https://github.com/lllyasviel/Fooocus/issues/1690. Continuation of https://github.com/lllyasviel/Fooocus/issues/970.
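
For context on the `Device: privateuseone` and the fixed-looking `Total VRAM 1024 MB` lines in the log: DirectML doesn't expose real memory totals to PyTorch, so the backend appears to fall back to a placeholder value. A minimal sketch of how a DirectML device is queried, assuming a recent `torch-directml` build is installed:

```python
# A minimal sketch (not from the Fooocus codebase) of how a DirectML device
# is obtained in PyTorch. Assumes the torch-directml package is installed
# (pip install torch-directml).
import torch
import torch_directml

dml = torch_directml.device()           # default adapter, shows up as privateuseone:0
print(torch_directml.device_count())    # number of DirectML-capable adapters
print(torch_directml.device_name(0))    # adapter name, e.g. the AMD GPU

x = torch.ones(4, device=dml)           # tensors allocate on the adapter
print(x.device)                         # -> privateuseone:0
```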

mashb1t commented 7 months ago

Thank you for the feedback and insights. Relates to #970; my intention was that you post this in https://github.com/lllyasviel/Fooocus/issues/1690 so people with similar problems can connect ^^

I can't see an indication of the models being loaded multiple times; it looks OK so far, though still typical for AMD, with high resource usage compared to Nvidia GPUs. The VRAM usage in the 2nd generation is also fine, as the model isn't unloaded between images and, depending on parameters, then stays in VRAM. Maybe DirectML causes the model to stay loaded, but normally it gets unloaded once all images in a batch are done. That does seem a bit off, since afaik the unloading/freeing of memory is handled by general, non-vendor-specific code.

I sadly don't have an AMD GPU available to test, so could you run an additional test and check whether the model is correctly unloaded when switching models (at least briefly), or whether they just stack up in VRAM/RAM, which would indicate a memory leak? Thank you again for the analysis, much appreciated.
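
A minimal external watcher sketch one could use for that test (not part of Fooocus; assumes `psutil` is installed): run it in a second terminal and watch whether the Fooocus process's memory keeps climbing across model switches.

```python
# External RAM watcher sketch for spotting leak-like behaviour.
# Assumes psutil is installed (pip install psutil).
import time
import psutil

def watch(name_substr: str = "python", interval: float = 2.0) -> None:
    """Print the combined RSS of all matching processes every `interval` seconds."""
    while True:
        total = 0
        for p in psutil.process_iter(["name"]):
            try:
                if name_substr in (p.info["name"] or "").lower():
                    total += p.memory_info().rss
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue  # process exited or is protected; skip it
        print(f"RSS for '{name_substr}': {total / 1024**3:.2f} GiB")
        time.sleep(interval)

if __name__ == "__main__":
    watch()  # a number that climbs steadily across model switches would suggest a leak
```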

JarekDerp commented 7 months ago

I think it's working fine as well.

  1. Generating the first image takes about 90 s. Then running with "Image Number: 2", the two images one after another take 120 s in total. So that looks normal.
  2. I changed the checkpoint to another one. It thrashed memory, as 32 GB of RAM was not enough: a bit on the first image and then quite a lot on the second. The computer became somewhat unresponsive from time to time, but the 2 images completed in 140 s total.
  3. I started a generation and stopped it after a couple of seconds. RAM filled up to 31 GB and stayed there. I then started a generation of 3 images and it resolved fine, with no memory thrashing this time. All 3 images were generated without problems, in about 150 s total.

So overall I think it's working fine. There are some memory problems when switching models, but nothing machine- or app-breaking, at least on my end. Still, I think people with 16 GB of RAM or less would struggle with a Windows + AMD + DirectML configuration. It's up to you to decide whether that's worth mentioning.

Thanks for the great work, and many successes in the future!

JarekDerp commented 7 months ago

I just got a random warning in the log while generating one image.

loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 15.42 seconds
  0%|          | 0/30 [00:00<?, ?it/s]C:\StabilityMatrix-win-x64\Data\Packages\Fooocus\modules\anisotropic.py:132: UserWarning: The operator 'aten::std_mean.correction' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a\_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)
  s, m = torch.std_mean(g, dim=(1, 2, 3), keepdim=True)
  7%|▋         | 2/30 [00:26<05:51, 12.54s/it]

But the generation speed seems quite normal, so this is just a heads-up.
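
In case that CPU fallback ever becomes a bottleneck: the missing piece is only the fused `aten::std_mean` kernel, so a possible (untested on DML) workaround would be patching that line in `modules/anisotropic.py` to compute the two statistics from basic ops that should stay on the device. A hedged sketch, matching `torch.std_mean`'s default unbiased variance:

```python
# A hedged, untested sketch of a drop-in for the line in anisotropic.py:
#     s, m = torch.std_mean(g, dim=(1, 2, 3), keepdim=True)
# built only from basic ops, to avoid the fused aten::std_mean kernel that
# falls back to CPU on DirectML.
import torch

def std_mean_unfused(g: torch.Tensor):
    m = g.mean(dim=(1, 2, 3), keepdim=True)
    n = g.shape[1] * g.shape[2] * g.shape[3]   # elements reduced per sample
    # unbiased variance (correction=1), matching torch.std_mean's default
    var = ((g - m) ** 2).sum(dim=(1, 2, 3), keepdim=True) / (n - 1)
    return var.sqrt(), m

# sanity check on CPU against the fused op:
g = torch.randn(2, 4, 8, 8)
s_ref, m_ref = torch.std_mean(g, dim=(1, 2, 3), keepdim=True)
s, m = std_mean_unfused(g)
assert torch.allclose(s, s_ref, atol=1e-5) and torch.allclose(m, m_ref, atol=1e-5)
```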

Handpuppe commented 4 months ago

I have the same issue with the AMD 7900 XTX. Even when I am done generating an image and Fooocus isn't doing anything, my 24 GB of VRAM stays maxed out.

JarekDerp commented 4 months ago

That's how DirectML works. It doesn't offload the model from VRAM to RAM when idle.
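
For contrast, here's a sketch of the generic PyTorch pattern for releasing accelerator memory (not Fooocus's actual model-management code); the catch is that `torch-directml` has, to my knowledge, no documented equivalent of `torch.cuda.empty_cache()`, which would explain why the VRAM stays claimed until the process exits:

```python
# Generic PyTorch memory-release pattern. On CUDA the last call returns
# cached blocks to the driver; torch-directml has no documented equivalent
# (to my knowledge), so on DML the VRAM tends to stay claimed.
import gc
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(4096, 4096).to(device)   # stand-in for a real UNet

model.to("cpu")   # move weights off the accelerator
del model         # drop the last Python reference so GC can free the tensors
gc.collect()      # collect any reference cycles still holding them alive
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # return cached blocks to the driver (CUDA only)
```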

JarekDerp commented 1 month ago

The main issue here is that DirectML is a low-effort workaround from Microsoft to make Stable Diffusion work on AMD cards. Either Microsoft would have to put some work into DirectML, or AMD would have to make generation work on their cards without any workarounds. IMO, if you have an AMD card you have three choices:

  1. Switch to a Linux-based system (free)
  2. Change your card to an Nvidia one (expensive, and it fills Nvidia's pockets even more)
  3. Stop generating locally and rent a virtual PC with a decent card (a subscription)

I have a 12 GB VRAM AMD card and I don't expect to be able to run anything other than a pruned SD 1.5 model.