segmind / segmoe


MoE offloading and code simplification #21

Open gabe56f opened 9 months ago

gabe56f commented 9 months ago

Make model creation a bit more readable and implement an LRU cache-based offload for MoE layers, which lets 4x2 SDXL models run on 12 GB of VRAM.
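Roughly the idea behind the offload (a toy sketch for illustration, not the code in this PR; LRUOffloader and fetch are made-up names): each MoE layer is moved to the GPU when it is needed, and once more than on_device_layers of them are resident, the least recently used ones are pushed back to the CPU.

from collections import OrderedDict

import torch

class LRUOffloader:
    """Keep at most `budget` MoE layers on the GPU, evicting the least recently used to CPU."""

    def __init__(self, budget: int, device: str = "cuda:0"):
        self.budget = budget
        self.device = device
        self.cache = OrderedDict()  # id(module) -> module, ordered by recency of use

    def fetch(self, module: torch.nn.Module) -> torch.nn.Module:
        """Call this from a MoE block's forward pass before running it."""
        key = id(module)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
        else:
            while self.cache and len(self.cache) >= self.budget:
                _, evicted = self.cache.popitem(last=False)  # least recently used
                evicted.to("cpu")
            self.cache[key] = module.to(self.device)
        return module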

Example usage:

from segmoe import SegMoEPipeline as seg
import torch

pipe = seg("SegMoE-4x2-v0", on_device_layers=800)
torch.cuda.empty_cache()  # clear the CUDA cache before generating to see the VRAM savings
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")
Warlord-K commented 9 months ago

Hi @gabe56f, thank you for your contribution. I have checked the loading code and fixed a small bug in the loading of expert models from civit.

But from my tests so far, the inference memory requirement for 4x2 SDXL models is still ~18 GB. Am I missing something here?

gabe56f commented 9 months ago

I don't have access to any higher-capacity GPUs, so I could only test on my 4080 (16 GB), and there I see both performance and VRAM savings - not total memory, just VRAM. Did you torch.cuda.empty_cache() before generating?

(screenshots: "VRAM full" vs. "VRAM isn't full")

Once again, this doesn't reduce the total memory required, just the GPU device memory.

Warlord-K commented 9 months ago

> I don't have access to any higher-capacity GPUs, so I could only test on my 4080 (16 GB), and there I see both performance and VRAM savings - not total memory, just VRAM. Did you torch.cuda.empty_cache() before generating?
>
> Once again, this doesn't reduce the total memory required, just the GPU device memory.

Yeah, I have gotten the memory savings now. I have set the cache to empty after both model loading and inference in the class itself, so it doesn't have to be manually cleared every time.
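Roughly where those calls sit (a placement sketch, not the exact diff; it assumes the class wraps the underlying diffusers pipeline in self.pipe, as the examples above do):

import torch

class SegMoEPipeline:
    def __init__(self, *args, **kwargs):
        ...  # build self.pipe and the MoE layers as before
        torch.cuda.empty_cache()  # release the scratch memory used during loading

    def __call__(self, *args, **kwargs):
        result = self.pipe(*args, **kwargs)
        torch.cuda.empty_cache()  # release cached blocks after each generation
        return result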

Warlord-K commented 9 months ago

I also got inference running at under 8 GB with 400 GPU layers, which is awesome and should enable a lot more users to run it locally. Thanks again for your contribution! I will update the README to highlight the optimized memory usage.

Warlord-K commented 9 months ago

Although the GPU VRAM utilization has decreased, the peak memory requirement is still higher. When I was trying SegMoE-4x2 on this configuration (16 GB A6000 and 64 GB RAM), I constantly got CUDA out-of-memory errors with both 800 and 600 on-device layers. How are you running it?

gabe56f commented 9 months ago

I think the bottleneck from this point on is going to be model loading: loading with on_device_layers > 0 makes RAM and VRAM spike for whatever reason. I ran into this exact issue over at your old project repo, VoltaML fast stable diffusion, and I think it was this change that fixed it.

The other culprit I can think of is the VAE, which is, once again, the diffusers library's problem, since SDXL VAEs need to run at FP32 due to precision issues. That only kicks in at the end of each generation.

If you load using the following, the OOM issues should hopefully be fixed:

from diffusers import AutoencoderKL
from segmoe import SegMoEPipeline as seg
import torch

pipe = seg("SegMoE-4x2-v0", on_device_layers=800)

# set the VAE to the FP16-safe one
pipe.pipe.vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda:0")
# no longer necessary, the pipeline clears the cache itself now
# torch.cuda.empty_cache()

pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")

From my testing, 800 layers on the GPU is around 11 GB and 1200 is around 15 GB, so for SDXL models it's roughly a flat 1 GB per 100 layers. I don't think this has much use for SD1.5 models; maybe if there are some extreme 10x3 merges it'll prove useful for SD1.5 as well.
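As a back-of-the-envelope formula (the ~3 GB base is just what falls out of those two data points, not a separately measured constant):

def estimated_vram_gb(on_device_layers: int, base_gb: float = 3.0) -> float:
    # roughly 1 GB per 100 on-device layers on top of a fixed base
    # (matches the measurements above: 800 -> ~11 GB, 1200 -> ~15 GB)
    return base_gb + on_device_layers / 100

print(estimated_vram_gb(800))   # 11.0
print(estimated_vram_gb(1200))  # 15.0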

gabe56f commented 9 months ago

I've also gone ahead and made it possible to change on_device_layers and scheduler_class without having to reinstantiate the pipeline.

from diffusers import AutoencoderKL
from segmoe import SegMoEPipeline as seg
import torch

# try without offload
pipe = seg("SegMoE-4x2-v0")

# set the VAE to the FP16-safe one
pipe.pipe.vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda:0")
torch.cuda.empty_cache()  # clear the CUDA cache before generating to see the VRAM savings

pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")

# try with offload=400
pipe.on_device_layers = 400  # automatically offloads the MoE blocks to CPU, keeping at most 400 layers on the GPU
torch.cuda.empty_cache()  # clear the CUDA cache so the freed GPU memory is actually released
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_400.png")

# try with offload=1200
pipe.on_device_layers = 1200  # same mechanism, now with a budget of 1200 on-device layers
torch.cuda.empty_cache()  # clear the CUDA cache so the freed GPU memory is actually released
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_1200.png")

# the scheduler can also be swapped without reinstantiating the pipeline
pipe.scheduler_class = "EulerAncestralDiscreteScheduler"
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_1200_euler.png")
Warlord-K commented 8 months ago

@gabe56f Hi, sorry for the delay, but I have been trying to get it to work on a machine with 16 GB VRAM and 64 GB RAM and it still doesn't work. Could you try to run the 2x1 model on Colab and share the notebook with me?

imba-pericia commented 7 months ago

Not sure about 16 GB; free Colab has little RAM, but the 4x2 runs on a Tesla P40 with 22.5 GB of VRAM. A 1.5 6x3 runs on 6 GB of VRAM, slowly.

gabe56f commented 7 months ago

> Not sure about 16 GB; free Colab has little RAM, but the 4x2 runs on a Tesla P40 with 22.5 GB of VRAM. A 1.5 6x3 runs on 6 GB of VRAM, slowly.

Was about to say this; I've been having loads of issues even just loading the models on Colab because of the 12 GB of available RAM...

unet = self.create_empty(cached_folder) seems to be the part with the big RAM spikes; that could be worth looking into.
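If create_empty really is materializing the whole UNet in RAM, accelerate's meta-device loading would be one way around it. A sketch under the assumption that only the architecture is needed at this point and the expert weights are copied in afterwards (create_empty_unet and the subfolder="unet" layout are illustrative):

from accelerate import init_empty_weights
from diffusers import UNet2DConditionModel

def create_empty_unet(cached_folder: str) -> UNet2DConditionModel:
    config = UNet2DConditionModel.load_config(cached_folder, subfolder="unet")
    with init_empty_weights():
        # parameters are created on the meta device, so no RAM is allocated
        # for the weights until they are actually loaded in later
        unet = UNet2DConditionModel.from_config(config)
    return unet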

Warlord-K commented 7 months ago

I am running it on a server with 64 GB RAM and 16 GB VRAM, but it doesn't load. I tried the exact code but it always goes out of memory. Could you try with 2x1 on Colab?