ogkalu2 / Merge-Stable-Diffusion-models-without-distortion

Adaptation of the merging method described in the paper - Git Re-Basin: Merging Models modulo Permutation Symmetries (https://arxiv.org/abs/2209.04836) for Stable Diffusion
MIT License
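For context, the method being adapted aligns the two models' weights by permutation before averaging. A minimal, illustrative sketch of weight matching for a single linear layer, assuming scipy is available; this is not the repo's actual implementation:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_and_merge(w_a, w_b, alpha=0.5):
    """Illustrative weight matching for one linear layer.

    Finds the permutation of B's output units that best aligns with A
    (maximum-correlation assignment), applies it, then interpolates.
    Git Re-Basin solves this jointly across all layers of the network.
    """
    corr = w_a @ w_b.T                                  # unit-to-unit similarity
    _, cols = linear_sum_assignment(corr.numpy(), maximize=True)
    perm = torch.as_tensor(cols)
    return (1 - alpha) * w_a + alpha * w_b[perm]        # merge aligned weights

merged = match_and_merge(torch.randn(8, 16), torch.randn(8, 16))
```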

Request for Optimization, fixing issues with CUDA devices #23

Open LumiWasTaken opened 1 year ago

LumiWasTaken commented 1 year ago

Hey there!

I really love the project and the idea behind it.

Sadly, I lack the info to run it properly.

On my device (3060), running via GPU it very quickly hits an OOM issue, maxing out my 12GB VRAM when merging two 2GB models on CUDA.

It's unclear whether the script can handle float16 and float32 mixes, or whether the error "dot function not implemented for 'Half'" is a user / env issue.
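A possible workaround sketch for the 'Half' error, assuming it comes from CPU kernels (torch.dot among them) that aren't implemented for float16: upcast both checkpoints to float32 when loading. The file names and the "state_dict" nesting below are illustrative:

```python
import torch

def load_state_dict_fp32(path):
    """Load a checkpoint and upcast any fp16 tensors to fp32.

    Many CPU ops (torch.dot among them) have no 'Half' kernel, so
    upcasting before merging avoids the reported error.
    """
    ckpt = torch.load(path, map_location="cpu")
    sd = ckpt.get("state_dict", ckpt)  # SD checkpoints often nest under "state_dict"
    return {k: v.float() if torch.is_tensor(v) and v.dtype == torch.float16 else v
            for k, v in sd.items()}

model_a = load_state_dict_fp32("fileA.ckpt")  # illustrative paths
model_b = load_state_dict_fp32("fileB.ckpt")
```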

Could you also fix issues like <class 'KeyError'> 'model_ema.decay' for models that are based on NovelAI or are unpruned?
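One mitigation sketch, assuming the KeyError comes from extra keys (EMA weights and similar state) that unpruned or NovelAI-derived checkpoints carry but the permutation spec doesn't know about: drop them instead of crashing. The known_keys set here is hypothetical:

```python
def strip_extra_keys(sd, known_keys):
    """Keep only the keys the merging code knows how to permute.

    Unpruned checkpoints carry extras such as 'model_ema.decay' that
    raise KeyError when the permutation spec looks them up.
    """
    dropped = [k for k in sd if k not in known_keys]
    if dropped:
        print(f"Dropping {len(dropped)} unknown keys, e.g. {dropped[:3]}")
    return {k: v for k, v in sd.items() if k in known_keys}
```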

I'd also like more info about your current environment.

I have desperately tried to get it working on an RTX 5000, but despite all efforts, every attempt to run it on a GPU hits an OOM issue.

Also, a feature request: save the model every x iterations, so I can compare results; I have found that after a certain iteration count the results get worse than expected. It would also help to rename the default output name "merge.ckpt" to something like "model_a_name_without_ext--model_b_name_without_ext--alpha--xxxiter.ckpt". A sketch of this is below.
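A sketch of the requested checkpointing and naming, assuming a hypothetical merge loop; save_every, merge_iteration, and the args names are illustrative, not the script's actual API:

```python
import os
import torch

def save_intermediate(state_dict, model_a_path, model_b_path, alpha, iteration, out_dir="."):
    """Save a merge snapshot using the naming scheme proposed above."""
    a = os.path.splitext(os.path.basename(model_a_path))[0]
    b = os.path.splitext(os.path.basename(model_b_path))[0]
    name = f"{a}--{b}--{alpha}--{iteration}iter.ckpt"
    torch.save({"state_dict": state_dict}, os.path.join(out_dir, name))
    return name

# Inside the (hypothetical) merge loop:
# for it in range(max_iters):
#     state = merge_iteration(state, it)  # placeholder for one re-basin step
#     if (it + 1) % save_every == 0:
#         save_intermediate(state, args.model_a, args.model_b, args.alpha, it + 1)
```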

LumiWasTaken commented 1 year ago

Also, when running in GPU / CUDA mode, it's common to hit this issue:

<class 'RuntimeError'> Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
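For what it's worth, a minimal reproduction of this class of error, assuming permutation indices built on the CPU while the weights sit on CUDA; the tensor names are illustrative, not the script's:

```python
import torch

weight = torch.randn(320, 320, device="cuda")  # layer weight moved to the GPU
perm = torch.randperm(320)                     # permutation indices built on the CPU

# weight.index_select(0, perm)                 # raises the device-mismatch RuntimeError

permuted = weight.index_select(0, perm.to(weight.device))  # fix: co-locate the index
```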

brucethemoose commented 1 year ago

Yeah, IDK what the memory requirements are, but it maxes out my 16GB of RAM and eats tons of swap.

And I also noticed some models don't work as an "A" input (with the error you described), but will work as a "B" input.

LumiWasTaken commented 1 year ago

> Yeah, IDK what the memory requirements are, but it maxes out my 16GB of RAM and eats tons of swap.
>
> And I also noticed some models don't work as an "A" input (with the error you described), but will work as a "B" input.

talking about VRAM

brucethemoose commented 1 year ago

> > Yeah, IDK what the memory requirements are, but it maxes out my 16GB of RAM and eats tons of swap. And I also noticed some models don't work as an "A" input (with the error you described), but will work as a "B" input.
>
> talking about VRAM

Yeah, but my theory is that if RAM usage is that high, setting the device to GPU will probably require a similar amount of memory.

LumiWasTaken commented 1 year ago

> > > Yeah, IDK what the memory requirements are, but it maxes out my 16GB of RAM and eats tons of swap. And I also noticed some models don't work as an "A" input (with the error you described), but will work as a "B" input.
> >
> > talking about VRAM
>
> Yeah, but my theory is that if RAM usage is that high, setting the device to GPU will probably require a similar amount of memory.

That sounds fair, so using 16GB of RAM as the equivalent estimate is okay.

But I ran it on a GPU with 24GB VRAM, and for testing on an A100 40GB, and it maxed those out again and ran into an error... so there is that issue.

ogkalu2 commented 1 year ago

Hi. Sorry to hear that. Even I'm unsure of the exact requirements at this point. Can you try running commit 93b0e95ca107fec6a1ddf8153543268ff18010b9 and see if it works? I think that one was slower but used fewer resources.

LumiWasTaken commented 1 year ago

> Hi. Sorry to hear that. Even I'm unsure of the exact requirements at this point. Can you try running commit 93b0e95 and see if it works? I think that one was slower but used fewer resources.

In this case, I run into the issue again:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

SD_rebasin_merge.py --model_a "fileA.ckpt" --model_b "fileB.ckpt" --device cuda

ogkalu2 commented 1 year ago

Make your device CPU. It'll still run on CUDA for the parts that it can. If that's what you've been doing, you should also try that with the latest commit.
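That is, with the flags from the command above:

SD_rebasin_merge.py --model_a "fileA.ckpt" --model_b "fileB.ckpt" --device cpu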

LumiWasTaken commented 1 year ago

> Make your device CPU. It'll still run on CUDA for the parts that it can. If that's what you've been doing, you should also try that with the latest commit.

Well, I have seen 0% GPU utilization and 100% CPU in the usage graphs.

brucethemoose commented 1 year ago

Merges are reasonably fast on CPU; that's not really an issue IMO, since they are so infrequent.

But being locked to torch 1.11 because of the CPU requirement kinda is an issue 🤔.

LumiWasTaken commented 1 year ago

> Merges are reasonably fast on CPU; that's not really an issue IMO.
>
> Being locked to torch 1.11 because of the CPU requirement kinda is though 🤔.

It's not really fast for me... and especially when I want to do a larger batch of model merges via a separate script, it's a bit meh.

brucethemoose commented 1 year ago

> > Merges are reasonably fast on CPU; that's not really an issue IMO. Being locked to torch 1.11 because of the CPU requirement kinda is though 🤔.
>
> It's not really fast for me... and especially when I want to do a larger batch of model merges via a separate script, it's a bit meh.

Yeah, but even a mega merge script is still gonna take less than 5 minutes.

In the ML world, that's basically free :P

brucethemoose commented 1 year ago

(For reference, a merge finishes in like 30 seconds on my 8C 4900HS running Linux.)