kohya-ss / sd-scripts


flux finetune always fills shared memory #1673

Open · AndreRatzenberger opened this issue 2 weeks ago

AndreRatzenberger commented 2 weeks ago

I just noticed that fine-tuning FLUX with the command from the README (dataset image size 512x512, batch size 1) results in about 8 GB of shared memory being used.

[screenshot]

Is this by design, or did I do something wrong? A week ago I could fine-tune without needing any shared memory.

I also noticed that sampling images during fine-tuning now takes twice as long...

For comparison, this is how it looked with commit 1286e00bb0fc34c296f24b7057777f1c37cf8e11:

[screenshot]

No shared memory, and twice the speed for both training and image sampling.

Win10/4090

Thx

kohya-ss commented 2 weeks ago

This is the intended behavior. Block swapping is sped up by using shared memory. If you find it slower than before, please try increasing the number of blocks to swap.
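For context, here is a minimal conceptual sketch of what block swapping does; it is an illustration only, not sd-scripts' actual implementation, and `SwappedStack` and its arguments are hypothetical names. The last `blocks_to_swap` blocks live on the CPU and are moved to the GPU one at a time, so a larger value trades VRAM for more host/device transfers.

```python
import torch
import torch.nn as nn

class SwappedStack(nn.Module):
    """Conceptual sketch of block swapping (hypothetical, simplified)."""

    def __init__(self, blocks: nn.ModuleList, blocks_to_swap: int, device: str = "cuda"):
        super().__init__()
        self.blocks = blocks
        self.device = device
        # Blocks at or after this index are kept on the CPU between uses.
        self.swap_from = len(blocks) - blocks_to_swap
        for i, blk in enumerate(blocks):
            blk.to(device if i < self.swap_from else "cpu")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, blk in enumerate(self.blocks):
            if i >= self.swap_from:
                blk.to(self.device)   # host -> device just before the block runs
            x = blk(x)
            if i >= self.swap_from:
                blk.to("cpu")         # device -> host afterwards to free VRAM
        return x

if __name__ == "__main__" and torch.cuda.is_available():
    blocks = nn.ModuleList(nn.Linear(512, 512) for _ in range(8))
    model = SwappedStack(blocks, blocks_to_swap=4)
    out = model(torch.randn(2, 512, device="cuda"))
    print(out.shape)
```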

Vigilence commented 1 week ago

> This is the intended behavior. Block swapping is sped up by using shared memory. If you find it slower than before, please try increasing the number of blocks to swap.

So shared memory usage is the intended behavior? Wouldn't it be faster to fill up RAM instead, or is that actually slower?

cyan2k commented 1 week ago

> This is the intended behavior. Block swapping is sped up by using shared memory. If you find it slower than before, please try increasing the number of blocks to swap.

Thanks for the reply! Good to know!

Any additional ideas on how to get sampling time during training down? No matter what I do, even if I swap 30 blocks, sampling still takes 2 minutes per image on a 4090, which is annoying :D

kohya-ss commented 1 week ago

> So shared memory usage is the intended behavior? Wouldn't it be faster to fill up RAM instead, or is that actually slower?

PyTorch appears to use shared memory for non-blocking (asynchronous) transfers from CUDA to the CPU, and these are slightly faster than synchronous transfers.
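As a minimal sketch of that mechanism, using only standard PyTorch APIs (`pin_memory`, `non_blocking=True`, a side CUDA stream): a non-blocking GPU-to-CPU copy needs a pinned (page-locked) host buffer, and on Windows such pinned allocations are likely what Task Manager reports as shared GPU memory. This is an illustration of the general technique, not sd-scripts' code.

```python
import torch

def offload_async(gpu_tensor: torch.Tensor, stream: torch.cuda.Stream) -> torch.Tensor:
    # Pinned destination buffer so the copy can run asynchronously
    # and overlap with compute on the default stream.
    host_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                           device="cpu", pin_memory=True)
    with torch.cuda.stream(stream):
        host_buf.copy_(gpu_tensor, non_blocking=True)  # asynchronous D2H copy
    return host_buf

if __name__ == "__main__" and torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    w = torch.randn(1024, 1024, device="cuda")
    cpu_copy = offload_async(w, copy_stream)
    copy_stream.synchronize()  # wait for the transfer before touching cpu_copy
    print(cpu_copy.mean())
```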

kohya-ss commented 1 week ago

> Any additional ideas on how to get sampling time during training down? No matter what I do, even if I swap 30 blocks, sampling still takes 2 minutes per image on a 4090, which is annoying :D

This is certainly annoying, but, for example, when sampling with 20 steps and 30 swapped blocks, 20 * 30 = 600 block transfers happen in total. So, to speed it up, we would need to stop transferring some of the blocks, but doing so may cause OOM in some cases.
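As a back-of-the-envelope illustration of why those transfers add up (the per-block size here is an assumption for illustration, not a measured value):

```python
# All numbers below are assumptions, not measurements.
steps = 20               # sampling steps
blocks_to_swap = 30      # blocks swapped per step
gb_per_block = 0.5       # assumed average block size in bf16

transfers = steps * blocks_to_swap    # 600 block transfers per sampled image
gb_moved = transfers * gb_per_block   # total data shuffled over PCIe
print(f"{transfers} transfers, roughly {gb_moved:.0f} GB moved per sampled image")
```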