kohya-ss / sd-scripts


flux finetune always fills shared memory #1673

Open · AndreRatzenberger opened this issue 1 month ago

AndreRatzenberger commented 1 month ago

I just noticed that fine-tuning FLUX with the command from the README (dataset image size 512,512, batch size 1) results in 8 GB of shared memory being used:

[screenshot: Task Manager showing shared GPU memory usage]

Is this by design, or did I do something wrong? A week ago I could fine-tune without needing any shared memory.

I also noticed that sampling images during fine-tuning now takes twice as long...

That's how it looks with commit 1286e00bb0fc34c296f24b7057777f1c37cf8e11

[screenshot]

No shared memory, and twice the speed for both training and sampling images.

Win10/4090

Thx

kohya-ss commented 1 month ago

This is the intended behavior: block swap has been sped up, at the cost of using shared memory. If you find it slower than before, please try increasing the number of swapped blocks.
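For illustration only, here is a minimal sketch of the general block-swap idea, assuming a generic PyTorch model rather than sd-scripts' actual implementation: only the block currently executing sits in VRAM, while the rest wait in host memory.

```python
# Minimal sketch of block swapping, NOT sd-scripts' actual code.
import torch
import torch.nn as nn

device = torch.device("cuda")

# Stand-ins for the transformer blocks of the model, kept in host memory.
blocks = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(8)]).to("cpu")
x = torch.randn(1, 1024, device=device)

for block in blocks:
    block.to(device)   # upload this block's weights to VRAM
    x = block(x)       # run it on the GPU
    block.to("cpu")    # offload it again so the next block fits
```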

Vigilence commented 1 month ago

This is the intended behavior: block swap has been sped up, at the cost of using shared memory. If you find it slower than before, please try increasing the number of swapped blocks.

So using shared memory is the intended behavior? Wouldn't it be faster to fill up regular RAM instead, or is that actually slower?

cyan2k commented 1 month ago

This is the intended behavior: block swap has been sped up, at the cost of using shared memory. If you find it slower than before, please try increasing the number of swapped blocks.

Thanks for the reply! Good to know!

Any additional ideas on how to bring the sampling time during training down? No matter what I try, even if I swap 30 blocks, sampling still takes 2 min per image on a 4090, which is annoying :D

kohya-ss commented 1 month ago

So using shared memory is the intended behavior? Wouldn't it be faster to fill up regular RAM instead, or is that actually slower?

PyTorch seems to use shared memory for non-blocking (asynchronous) transfers from CUDA to the CPU, which appear to be slightly faster than synchronous transfers.
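As a rough illustration of why that memory shows up: a non-blocking device-to-host copy in PyTorch can only overlap with compute when the destination is a page-locked (pinned) host buffer, and that pinned allocation is presumably what Task Manager reports as shared GPU memory. A minimal sketch:

```python
# Sketch of an asynchronous CUDA -> CPU transfer; attributing the reported
# shared-memory usage to the pinned buffer is an assumption, not confirmed here.
import torch

gpu_tensor = torch.randn(1024, 1024, device="cuda")

# Page-locked ("pinned") host buffer, required for a truly non-blocking copy.
cpu_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True)

cpu_buf.copy_(gpu_tensor, non_blocking=True)  # returns immediately, copy runs async
torch.cuda.synchronize()                      # wait before actually reading cpu_buf
```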

kohya-ss commented 1 month ago

Any additional ideas on how to bring the sampling time during training down? No matter what I try, even if I swap 30 blocks, sampling still takes 2 min per image on a 4090, which is annoying :D

This is certainly annoying, but for example, when running inference with 20 steps, 20*30 = 600 block transfers happen in total. To speed it up, we would need to stop transferring some of the blocks, but doing so may cause OOM in some cases.
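To put those 600 transfers in perspective, here is a back-of-envelope estimate; the block size and bandwidth below are illustrative assumptions, not measured values.

```python
# Rough cost of block swapping during sampling.
# gb_per_block and pcie_gb_per_s are assumed, illustrative numbers.
steps = 20            # denoising steps per sample image
swapped_blocks = 30   # blocks swapped each step
gb_per_block = 0.4    # assumed size of one transformer block, in GB
pcie_gb_per_s = 12.0  # assumed effective host<->device bandwidth

transfers = steps * swapped_blocks                  # 600
seconds = transfers * gb_per_block / pcie_gb_per_s  # time spent just moving weights
print(f"{transfers} block transfers, ~{seconds:.0f} s of pure transfer time")
```

Even with optimistic numbers, this adds tens of seconds per image on top of the actual compute, which is consistent with the slowdown reported above.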