AndreRatzenberger opened this issue 1 month ago

I just noticed that fine-tuning FLUX with the command in the README (dataset image size 512,512, batch size 1) results in 8 GB of shared memory usage.

Is this by design, or did I do something wrong? A week ago I could fine-tune without needing any shared memory.

I also noticed that sampling images during fine-tuning now takes twice as long.

For comparison, this is how it looks with commit 1286e00bb0fc34c296f24b7057777f1c37cf8e11: no shared memory, and twice the speed during training and image sampling.

Win10/4090

Thx
This is the intended behavior. We are speeding up block swap instead of using shared memory. If you find it slower than before, please try increasing the number of swap blocks.
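For context, "block swap" here means streaming transformer blocks between CPU RAM and VRAM during the forward pass. Below is a minimal sketch of the idea, assuming plain PyTorch and blocks that map one tensor to one tensor; the class name, structure, and `blocks_to_swap` parameter are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class BlockSwapSketch(nn.Module):
    """Keep only part of a transformer resident on the GPU; stream the
    remaining blocks in from CPU RAM as the forward pass reaches them."""

    def __init__(self, blocks: nn.ModuleList, blocks_to_swap: int, device: str = "cuda"):
        super().__init__()
        self.blocks = blocks
        self.blocks_to_swap = blocks_to_swap
        self.device = device
        # Offload the swappable blocks; keep the rest on the GPU.
        for i, block in enumerate(self.blocks):
            block.to("cpu" if i < blocks_to_swap else device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            swapped = i < self.blocks_to_swap
            if swapped:
                block.to(self.device)  # copy weights to the GPU just in time
            x = block(x)
            if swapped:
                block.to("cpu")        # evict back to ordinary CPU RAM
        return x
```

A real implementation would overlap the next block's host-to-device copy with the current block's compute (for example on a separate CUDA stream with pinned buffers); the sketch keeps everything synchronous for readability.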
> This is the intended behavior. We are speeding up block swap instead of using shared memory. If you find it slower than before, please try increasing the number of swap blocks.
So shared memory is intended behavior? Wouldn't it be faster to fill up the RAM instead, or is that slower?
> This is the intended behavior. We are speeding up block swap instead of using shared memory. If you find it slower than before, please try increasing the number of swap blocks.
Thanks for the reply! Good to know!
Any additional ideas on how to get the sampling time during training down? It doesn't seem to matter what I do: even if I swap 30 blocks, sampling still takes 2 min per image on a 4090, which is annoying :D
> So shared memory is intended behavior? Wouldn't it be faster to fill up the RAM instead, or is that slower?
PyTorch seems to use shared memory for non-blocking (asynchronous) transfers from CUDA to the CPU, which seem to be slightly faster than synchronous transfers.
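As an illustration of the two transfer modes being compared, here is a minimal sketch using standard PyTorch APIs; the tensor shape is arbitrary, and the link between pinned host memory and the shared-memory usage reported by Windows is the observation from the comment above, not something the code verifies:

```python
import torch

src = torch.randn(1024, 1024, device="cuda")

# Synchronous device-to-host copy: the host waits for the transfer.
cpu_sync = src.to("cpu")

# Asynchronous copy: needs a pinned (page-locked) host buffer so the
# transfer can overlap with GPU work on the current stream.
cpu_buf = torch.empty(src.shape, dtype=src.dtype, device="cpu", pin_memory=True)
cpu_buf.copy_(src, non_blocking=True)

torch.cuda.synchronize()  # ensure the copy finished before reading cpu_buf
```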
> Any additional ideas on how to get the sampling time during training down? It doesn't seem to matter what I do: even if I swap 30 blocks, sampling still takes 2 min per image on a 4090, which is annoying :D
This is certainly annoying, but, for example, when inferencing with 20 steps, 20*30 = 600 blocks are transferred in total. So, to speed it up, we would need to stop transferring some of the blocks, but doing so may cause OOM in some cases.
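To make the trade-off concrete, here is a back-of-the-envelope sketch using the numbers from this comment; the transfer count scales linearly with the number of swapped blocks, while every block kept resident costs VRAM instead:

```python
steps = 20        # sampling steps per image (from the comment above)
swap_blocks = 30  # blocks swapped per step

print(steps * swap_blocks)  # 600 block transfers per sampled image

# Swapping fewer blocks cuts transfers linearly, but each block that
# stays resident on the GPU consumes VRAM -- hence the OOM risk:
for n in (30, 20, 10):
    print(f"{n} swap blocks -> {steps * n} transfers per image")
```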