Orenji-Tangerine opened 1 month ago
You should be using torch 2.4.1; kohya recommends 2.4.0 as the minimum.
Have you updated the torch version? Does it work? I am using the same torch version: 2.1.2+cu118.
I have separate environments running different PyTorch versions (like 2.3.1+cu11.8 and 2.3.1+cu12.1), and it seems that, of anything below 2.4.0, only 2.1.2+cu11.8 works without error on commit 77587e0 and earlier; the other pre-2.4.0 versions OOM on my system. I haven't tried 2.4.0 yet. xformers 0.0.28.post1 is out today to match PyTorch 2.4.1 + cu12.4, so I will try that at some point.
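For anyone juggling several environments like this, a quick sanity check of which torch/CUDA/xformers combination is actually active can save a lot of confusion. This is just a minimal check script, not something from the repo:

```python
# Print the torch build, CUDA toolkit version, and xformers version of the
# currently active environment before launching a training run.
import torch

print("torch:", torch.__version__)            # e.g. 2.1.2+cu118 or 2.4.1+cu124
print("CUDA build:", torch.version.cuda)      # e.g. 11.8 or 12.4
print("CUDA available:", torch.cuda.is_available())

try:
    import xformers
    print("xformers:", xformers.__version__)  # should match the torch build, e.g. 0.0.28.post1
except ImportError:
    print("xformers not installed")
```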
I have updated the environment to PyTorch 2.4.1 + cu12.4 with Python 3.11.6. As long as gradient checkpointing is enabled with CPU offloading, there are no OOM issues. I did notice the iteration time increased from 4.xx s/it to 5.2-5.3 s/it (training at 512x512 with the CAME optimizer, memory peak around 14.3GB).
Edit: total training time for 1000 steps (10 photos x 10 repeats x 10 epochs) has increased from 75 mins to 90 mins, though. (Another environment running 2.1.2 + cu11.8 on commit 77587e0 finishes training in 75 mins with the same settings, so I think it is something to do with PyTorch.) (PC: 4060 Ti 16GB + 64GB DDR4)
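The slower s/it with no OOM is the expected trade-off: gradient checkpointing recomputes activations in the backward pass, and CPU offloading moves saved tensors to system RAM, so VRAM peaks drop while each step pays for recompute and PCIe transfers. A rough stand-alone illustration using plain PyTorch APIs (not kohya's actual implementation) might look like this:

```python
# Illustrative sketch of gradient checkpointing combined with CPU offloading
# of saved activations; not kohya's code, just the underlying idea.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# save_on_cpu() packs tensors that autograd keeps for backward into host RAM,
# trading transfer time (higher s/it) for a lower VRAM peak.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    # checkpoint() drops intermediate activations and recomputes them in backward.
    y = checkpoint(block, x, use_reentrant=False)
    loss = y.pow(2).mean()

loss.backward()
```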
My PyTorch version is 2.1.2+cu118; it only works after reverting back to commit 77587e0.