kijai / ComfyUI-FluxTrainer

Apache License 2.0

New Error After Update: module 'torch' has no attribute 'float8_e4m3fnuz' #47

Open Orenji-Tangerine opened 1 month ago

Orenji-Tangerine commented 1 month ago


My PyTorch version is 2.1.2+cu118. It only works after reverting to commit 77587e0.


kijai commented 1 month ago

You should be using torch 2.4.1; kohya recommends 2.4.0 at minimum.
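A minimal sketch for checking an environment against that recommendation, assuming the error in the title comes from `torch.float8_e4m3fnuz` being missing on older builds. `parse_torch_version`, `meets_minimum` and `has_float8_fnuz` are hypothetical helper names, not part of ComfyUI-FluxTrainer or kohya's scripts.

```python
MIN_TORCH = (2, 4, 0)  # minimum recommended by kohya, per this thread

def parse_torch_version(version: str) -> tuple:
    """Turn a string such as '2.1.2+cu118' into (2, 1, 2)."""
    base = version.split("+", 1)[0]          # drop the local tag, e.g. '+cu118'
    nums = []
    for part in base.split(".")[:3]:
        digits = ""
        for ch in part:                       # keep leading digits only ('0a0' -> '0')
            if not ch.isdigit():
                break
            digits += ch
        nums.append(int(digits) if digits else 0)
    nums += [0] * (3 - len(nums))             # pad '2.4' -> (2, 4, 0)
    return tuple(nums)

def meets_minimum(version: str, minimum: tuple = MIN_TORCH) -> bool:
    """True if the version is at or above the recommended minimum."""
    return parse_torch_version(version) >= minimum

def has_float8_fnuz() -> bool:
    """True if the installed torch exposes torch.float8_e4m3fnuz."""
    try:
        import torch
    except ImportError:
        return False
    return hasattr(torch, "float8_e4m3fnuz")

print(meets_minimum("2.1.2+cu118"))   # False
print(meets_minimum("2.4.1+cu124"))   # True
```

In a real environment you would pass `torch.__version__` to `meets_minimum`; `has_float8_fnuz()` tests for the attribute directly, which is closer to what the error message reports.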

Bellzs commented 1 month ago

Have you updated the torch version? Does it work? I am using the same torch version: 2.1.2+cu118.

Orenji-Tangerine commented 1 month ago


I have separate environments running different PyTorch versions (e.g. 2.3.1+cu118 and 2.3.1+cu121). Among versions below 2.4.0, only 2.1.2+cu118 works without errors, and only for commit 77587e0 and earlier. PyTorch versions before 2.4.0 run out of memory (OOM) on my system. I haven't tried 2.4.0 yet. xformers 0.0.28.post1 came out today to match PyTorch 2.4.1+cu124; I will try it at some point.

Orenji-Tangerine commented 1 month ago

I have updated the environment to PyTorch 2.4.1+cu124 with Python 3.11.6. As long as gradient checkpointing is enabled with CPU offloading, there are no OOM issues. I did notice the time per iteration increased from 4.xx s/it to 5.2-5.3 s/it (training at 512×512 with the CAME optimizer, peak memory around 14.3 GB).

Edit: the total training time for 1000 steps (10 photos × 10 repeats × 10 epochs) has increased from 75 min to 90 min. Another environment running 2.1.2+cu118 at commit 77587e0 finishes training in 75 min with the same settings, so I think it has something to do with PyTorch. (PC: 4060 Ti 16 GB + 64 GB DDR4.)
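As a rough cross-check of those timings, here is a sketch assuming batch size 1, step counting as images × repeats × epochs, and no overhead from sampling, saving or model loading. The value 4.5 s/it stands in for the unreported "4.xx", and the helper names are hypothetical.

```python
import math

def total_steps(images: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    # Assumption: one epoch = images * repeats samples, split into batches
    steps_per_epoch = math.ceil(images * repeats / batch_size)
    return steps_per_epoch * epochs

def estimated_minutes(steps: int, sec_per_it: float) -> float:
    # Pure compute time; ignores sampling, saving and model loading
    return steps * sec_per_it / 60

steps = total_steps(10, 10, 10)
print(steps)                                     # 1000
print(estimated_minutes(steps, 4.5))             # 75.0
print(round(estimated_minutes(steps, 5.3), 1))   # 88.3
```

Under these assumptions, 4.5 s/it gives 75 min and 5.3 s/it about 88 min, in line with the reported 75 and 90 min; the remaining gap is plausibly per-run overhead that this sketch does not model.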