kohya-ss / sd-scripts

Apache License 2.0

Lora training issues #962

Open Makangx opened 9 months ago

Makangx commented 9 months ago

I am having this issue (some other people on Reddit also have similar issues with training after they updated their kohya). I am going to reinstall/repair Python, since that is the only thing I have not tried. Otherwise, I have tried deleting the venv, changing the kohya directory, deleting kohya and reinstalling it from scratch, and the methods mentioned in #405.

sdxl_train_network.py, train_util.py and train_network.py always come up with errors when I try to train a LoRA. The errors are on different lines each time with a fresh installation: I installed kohya 3 times, and each of the 3 installs gave different errors in sdxl_train_network.py, train_util.py and train_network.py.

Traceback (most recent call last):
  File "G:\kohya_ss\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "G:\kohya_ss\train_network.py", line 342, in train
    optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable_params)
  File "G:\kohya_ss\library\train_util.py", line 3403, in get_optimizer
    value = ast.literal_eval(value)
  File "C:\Users\ZEUS\AppData\Local\Programs\Python\Python310\lib\ast.py", line 110, in literal_eval
    return _convert(node_or_string)
  File "C:\Users\ZEUS\AppData\Local\Programs\Python\Python310\lib\ast.py", line 109, in _convert
    return _convert_signed_num(node)
  File "C:\Users\ZEUS\AppData\Local\Programs\Python\Python310\lib\ast.py", line 83, in _convert_signed_num
    return _convert_num(node)
  File "C:\Users\ZEUS\AppData\Local\Programs\Python\Python310\lib\ast.py", line 74, in _convert_num
    _raise_malformed_node(node)
  File "C:\Users\ZEUS\AppData\Local\Programs\Python\Python310\lib\ast.py", line 71, in _raise_malformed_node
    raise ValueError(msg + f': {node!r}')
ValueError: malformed node or string on line 1: <ast.Name object at 0x000001C6BE29C1C0>
Traceback (most recent call last):
  File "C:\Users\ZEUS\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\ZEUS\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "G:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "G:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "G:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
    simple_launcher(args)
  File "G:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['G:\kohya_ss\venv\Scripts\python.exe', './sdxl_train_network.py',

rockerBOO commented 9 months ago

optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable_params)
ValueError: malformed node or string on line 1: <ast.Name object at 0x000001C6BE29C1C0>

This suggests that something in your optimizer_args is failing to parse. What did you enter for optimizer_args?
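rockerBOO's diagnosis can be checked in isolation. The traceback shows `train_util.get_optimizer` passing each optimizer argument value to `ast.literal_eval`, which only accepts Python literals (numbers, quoted strings, `True`/`False`/`None`, tuples, lists, dicts, sets). The sketch below is a hypothetical stand-in for that step: `parse_optimizer_args` is not a function in sd-scripts, and splitting each `key=value` token on its first `=` is an assumption. It shows that a value such as lowercase `false` is parsed as a bare name (`ast.Name`) and raises the same kind of ValueError as in the report.

```python
import ast


def parse_optimizer_args(tokens):
    """Turn ["key=value", ...] into a kwargs dict.

    Hypothetical re-creation of the literal_eval step named in the
    traceback; this is not the actual sd-scripts code.
    """
    kwargs = {}
    for token in tokens:
        key, value = token.split("=", 1)
        # literal_eval rejects anything that is not a Python literal,
        # including unknown bare words and lowercase true/false/none.
        kwargs[key] = ast.literal_eval(value)
    return kwargs


# Correctly capitalized booleans parse without complaint.
print(parse_optimizer_args(["scale_parameter=False", "relative_step=False", "warmup_init=False"]))
# {'scale_parameter': False, 'relative_step': False, 'warmup_init': False}

# Lowercase "false" is read as a variable name, not a literal.
try:
    parse_optimizer_args(["relative_step=false"])
except ValueError as err:
    # Prints a "malformed node or string ..." message naming a Name node.
    print(err)
```

Under this assumption, the Adafactor arguments Makangx reports using (`scale_parameter=False relative_step=False warmup_init=False`) parse cleanly, so the value that failed would have to come from another run or saved config, for example a lowercase boolean or an unquoted word.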

Makangx commented 9 months ago

I used the arguments "scale_parameter=False relative_step=False warmup_init=False" only once, just before posting the issue here; in my initial tries the field was blank. I have saved JSON config files and used them, but they all have different parameters. Here is what I tried:
- Optimizer: AdamW, AdamW8bit, Adafactor, DAdaptation
- LR scheduler: Cosine, Constant, Constant with Warmup, Adafactor
- Mixed and save precision: bf16 as well as fp16
- Train batch size: 1 and 2 (tried both)
- Cache latents: ON
- Max resolution: 512,512, then 768, then 1024, then back to 512
- Enable buckets: ON at first, later OFF
- Gradient checkpointing: ON
- Memory efficient: ON
- Cross attention: xformers
- Network rank and alpha: at most 64 and 32
- Everything else: default

My card is an RTX 3060 (0% utilization; it was not used by any other application during or before training). I also downloaded older versions of kohya just to check (22.0.0, 22.1.0 and 22.1.1), installed them and tried to train, but no luck. I repaired and reinstalled Python, and I am now downloading updated NVIDIA and CUDA drivers.
I don't know if this will help, but just in case.

rockerBOO commented 9 months ago

If it's erroring, does it keep producing the identical error?

Makangx commented 9 months ago

My first 2 results were decent (no errors at all; training took 3 hours with about 220 photos), and I created 2 decent LoRAs. About a month later I updated my NVIDIA drivers, which is the only thing I changed or updated (I didn't think it would cause any issues). A few days after that I updated LoRA, and now I get all of this. I will try a few more times with different parameters and post the outcomes.

Makangx commented 9 months ago

I forgot to mention: I did not use regularization images in my previous attempts (except for the initial decent LoRAs), but I have now generated some regularization images with SDXL, which I plan to use this time. Is it possible that not using regularization images causes some kind of problem?

rockerBOO commented 9 months ago

It's specifically an optimizer argument, but you'd have to share your arguments together with the same error you showed above before anyone can tell what is causing it.
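Since Makangx has several saved JSON configs with different parameters, a quick pre-flight check can flag which optimizer argument `ast.literal_eval` would reject before training is launched. This is a hypothetical helper, not part of kohya_ss or sd-scripts. The `optimizer_args` key name and the space-separated `key=value` format are assumptions about how the GUI saves that field, so compare against one of your own config files first.

```python
import ast
import json
from pathlib import Path

# Assumption: the key under which the saved GUI config stores the optimizer
# extra arguments. Verify it against one of your own JSON files.
OPTIMIZER_ARGS_KEY = "optimizer_args"

# Lowercase spellings often typed instead of Python's capitalized constants.
_CANONICAL = {"true": "True", "false": "False", "none": "None"}


def find_malformed_optimizer_args(arg_string):
    """Return (token, problem) pairs for tokens ast.literal_eval would reject."""
    problems = []
    for token in arg_string.split():
        if "=" not in token:
            problems.append((token, "missing '=' between key and value"))
            continue
        key, value = token.split("=", 1)
        try:
            ast.literal_eval(value)
        except (ValueError, SyntaxError, TypeError):
            hint = _CANONICAL.get(value.lower())
            if hint is not None:
                problems.append((token, f"use {key}={hint}; Python literals are case-sensitive"))
            else:
                problems.append((token, f"{value!r} is not a Python literal"))
    return problems


def check_config_files(folder):
    """Print every malformed optimizer argument found in folder/*.json."""
    for path in sorted(Path(folder).glob("*.json")):
        config = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(config, dict):
            continue  # not a settings object; skip it
        args = config.get(OPTIMIZER_ARGS_KEY)
        if not isinstance(args, str):
            continue  # key missing or stored in an unexpected format
        for token, problem in find_malformed_optimizer_args(args):
            print(f"{path.name}: {token}: {problem}")
```

Usage: call `check_config_files(...)` with the folder holding your saved configs; a line such as `relative_step=false` would be reported with the suggestion `relative_step=False`. Checking the string directly with `find_malformed_optimizer_args("scale_parameter=False relative_step=False warmup_init=False")` returns an empty list.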