kohya-ss / sd-scripts

Apache License 2.0
returned non-zero exit status 3221225477 problem with Kohya lora training #567

Open Garano11 opened 1 year ago

Garano11 commented 1 year ago

Could the problem be that I have a GTX 1050 Ti with 4 GB of VRAM? Playing with the options to lower VRAM usage does not help. When I change the settings I get the same error, except that the last part changes to "returned non-zero exit status 1".

Screenshot (261)

CalledProcessError: Command '['C:\Python3109\python.exe', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=C:/Users/Maroš/Desktop/mi1-output\img', '--reg_data_dir=C:/Users/Maroš/Desktop/mi1-output\reg', '--resolution=512,512', '--output_dir=C:/Users/Maroš/Desktop/mi1-output\model', '--logging_dir=C:/Users/Maroš/Desktop/mi1-output\log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=8', '--output_name=last', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=152', '--train_batch_size=1', '--max_train_steps=1520', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 3221225477.
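For readers puzzling over the number itself: 3221225477 is the decimal form of 0xC0000005, the Windows NTSTATUS code for an access violation. That suggests the Python process crashed in native code rather than raising an ordinary Python exception, although the code alone does not say why. Below is a minimal sketch of decoding such an exit status; the helper name `describe_exit_status` and the small lookup table are illustrative and not part of sd-scripts.

```python
# A tiny, non-exhaustive table of NTSTATUS values, for illustration only.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000017: "STATUS_NO_MEMORY",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def describe_exit_status(status: int) -> str:
    """Render a process exit status as hex, with an NTSTATUS name if known."""
    # Some tools print the same code as a signed 32-bit value
    # (-1073741819), so normalise to unsigned 32 bits first.
    unsigned = status & 0xFFFFFFFF
    name = NTSTATUS_NAMES.get(unsigned)
    hex_form = f"0x{unsigned:08X}"
    return f"{hex_form} ({name})" if name else hex_form


print(describe_exit_status(3221225477))  # 0xC0000005 (STATUS_ACCESS_VIOLATION)
```

An exit status of 1, by contrast, is the generic failure code a Python script returns after an unhandled exception, so in that case the console output usually contains a traceback.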

kohya-ss commented 1 year ago

I think it is very tough to run training with 4 GB of VRAM. That said, I believe further error logs are printed before or after this message. Could you please share them?

Garano11 commented 1 year ago

The log folder in the output folder is empty. Is there any place where I can find useful logs?

TingTingin commented 1 year ago

He meant the cmd window itself. Were there any more errors shown there?
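The `CalledProcessError` in the report only carries the exit status; the actual error text is whatever the child process wrote to the console. As a rough sketch (not how the kohya_ss GUI launches training), the following runs a child Python process with `subprocess.run`, captures its output, and prints stderr when the call fails. The child command is a stand-in that deliberately exits with status 3, and `run_and_report` is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_report(cmd: list[str]) -> int:
    """Run cmd, print its stderr if it fails, and return the exit status."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as err:
        # err.stderr holds everything the child wrote to stderr before exiting.
        print(f"exit status: {err.returncode}")
        print(f"stderr: {err.stderr.strip()}")
        return err.returncode
    return 0


# Stand-in for the training command: writes a message to stderr and exits with 3.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('simulated failure'); sys.exit(3)",
]
status = run_and_report(child)
```

Note that a native crash such as an access violation may terminate the process before anything is written to stderr, so empty output does not rule out a problem.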

TingTingin commented 1 year ago

Also, if you're trying to save VRAM, enabling gradient checkpointing is something you'd want to do.
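As a sketch of that suggestion: sd-scripts training scripts accept a `--gradient_checkpointing` switch (check `python train_network.py --help` to confirm it for your version). The helper below, `with_gradient_checkpointing`, is hypothetical and simply appends the flag to an argument list like the one in the error message, if it is not already there.

```python
def with_gradient_checkpointing(args: list[str]) -> list[str]:
    """Return a copy of args with --gradient_checkpointing present exactly once."""
    flag = "--gradient_checkpointing"
    return list(args) if flag in args else [*args, flag]


# Shortened version of the argument list from the error message.
base_args = [
    "train_network.py",
    "--mixed_precision=fp16",
    "--cache_latents",
    "--optimizer_type=AdamW8bit",
    "--xformers",
]
print(with_gradient_checkpointing(base_args))
```

Gradient checkpointing trades compute for memory: activations are recomputed during the backward pass instead of being kept in VRAM, so each step is slower but peak usage drops.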

Garano11 commented 1 year ago

The conditions are the same as before, but now I use 2 images instead of 19, and it still prints "Fetching 19 files". I am confused by that.

=============================================================
Modules installed outside the virtual environment were found.
This can cause issues. Please review the installed modules.

You can uninstall all local modules with:

deactivate
pip freeze > uninstall.txt
pip uninstall -y -r uninstall.txt
=============================================================

18:33:44-995157 INFO nVidia toolkit detected
18:33:45-709317 INFO Torch 1.12.1+cu116
18:33:45-725160 INFO Torch backend: nVidia CUDA 11.6 cuDNN 8302
18:33:45-728161 INFO Torch detected GPU: NVIDIA GeForce GTX 1050 Ti VRAM 4096 Arch (6, 1) Cores 6
18:33:45-729161 INFO Verifying requirements
18:33:47-463622 INFO headless: False
18:33:47-466622 INFO Load CSS...

Thanks for being a Gradio user! If you have questions or feedback, please join our Discord server and chat with us: https://discord.gg/feTf9x3ZSB
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
18:34:09-441099 INFO Removing existing directory C:/Users/Maroš/Desktop/abc-output\img/40_mi1 man...
18:34:09-443100 INFO Copy C:/Users/Maroš/Desktop/abc to C:/Users/Maroš/Desktop/abc-output\img/40_mi1 man...
18:34:09-446100 INFO Regularization images directory is missing... not copying regularisation images...
18:34:09-447101 INFO Done creating kohya_ss training folder structure at C:/Users/Maroš/Desktop/abc-output...
18:34:10-990269 INFO Start training LoRA Standard ...
18:34:10-992270 INFO Folder 40_mi1 man: 2 images found
18:34:10-993269 INFO Folder 40_mi1 man: 80 steps
18:34:10-994271 INFO Total steps: 80
18:34:10-995269 INFO Train batch size: 1
18:34:10-996261 INFO Gradient accumulation steps: 1.0
18:34:10-996261 INFO Epoch: 1
18:34:10-997271 INFO Regulatization factor: 1
18:34:10-998270 INFO max_train_steps (80 / 1 / 1.0 1 1) = 80
18:34:10-999271 INFO stop_text_encoder_training = 0
18:34:11-000271 INFO lr_warmup_steps = 8
18:34:11-001271 INFO accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="C:/Users/Maroš/Desktop/abc-output\img" --resolution=512,512 --output_dir="C:/Users/Maroš/Desktop/abc-output\model" --logging_dir="C:/Users/Maroš/Desktop/abc-output\log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-05 --unet_lr=0.0001 --network_dim=8 --output_name="last" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="cosine" --lr_warmup_steps="8" --train_batch_size="1" --max_train_steps="80" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale
prepare tokenizer
Using DreamBooth method.
prepare images.
found directory C:\Users\Maroš\Desktop\abc-output\img\40_mi1 man contains 2 image files
No caption file found for 2 images. Training will continue without captions for these images. If class token exists, it will be used. / 2枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。
C:\Users\Maroš\Desktop\abc-output\img\40_mi1 man\5116.png
C:\Users\Maroš\Desktop\abc-output\img\40_mi1 man\Screenshot_2023-05-28-10-33-15-07_1c337646f29875672b5a61192b9010f9.jpg
80 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: True

[Subset 0 of Dataset 0]
  image_dir: "C:\Users\Maroš\Desktop\abc-output\img\40_mi1 man"
  image_count: 2
  num_repeats: 40
  shuffle_caption: False
  keep_tokens: 0
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1,
  token_warmup_step: 0,
  is_reg: False
  class_tokens: mi1 man
  caption_extension: .caption

[Dataset 0]
loading image sizes.
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 285.60it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (320, 576), count: 40
bucket 1: resolution (384, 576), count: 40
mean ar error (without repeats): 0.07959181806008409
preparing accelerator
Using accelerator 0.15.0 or above.
loading model for process 0/1
load Diffusers pretrained models: runwayml/stable-diffusion-v1-5
safety_checker\model.safetensors not found
Fetching 19 files: 100%|███████████████████████████████████████████████████████████████████████| 19/19 [00:00<?, ?it/s]

Screenshot (268)

CalledProcessError: Command '['C:\Python3109\python.exe', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=C:/Users/Maroš/Desktop/abc-output\img', '--resolution=512,512', '--output_dir=C:/Users/Maroš/Desktop/abc-output\model', '--logging_dir=C:/Users/Maroš/Desktop/abc-output\log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=8', '--output_name=last', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=8', '--train_batch_size=1', '--max_train_steps=80', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 3221225477.

Garano11 commented 1 year ago

I was also able to produce these errors while changing nothing. Maybe I accidentally used Dreambooth instead of Dreambooth LoRA.

Screenshot (263) Screenshot (269) Screenshot (270) Screenshot (272)

kohya-ss commented 1 year ago

Thank you for sharing. It seems that main memory (RAM) runs out before VRAM is even used.

You will need at least 16 GB of main memory. Also, configure Windows to use more virtual memory (a larger page file). If you have 16 GB of main memory and 32 GB of virtual memory, you will be able to proceed to the next step.
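Those thresholds can be turned into a quick pre-flight check. This is only a sketch: `meets_memory_guidance` is a hypothetical helper, the 16 GB and 32 GB figures come from this comment rather than from any documented requirement, and the sizes are passed in by hand (on a real system they could be read with a library such as psutil, which is an assumption and not something used in this thread).

```python
GIB = 1024 ** 3  # bytes per gibibyte


def meets_memory_guidance(ram_bytes: int, virtual_bytes: int,
                          min_ram_gib: int = 16,
                          min_virtual_gib: int = 32) -> bool:
    """Check RAM and virtual memory sizes against the suggested minimums."""
    return (ram_bytes >= min_ram_gib * GIB
            and virtual_bytes >= min_virtual_gib * GIB)


print(meets_memory_guidance(8 * GIB, 16 * GIB))   # False
print(meets_memory_guidance(16 * GIB, 32 * GIB))  # True
```

Operating systems often report slightly less than the nominal installed RAM, so a real check would probably want a small tolerance below the threshold.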

> The conditions are the same as before, but now I use 2 images instead of 19, and it still prints "Fetching 19 files". I am confused by that.

The 19 means that the model repository on Hugging Face has 19 files. It is unrelated to the number of training images.

WillyamBradberry commented 2 months ago

Same issue: returned non-zero exit status 3221225477.

24 GB VRAM, 16 GB RAM, 32 GB virtual memory, Windows 11.

choi-hyeseong commented 2 months ago

Downgrade kohya_ss to an earlier version; that worked for me. I think there is some issue in the latest version. I'm using 8 GB VRAM / SDXL LoRA / Windows 10.

endman100 commented 1 month ago

Same issue when training a FLUX LoRA on the sd3 branch: returned non-zero exit status 3221225477.

GPU: 4090, 24 GB VRAM
RAM: 48 GB
OS: Windows 11