kohya-ss / sd-scripts

Apache License 2.0
5.18k stars · 859 forks

No clue how to fix this #550

Open VoidButterfly opened 1 year ago

VoidButterfly commented 1 year ago

I started to train a LoRA in kohya_ss yesterday and it started up fine, but it was taking too long to finish, so I decided to end it prematurely and redo it overnight. But last night, when I tried to run it, gui.bat wouldn't open and I couldn't start it up again. So I decided to reinstall by deleting the kohya_ss-master folder and running setup.bat. It now opens fine, but when I start my training I get this error in my cmd terminal:

System Information: System: Windows, Release: 10, Version: 10.0.19045, Machine: AMD64, Processor: AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD

Python Information: Version: 3.10.6, Implementation: CPython, Compiler: MSC v.1932 64 bit (AMD64)

Virtual Environment Information: Path: D:\Kohya\kohya_ss-master\kohya_ss-master\venv

GPU Information: Name: NVIDIA GeForce GTX 1660 SUPER, VRAM: 6144 MiB
Warning: GPU VRAM is less than 8GB and will likely result in improper operation.

Validating that requirements are satisfied. All requirements satisfied. headless: False Load CSS... Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Folder 100_MaisieWilliams : 1400 steps
max_train_steps = 14000
stop_text_encoder_training = 0
lr_warmup_steps = 1400
accelerate launch --num_cpu_threads_per_process=2 "train_db.py" --enable_bucket --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="D:/LoraTraing/Maisie Williams/images" --resolution=512,512 --output_dir="D:/LoraTraing/Maisie Williams/model" --logging_dir="D:/LoraTraing/Maisie Williams/log" --save_model_as=safetensors --output_name="last" --max_data_loader_n_workers="0" --learning_rate="1e-05" --lr_scheduler="cosine" --lr_warmup_steps="1400" --train_batch_size="1" --max_train_steps="14000" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale
prepare tokenizer
prepare images.
found directory D:\LoraTraing\Maisie Williams\images\100_MaisieWilliams contains 14 image files
1400 train images with repeating.
0 reg images.
no regularization images found
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: True
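The step counts in the log follow from the folder name: the `100_` prefix means 100 repeats per image, so 14 images give 1400 steps per epoch, and 10 epochs give the 14000 `max_train_steps` shown. A quick check of that arithmetic:

```python
# Reproduce the step arithmetic from the log above.
images = 14          # image files found in 100_MaisieWilliams
repeats = 100        # the "100_" folder-name prefix
batch_size = 1
epochs = 10

steps_per_epoch = images * repeats // batch_size
max_train_steps = steps_per_epoch * epochs
print(steps_per_epoch, max_train_steps)  # 1400 14000
```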

[Subset 0 of Dataset 0]
  image_dir: "D:\LoraTraing\Maisie Williams\images\100_MaisieWilliams"
  image_count: 14
  num_repeats: 100
  shuffle_caption: False
  keep_tokens: 0
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1
  token_warmup_step: 0
  is_reg: False
  class_tokens: MaisieWilliams
  caption_extension: .caption

[Dataset 0]
loading image sizes.
100%|█████████████████████████████████████████████| 14/14 [00:00<00:00, 103.94it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically
number of images (including repeats) per bucket:
bucket 0: resolution (384, 512), count: 100
bucket 1: resolution (384, 576), count: 300
bucket 2: resolution (448, 384), count: 100
bucket 3: resolution (448, 576), count: 100
bucket 4: resolution (512, 512), count: 400
bucket 5: resolution (576, 384), count: 200
bucket 6: resolution (640, 384), count: 200
mean ar error (without repeats): 0.042837787686754406
prepare accelerator
D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
  warnings.warn(
Using accelerator 0.15.0 or above.
loading model for process 0/1
load Diffusers pretrained models: runwayml/stable-diffusion-v1-5
safety_checker\model.safetensors not found
Fetching 19 files: 100%|█████████████████████████| 19/19 [00:00<00:00, 9491.64it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide by the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend keeping the safety filter enabled in all public-facing circumstances, disabling it only for use cases that involve analyzing network behavior or auditing its results.
For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
CrossAttention.forward has been replaced to enable xformers.
[Dataset 0]
caching latents.
100%|██████████████████████████████████████████████| 14/14 [00:18<00:00, 1.33s/it]
prepare optimizer, data loader etc.

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

CUDA SETUP: Loading binary D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
running training
  num train images * repeats: 1400
  num reg images: 0
  num batches per epoch: 1400
  num epochs: 10
  batch size per device: 1
  total train batch size (with parallel & distributed & accumulation): 1
  gradient accumulation steps: 1
  total optimization steps: 14000
steps:   0%|          | 0/14000 [00:00<?, ?it/s]
epoch 1/10
Traceback (most recent call last):
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\train_db.py", line 477, in <module>
    train(args)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\train_db.py", line 314, in train
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\accelerate\utils\operations.py", line 495, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\torch\amp\autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\diffusers\models\unet_2d_condition.py", line 415, in forward
    sample = upsample_block(
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\diffusers\models\unet_2d_blocks.py", line 1281, in forward
    hidden_states = upsampler(hidden_states, upsample_size)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\diffusers\models\resnet.py", line 139, in forward
    hidden_states = self.conv(hidden_states)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 6.00 GiB total capacity; 5.27 GiB already allocated; 0 bytes free; 5.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|          | 0/14000 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\SLC33\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\SLC33\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\accelerate\commands\launch.py", line 923, in launch_command
    simple_launcher(args)
  File "D:\Kohya\kohya_ss-master\kohya_ss-master\venv\lib\site-packages\accelerate\commands\launch.py", line 579, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\Kohya\kohya_ss-master\kohya_ss-master\venv\Scripts\python.exe', 'train_db.py', '--enable_bucket', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=D:/LoraTraing/Maisie Williams/images', '--resolution=512,512', '--output_dir=D:/LoraTraing/Maisie Williams/model', '--logging_dir=D:/LoraTraing/Maisie Williams/log', '--save_model_as=safetensors', '--output_name=last', '--max_data_loader_n_workers=0', '--learning_rate=1e-05', '--lr_scheduler=cosine', '--lr_warmup_steps=1400', '--train_batch_size=1', '--max_train_steps=14000', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
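The RuntimeError in the log explicitly suggests setting `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` to reduce allocator fragmentation. A minimal sketch of acting on that hint; the value 128 MiB is an illustrative assumption, not a recommendation from this thread, and the variable must be set before PyTorch initializes CUDA:

```python
# Sketch: enable the allocator fragmentation workaround named in the error.
# The 128 MiB split size is an illustrative assumption; this only helps when
# reserved memory is much larger than allocated memory, as the message says.
import os

# Must be set before `import torch` (or before CUDA is first used).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

On Windows the same effect comes from running `set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` in the terminal before launching gui.bat.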

I'm really not sure what all this means. I've tried to research the cause, but to no avail, and I'm now stuck.

If someone can help, that would be great. Thank you :)

TingTingin commented 1 year ago

If you had a GUI error you should ask in the GUI repo. However, the error you hit is an OOM (out of memory) error, meaning you didn't have enough VRAM to complete the operation you were running. To fix it, try to reduce VRAM usage. Without seeing your full settings it's hard to say exactly what to do, but some tips are:
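The usual VRAM-reduction levers in sd-scripts can be sketched as changes to the command from the log above. The flags below are real train_db.py options; choosing these particular ones, and the lowered resolution, are assumptions for a 6 GB GTX 1660 SUPER rather than a tested recipe:

```python
# Sketch: VRAM-reducing tweaks to the train_db.py invocation from the log.
# Flag names are real sd-scripts options; this combination (and the 448x448
# resolution) is an assumption for a 6 GB card, not a verified configuration.
memory_savers = [
    "--gradient_checkpointing",  # recompute activations: slower, much less VRAM
    "--mem_eff_attn",            # memory-efficient attention (alternative to --xformers)
    "--resolution=448,448",      # smaller training resolution shrinks activations
]
cmd = [
    "accelerate", "launch", "--num_cpu_threads_per_process=2", "train_db.py",
    "--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5",
    "--train_batch_size=1",      # already at the minimum in the log
    "--mixed_precision=fp16",
    "--cache_latents",           # already in use; keeps the VAE out of the training loop
    "--optimizer_type=AdamW8bit",
] + memory_savers
print(" ".join(cmd))
```

Closing other GPU-using programs (browser hardware acceleration included) also frees a meaningful slice of a 6 GB card.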

Also, some other tips, since it seems that you're newer: