kohya-ss / sd-scripts


finetune sdxl does not work #718

Open vrgz2022 opened 1 year ago

vrgz2022 commented 1 year ago

The config file was not generated, even though the settings were all correct. Previously, SD 1.5 training ran successfully, but with SDXL the GUI generates the config file and then reports a ./venv/Scripts/python.exe error. If I run .\venv\Scripts\python.exe manually on the command line it works, but only one of the two metadata files is created; the other is never generated.

18:25:01-330823 INFO Save...
18:25:03-014774 INFO Start Finetuning...
18:25:03-016766 INFO ./venv/Scripts/python.exe finetune/merge_captions_to_metadata.py --caption_extension=".caption" "D:/share/finetune/pixar/data" "D:/share/finetune/pixar/config/meta_cap.json" --full_path
18:25:03-027730 INFO ./venv/Scripts/python.exe finetune/prepare_buckets_latents.py "D:/share/finetune/pixar/data" "D:/share/finetune/pixar/config/meta_cap.json" "D:/share/finetune/pixar/config/meta_lat.json" "F:/stable-diffusion-webui/models/Stable-diffusion/sdxl/sd_xl_base_1.0.safetensors" --batch_size=1 --max_resolution=1024,1024 --min_bucket_reso=1024 --max_bucket_reso=2048 --mixed_precision=fp16 --full_path
18:25:03-028727 INFO The command is already running. Please wait for it to finish.
'.' is not recognized as an internal or external command, operable program or batch file.
18:25:03-031716 INFO image_num = 2299
18:25:03-032934 INFO repeats = 22990
18:25:03-033710 INFO max_train_steps = 91960
18:25:03-033710 INFO lr_warmup_steps = 4598
18:25:03-034706 INFO Saving training config to D:/share/finetune/pixar/model\tangbohu-pixar-sdxl_20230805-182503.json...
18:25:03-036046 INFO accelerate launch --num_cpu_threads_per_process=10 "./sdxl_train.py" --pretrained_model_name_or_path="F:/stable-diffusion-webui/models/Stable-diffusion/sdxl/sd_xl_base_1.0.safetensors" --in_json="D:/share/finetune/pixar/config/meta_lat.json" --train_data_dir="D:/share/finetune/pixar/data" --output_dir="D:/share/finetune/pixar/model" --logging_dir="D:/share/finetune/pixar/log" --dataset_repeats=10 --learning_rate=4e-07 --enable_bucket --resolution="1024,1024" --min_bucket_reso=1024 --max_bucket_reso=2048 --save_model_as=safetensors --output_name="tangbohu-pixar-sdxl" --no_half_vae --learning_rate="4e-07" --lr_scheduler="constant_with_warmup" --lr_warmup_steps="4598" --train_batch_size="1" --max_train_steps="91960" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --optimizer_type="Adafactor" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --noise_offset=0.0357 --sample_sampler=euler_a --sample_prompts="D:/share/finetune/pixar/model\sample\prompt.txt" --sample_every_n_epochs="1"

noise_offset is set to 0.0357 / noise_offsetが0.0357に設定されました
prepare tokenizers
Training with captions.

Traceback (most recent call last):
  File "D:\kohya\kohya_ss\sdxl_train.py", line 648, in <module>
    train(args)
  File "D:\kohya\kohya_ss\sdxl_train.py", line 93, in train
    train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
  File "D:\kohya\kohya_ss\library\config_util.py", line 426, in generate_dataset_group_by_blueprint
    dataset = dataset_klass(subsets=subsets, **asdict(dataset_blueprint.params))
  File "D:\kohya\kohya_ss\library\train_util.py", line 1479, in __init__
    raise ValueError(f"no metadata / メタデータファイルがありません: {subset.metadata_file}")
ValueError: no metadata / メタデータファイルがありません: D:/share/finetune/pixar/config/meta_lat.json

Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None, "__main__", mod_spec)
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "...", line 7, in <module>
    sys.exit(main())
  File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 918, in launch_command
    simple_launcher(args)
  File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 580, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\kohya\kohya_ss\venv\Scripts\python.exe', './sdxl_train.py', '--pretrained_model_name_or_path=F:/stable-diffusion-webui/models/Stable-diffusion/sdxl/sd_xl_base_1.0.safetensors', '--in_json=D:/share/finetune/pixar/config/meta_lat.json', '--train_data_dir=D:/share/finetune/pixar/data', '--output_dir=D:/share/finetune/pixar/model', '--logging_dir=D:/share/finetune/pixar/log', '--dataset_repeats=10', '--learning_rate=4e-07', '--enable_bucket', '--resolution=1024,1024', '--min_bucket_reso=1024', '--max_bucket_reso=2048', '--save_model_as=safetensors', '--output_name=tangbohu-pixar-sdxl', '--no_half_vae', '--learning_rate=4e-07', '--lr_scheduler=constant_with_warmup', '--lr_warmup_steps=4598', '--train_batch_size=1', '--max_train_steps=91960', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=Adafactor', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--xformers', '--noise_offset=0.0357', '--sample_sampler=euler_a', '--sample_prompts=D:/share/finetune/pixar/model\sample\prompt.txt', '--sample_every_n_epochs=1']' returned non-zero exit status 1.
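For what it's worth, the step counts in that log are simple arithmetic, so you can sanity-check your own config the same way. A minimal sketch reconstructing the numbers above (the epoch count of 4 and the 5% warmup ratio are inferred from the printed values; they are not shown in the log):

```python
# Reconstruct the step counts printed in the log above (sketch).
image_num = 2299         # images found in the dataset
dataset_repeats = 10     # --dataset_repeats
train_batch_size = 1     # --train_batch_size
epochs = 4               # assumption: inferred from 91960 / 22990
warmup_ratio = 0.05      # assumption: inferred from 4598 / 91960

repeats = image_num * dataset_repeats                    # 22990
max_train_steps = repeats * epochs // train_batch_size   # 91960
lr_warmup_steps = int(max_train_steps * warmup_ratio)    # 4598
print(repeats, max_train_steps, lr_warmup_steps)
```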

jndietz commented 1 year ago

I was just coming here to post this exact issue. I had to work around the missing metadata_cap.json and got that file created manually. However, I can't generate metadata_lat.json.

I tried invoking finetune/prepare_buckets_latents.py manually, but it fails on the .npz files, which exist and sit at the same level as the image files. I'm not much of a Python person, otherwise I might be able to figure out what is going on. The error I'm getting is:

RuntimeError: NaN detected in latents: c:\path\to\image.jpg
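In case it helps anyone debug the same thing, here is a minimal sketch that scans the cached .npz archives for NaNs. It walks every array in each archive rather than assuming anything about sd-scripts' key names, and the folder path is a placeholder:

```python
# Scan cached latent archives for NaN values (sketch; adjust the folder path).
from pathlib import Path
import numpy as np

for npz_path in Path(r"C:\training\your-model-folder\img").rglob("*.npz"):
    with np.load(npz_path) as archive:
        for key in archive.files:
            arr = archive[key]
            # Only floating-point arrays can contain NaN.
            if np.issubdtype(arr.dtype, np.floating) and np.isnan(arr).any():
                print(f"NaN in {npz_path} [{key}]")
```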

Seems like a few folks have figured this out, given the fine-tuned SDXL checkpoints on Civitai.

deepxmatter commented 1 year ago

Can't figure this out either. Any update?

jndietz commented 1 year ago

Spent more time messing with this last night and realized that both metadata_cap.json and metadata_lat.json have the same contents, which doesn't seem right. I'm also still running into the issue I outlined above with the *.npz files.

2kpr commented 1 year ago

Do you all happen to be using fp16 when training?

If so, give bf16 a try and see if that helps. If you still want to use fp16 (it has more precision but less dynamic range than bf16), you might try this VAE when doing so: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix

I saw others trying to train SDXL with fp16 getting: "NaN detected in latents"
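For anyone who wants to test the fixed VAE outside the GUI first, here is a minimal diffusers sketch of loading it in fp16 (this is just an illustration; in the trainer you would instead point the VAE setting at the downloaded file):

```python
# Load the fp16-safe SDXL VAE from Hugging Face (minimal diffusers sketch).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
# This variant was finetuned so its activations stay within fp16 range,
# unlike the stock SDXL VAE, which can overflow to NaN in half precision.
```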

jndietz commented 1 year ago

@2kpr I think that resolved the issue. Thanks for taking the time to respond and suggest that. The OP's original error is about the missing metadata_lat.json, which can be solved by manually invoking merge_captions_to_metadata.py:

python.exe finetune/merge_captions_to_metadata.py --caption_extension=.txt "C:\training\your-model-folder\img" "C:\training\your-model-folder\configs\metadata_cap.json" --recursive --full_path
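If you want to confirm the merge worked, here is a quick sketch that inspects the resulting file. As far as I can tell it maps each image path to its caption, but treat the exact shape as an assumption:

```python
# Quick look at the caption metadata produced above (sketch; example path).
import json

path = r"C:\training\your-model-folder\configs\metadata_cap.json"
with open(path, encoding="utf-8") as f:
    meta = json.load(f)

print(len(meta), "entries")
first_key = next(iter(meta))
print(first_key, "->", meta[first_key])  # expected roughly: {"caption": "..."}
```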

Afterwards, you can run the prepare_buckets_latents.py script:

python.exe finetune/prepare_buckets_latents.py "C:\training\your-model-folder\img" "C:\training\your-model-folder\configs\metadata_cap.json" "C:\training\your-model-folder\configs\metadata_lat.json" "C:/github/stable-diffusion-webui/models/Stable-diffusion/some-model-here.safetensors" --batch_size=1 --max_resolution=1024,1024 --min_bucket_reso=256 --max_bucket_reso=2048 --mixed_precision=bf16
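Before launching training, it is worth checking that this step actually populated the latent metadata, since the OP's ValueError above came from a missing meta_lat.json. A sketch, assuming --full_path was used (so the keys are absolute image paths) and the cached .npz sits next to each image:

```python
# Verify the latent metadata and cached .npz files exist (sketch; example path).
import json
from pathlib import Path

path = r"C:\training\your-model-folder\configs\metadata_lat.json"
with open(path, encoding="utf-8") as f:
    meta = json.load(f)

# Assumption: each cached latent is stored as <image stem>.npz beside the image.
missing = [k for k in meta if not Path(k).with_suffix(".npz").exists()]
print(len(meta), "entries;", len(missing), "missing a cached .npz next to the image")
```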

Finally, start the fine-tuning process:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --pretrained_model_name_or_path="C:/github/stable-diffusion-webui/models/Stable-diffusion/some-model-here.safetensors" --in_json="C:\training\your-model-folder\configs\metadata_lat.json" --train_data_dir="C:\training\your-model-folder\img" --output_dir="C:\training\your-model-folder\output" --logging_dir="C:\training\your-model-folder\logging" --dataset_repeats=5 --learning_rate=1.0 --enable_bucket --resolution="1024,1024" --min_bucket_reso=256 --max_bucket_reso=2048 --save_model_as=safetensors --output_name="some-model-output-name" --cache_text_encoder_outputs --no_half_vae --learning_rate="1.0" --lr_scheduler="cosine_with_restarts" --train_batch_size="1" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Prodigy" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --min_snr_gamma=5 --xformers --bucket_no_upscale --noise_offset=0.05 --adaptive_noise_scale=0.005
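A note on learning_rate=1.0: Prodigy is an adaptive optimizer that estimates its own effective step size, so the nominal learning rate is conventionally left at 1.0. A minimal sketch using the prodigyopt package directly, which is what --optimizer_type="Prodigy" pulls in as far as I know:

```python
# Prodigy adapts its own step size; lr=1.0 is the conventional setting (sketch).
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(4, 4)                    # stand-in for the real network
optimizer = Prodigy(model.parameters(), lr=1.0)  # lr acts as a multiplier here

loss = model(torch.randn(2, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```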

With this, the fine-tuning process is actually running for me now.

vrgz2022 commented 1 year ago

I just updated to the new version, and now the problem is gone!

Fade07w commented 1 year ago

I'm having this issue on a brand-new install of everything. What was the solution for this?

northeastsquare commented 12 months ago

Adding --no_half_vae, as in https://github.com/kohya-ss/sd-scripts/issues/636#issuecomment-1633932885, solved my problem (it keeps the VAE in float32, which avoids the fp16 NaN issue).