kohya-ss / sd-scripts

Apache License 2.0

Given groups=1, weight of size [1536, 16, 2, 2], expected input[4, 4, 128, 96] to have 16 channels, but got 4 channels instead #1419

Open hieusttruyen opened 4 months ago

hieusttruyen commented 4 months ago

Loading settings from /content/fine_tune/config/config_file.toml...
/content/fine_tune/config/config_file
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Training with captions.
loading existing metadata: /content/fine_tune/meta_lat.json
using bucket info in metadata
[Dataset 0]
  batch_size: 4
  resolution: (1024, 1024)
  enable_bucket: True
  network_multiplier: 1.0
  min_bucket_reso: None
  max_bucket_reso: None
  bucket_reso_steps: None
  bucket_no_upscale: None

[Subset 0 of Dataset 0]
  image_dir: "/content/fine_tune/train_data"
  image_count: 30
  num_repeats: 20
  shuffle_caption: False
  keep_tokens: 0
  keep_tokens_separator:
  caption_separator: ,
  secondary_separator: None
  enable_wildcard: False
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  caption_prefix: None
  caption_suffix: None
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1
  token_warmup_step: 0
  alpha_mask: False
  metadata_file: /content/fine_tune/meta_lat.json

[Dataset 0] loading image sizes.
100% 30/30 [00:00<00:00, 691368.79it/s]
make buckets
number of images (including repeats) per bucket:
  bucket 0: resolution (512, 1024), count: 60
  bucket 1: resolution (576, 1024), count: 280
  bucket 2: resolution (704, 1024), count: 40
  bucket 3: resolution (768, 1024), count: 100
  bucket 4: resolution (832, 1024), count: 40
  bucket 5: resolution (1024, 704), count: 20
  bucket 6: resolution (1024, 768), count: 20
  bucket 7: resolution (1024, 1024), count: 40
mean ar error (without repeats): 0.0
prepare accelerator
accelerator device: cuda
Loading SD3 models from /content/pretrained_model/sd3_medium.safetensors
loading model for process 0/1
Building VAE
Loading state dict...
Loaded VAE:
[Dataset 0] caching latents.
checking cache validity...
100% 30/30 [00:00<00:00, 554313.30it/s]
caching latents...
0it [00:00, ?it/s]
loading model for process 0/1
Loading clip_l from /content/pretrained_model/clip_l.safetensors...
Building ClipL
Loading state dict...
Loaded ClipL:
loading model for process 0/1
Loading clip_g from /content/pretrained_model/clip_g.safetensors...
Building ClipG
Loading state dict...
Loaded ClipG:
loading model for process 0/1
Loading t5xxl from /content/pretrained_model/t5xxl_fp16.safetensors...
Building T5XXL
Loading state dict...
Loaded T5XXL:
[Dataset 0] caching text encoder outputs.
checking cache existence...
100% 30/30 [00:00<00:00, 134146.18it/s]
caching text encoder outputs...
0it [00:00, ?it/s]
loading model for process 0/1
Building MMDit
Loading state dict...
Loaded MMDiT:
train mmdit: True
number of models: 1
number of trainable parameters: 2028328000
prepare optimizer, data loader etc.
use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False}
constant_with_warmup will be good
running training
  num examples: 600
  num batches per epoch: 150
  num epochs: 53
  batch size per device: 4
  gradient accumulation steps: 4
  total optimization steps: 2014
steps:   0% 0/2014 [00:00<?, ?it/s]
epoch 1/53
epoch is incremented. current_epoch: 0, epoch: 1
(the line above repeats 8 times)
Traceback (most recent call last):
  File "/content/kohya-trainer/sd3_train.py", line 974, in <module>
    train(args)
  File "/content/kohya-trainer/sd3_train.py", line 750, in train
    model_pred = mmdit(noisy_model_input, timesteps, context=context, y=pool)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/content/kohya-trainer/library/sd3_models.py", line 998, in forward
    x = self.x_embedder(x) + self.cropped_pos_embed(H, W, device=x.device).to(dtype=x.dtype)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/kohya-trainer/library/sd3_models.py", line 298, in forward
    x = self.proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [1536, 16, 2, 2], expected input[4, 4, 128, 96] to have 16 channels, but got 4 channels instead
steps:   0% 0/2014 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'sd3_train.py', '--sample_prompts=/content/fine_tune/config/sample_prompt.toml', '--config_file=/content/fine_tune/config/config_file.toml', '--clip_l=/content/pretrained_model/clip_l.safetensors', '--clip_g=/content/pretrained_model/clip_g.safetensors', '--t5xxl=/content/pretrained_model/t5xxl_fp16.safetensors', '--t5xxl_dtype=fp16']' returned non-zero exit status 1.
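The RuntimeError itself is PyTorch's ordinary Conv2d channel check: the MMDiT patch-embedding conv was built with weight shape [1536, 16, 2, 2], i.e. for SD3's 16-channel latents, but the batch it received has only 4 channels, which is the latent depth produced by an SD1.x/SDXL VAE. Note also that `caching latents... 0it` earlier in the log means every existing cache entry passed the validity check, so previously cached latents were reused rather than re-encoded. A minimal stand-in sketch of the mismatch (the Conv2d below is a hypothetical substitute for the real MMDiT embedder, with shapes copied from the traceback):

```python
import torch
from torch import nn

# Stand-in for the MMDiT patch embedder: weight shape [1536, 16, 2, 2],
# i.e. 16-channel latents patchified by a 2x2 conv with stride 2.
proj = nn.Conv2d(in_channels=16, out_channels=1536, kernel_size=2, stride=2)

bad = torch.zeros(4, 4, 128, 96)    # 4-channel latents (SD1.x/SDXL VAE)
good = torch.zeros(4, 16, 128, 96)  # 16-channel latents (SD3 VAE)

try:
    proj(bad)
except RuntimeError as e:
    # Reproduces the same class of error as in the log:
    # "... expected input[4, 4, 128, 96] to have 16 channels ..."
    print(e)

print(tuple(proj(good).shape))  # (4, 1536, 64, 48)
```

In other words, the model itself is fine; the latents being fed to it appear to have been encoded with a 4-channel VAE rather than the SD3 VAE.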

hieusttruyen commented 3 months ago

@kohya-ss Could you help me, please?
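A quick way to check whether the cached latents in the training folder are the stale 4-channel kind is to inspect them directly. This sketch assumes kohya's latent cache format of per-image `.npz` files stored alongside the images, with the array under a `latents` key; both of those details are assumptions, not confirmed by the log above:

```python
# Hypothetical diagnostic for stale latent caches: list the channel count of
# every cached .npz latent in the training image directory.
from pathlib import Path

import numpy as np


def latent_channel_counts(image_dir, key="latents"):
    """Return {path: channel_count} for each cached .npz latent found.

    Assumes latents are stored as (C, H, W) arrays under `key`.
    """
    counts = {}
    for npz_path in sorted(Path(image_dir).glob("*.npz")):
        with np.load(npz_path) as data:
            if key in data:
                counts[str(npz_path)] = data[key].shape[0]
    return counts


if __name__ == "__main__":
    for path, c in latent_channel_counts("/content/fine_tune/train_data").items():
        note = "OK for SD3" if c == 16 else "stale cache: delete and re-cache with the SD3 VAE"
        print(f"{path}: {c} channels -> {note}")
```

If the arrays report 4 channels, deleting those `.npz` files so the SD3 run re-encodes the images with the 16-channel SD3 VAE should make the shape error go away.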