kohya-ss / sd-scripts

Using alpha_mask has number of dimensions mismatch #1351

rockerBOO commented 4 months ago

When using --alpha_mask with images whose backgrounds were removed using rembg, training fails with the traceback below. This is on commit 0d96e10b3e66d5c6c7096fbeb7626c5be2e98809:


Traceback (most recent call last):
  File "/mnt/900/builds/sd-scripts/train_network.py", line 1156, in <module>
    trainer.train(args)
  File "/mnt/900/builds/sd-scripts/train_network.py", line 919, in train
    loss = apply_masked_loss(loss, batch)
  File "/mnt/900/builds/sd-scripts/library/custom_train_functions.py", line 497, in apply_masked_loss
    mask_image = torch.nn.functional.interpolate(mask_image, size=loss.shape[2:], mode="area")
  File "/mnt/900/builds/sd-scripts/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3961, in interpolate
    raise ValueError(
ValueError: Input and output must have the same number of spatial dimensions, but got input with spatial dimensions of [1, 576, 960] and output size of torch.Size([72, 120]). Please provide input tensor in (N, C, d1, d2, ...,dK) format and output size in (o1, o2, ...,oK) format.

Uncommenting the print line gives this for context:

mask_image: torch.Size([2, 1, 1, 576, 960]), 0.6733689904212952
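
A minimal sketch of why this fails, using the shapes from the print and traceback above (variable names here are illustrative): F.interpolate treats every dimension after (N, C) as spatial, so the doubly-unsqueezed 5-D mask has three spatial dims while size=(72, 120) only has two.

import torch
import torch.nn.functional as F

mask = torch.rand(2, 1, 1, 576, 960)  # the doubly-unsqueezed mask printed above
try:
    F.interpolate(mask, size=(72, 120), mode="area")
except ValueError as e:
    print(e)  # same "spatial dimensions" mismatch as the traceback

out = F.interpolate(mask.squeeze(2), size=(72, 120), mode="area")
print(out.shape)  # torch.Size([2, 1, 72, 120])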

If I swap the following lines, it works:

        mask_image = batch["alpha_masks"].to(dtype=loss.dtype).unsqueeze(1) # add channel dimension
        # mask_image = batch["alpha_masks"].to(dtype=loss.dtype) 

It does seem to work, to a degree, with the lines swapped.

I have seen others get it to work without modifying this line, so it may be some interaction between the dataset and the alpha_mask. I would be happy to try to isolate this.

araleza commented 4 months ago

Maybe post your command line? I got the alpha mask parameter working, so you could compare yours to mine:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --pretrained_model_name_or_path="/home/ara/Documents/Dev/sdxl/training/earthscape/kohya/dreambooth/earthscape-step00002600.safetensors" --sdpa --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --train_data_dir="/home/ara/Documents/Dev/sdxl/training/earthscape/kohya/img" --resolution="1024,1024" --output_dir="/home/ara/Documents/Dev/sdxl/training/earthscape/kohya/dreambooth" --logging_dir="/home/ara/Documents/Dev/sdxl/training/earthscape/kohya/log" --save_model_as=safetensors --vae="/home/ara/Documents/Dev/sdxl/sdxl_vae.safetensors" --output_name="earthscape" --lr_scheduler_num_cycles="20000" --max_token_length=150 --max_data_loader_n_workers="0" --lr_scheduler="constant_with_warmup" --lr_warmup_steps="100" --max_train_steps="16000" --caption_extension=".txt" --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --max_token_length=150 --bucket_reso_steps=32 --v_pred_like_loss="0.5" --save_every_n_steps="200" --save_last_n_steps="600" --min_snr_gamma=5 --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --adaptive_noise_scale=0.00357 --sample_sampler=k_dpm_2 --sample_prompts="/home/ara/Documents/Dev/sdxl/training/earthscape/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="50" --fused_backward_pass --cache_latents --loss_type=huber --train_batch_size="4" --train_text_encoder --learning_rate_te1 1e-9 --learning_rate_te2 0 --learning_rate="4e-7" --flip_aug --enable_wildcard --shuffle_caption --alpha_mask

On a different topic, I wouldn't have high hopes for using background removal (i.e. a tight boundary around the person you're training on) with alpha_mask. The problem I find is that since the background could be anything, the trained network starts generating multiple extra legs etc., as that raises the chance of getting a leg in the unmasked region, and the extra legs are never trained away by the gradient because they end up in the masked background regions.

Best to leave some background unmasked in the area around the person, especially in the lower part of the image where the arms and legs are.
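
One way to loosen a tight rembg mask along those lines, as a sketch (paths and filter sizes here are illustrative, not from the repo): dilate the alpha channel so some background around the subject stays unmasked, then feather the edge so the loss falls off gradually.

from PIL import Image, ImageFilter

img = Image.open("subject_rembg.png").convert("RGBA")  # hypothetical rembg output
alpha = img.getchannel("A")

loose = alpha.filter(ImageFilter.MaxFilter(31))    # grow the kept region
loose = loose.filter(ImageFilter.GaussianBlur(8))  # feather the boundary

img.putalpha(loose)
img.save("subject_loose_mask.png")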

rockerBOO commented 4 months ago

My current config. File paths and such are set separately. I'm using LyCORIS, but I wouldn't think that would affect it.

pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5"
max_train_epochs=4
save_every_n_epochs=1
train_batch_size=1
gradient_accumulation_steps=2
sample_every_n_epochs=1 
sample_sampler="dpmsolver++"
caption_extension=".txt"
ip_noise_gamma=0.1
noise_offset=0.1
adaptive_noise_scale=0.01
noise_offset_random_strength=true
ip_noise_gamma_random_strength=true
gradient_checkpointing=true
alpha_mask=true
network_dim=10
network_alpha=5
network_module = "lycoris.kohya"
network_args = [ 
  "algo=boft",
  "rescale=True",
  "constrain=1e-4",
  "dropout=0.3",
  "rank_dropout=0.15",
  "module_dropout=0.15",
]

debiased_estimation_loss=true
sdpa=true
seed=13337

save_model_as="safetensors"
training_comment="Trained by: rockerBOO"
mixed_precision="fp16"

optimizer_type="PagedAdamW32Bit"
unet_lr=1e-4
text_encoder_lr=5e-5
optimizer_args=["weight_decay=0.01", "betas=(0.9,0.999)"]

loss_type="huber"
huber_schedule="snr" # exponential, constant, or snr. 
huber_c=0.1

log_with = "wandb"

Dataset config:

[general]
shuffle_caption = true
caption_extension = '.txt'

enable_bucket = true
bucket_reso_steps = 64

[[datasets]]
resolution = 768
...

In terms of the masking, I do the automated rembg mask and then manually filter out bad results. I have gotten very good results (in some cases the best results) with my current test, but the backgrounds are not great. I think mixing in non-alpha results could prove even better. I would certainly iterate on my masking technique, but for this current test it balances speed with decent-to-great results.
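
For reference, a sketch of the automated pass (paths are illustrative; this uses rembg's Python API, though the CLI works too):

from pathlib import Path
from PIL import Image
from rembg import remove  # pip install rembg

src, dst = Path("dataset/raw"), Path("dataset/masked")  # hypothetical layout
dst.mkdir(parents=True, exist_ok=True)

for p in sorted(src.glob("*.png")):
    out = remove(Image.open(p))  # RGBA image; removed background gets alpha 0
    out.save(dst / p.name)       # sd-scripts reads this alpha channel as the mask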

Though at this point I'm just curious whether I'm messing it up by not adding that additional channel, and also why it is not working correctly. Maybe something SD1.5-related?

u-haru commented 4 months ago

I checked and it also fails for me because of the dimensions. The alpha mask already has 3 dimensions (1 x H x W) and doesn't need to be unsqueezed.
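
As a sketch of where the extra dimension comes from (shapes taken from the traceback above): ToTensor() already gives each mask a channel dimension, and collating adds the batch dimension, so the unsqueeze(1) produces the broken 5-D shape.

import torch

sample = torch.rand(1, 576, 960)       # one mask after ToTensor(): (C, H, W)
batch = torch.stack([sample, sample])  # collated batch: (2, 1, 576, 960)
print(batch.unsqueeze(1).shape)        # torch.Size([2, 1, 1, 576, 960]) -> the failing shape
print(batch.shape)                     # torch.Size([2, 1, 576, 960])    -> what interpolate expects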

Also, there is a problem when using flip_aug with cache_latents. The conversion using transforms.ToTensor() is missing in cache_batch_latents, so training fails as follows:

  File "E:\LoRA-Scripts\library\train_util.py", line 1207, in __getitem__
    alpha_mask = None if image_info.alpha_mask is None else torch.flip(image_info.alpha_mask, [1])
TypeError: flip(): argument 'input' (position 1) must be Tensor, not numpy.ndarray

Add alpha_mask = transforms.ToTensor()(alpha_mask) to fix.
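
A minimal sketch of the failure and the fix, assuming a 2-D float mask like the one built in cache_batch_latents:

import numpy as np
import torch
from torchvision import transforms

alpha_mask = np.ones((576, 960), dtype=np.float32)  # as in the cache path
# torch.flip(alpha_mask, [1])  # TypeError: 'input' must be Tensor, not numpy.ndarray
alpha_mask = transforms.ToTensor()(alpha_mask)      # Tensor of shape (1, 576, 960)
flipped = torch.flip(alpha_mask, [1])               # works once it is a Tensor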

So, here are all the fixes:

diff --git a/library/custom_train_functions.py b/library/custom_train_functions.py
index 2a513dc..37680b1 100644
--- a/library/custom_train_functions.py
+++ b/library/custom_train_functions.py
@@ -487,7 +487,7 @@ def apply_masked_loss(loss, batch):
         # print(f"conditioning_image: {mask_image.shape}")
     elif "alpha_masks" in batch and batch["alpha_masks"] is not None:
         # alpha mask is 0 to 1
-        mask_image = batch["alpha_masks"].to(dtype=loss.dtype).unsqueeze(1) # add channel dimension
+        mask_image = batch["alpha_masks"].to(dtype=loss.dtype) # mask already has a channel dimension
         # print(f"mask_image: {mask_image.shape}, {mask_image.mean()}")
     else:
         return loss
diff --git a/library/train_util.py b/library/train_util.py
index 1f9f3c5..5795f86 100644
--- a/library/train_util.py
+++ b/library/train_util.py
@@ -2498,6 +2498,7 @@ def cache_batch_latents(
                 alpha_mask = alpha_mask.astype(np.float32) / 255.0
             else:
                 alpha_mask = np.ones_like(image[:, :, 0], dtype=np.float32)
+            alpha_mask = transforms.ToTensor()(alpha_mask)
         else:
             alpha_mask = None
         alpha_masks.append(alpha_mask)
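
After these changes, the alpha-mask branch of apply_masked_loss reduces to something like this sketch (shapes as seen earlier in the thread):

import torch
import torch.nn.functional as F

loss = torch.rand(2, 4, 72, 120)  # per-element loss at latent resolution
mask = torch.rand(2, 1, 576, 960) # collated alpha masks, already (N, 1, H, W)
mask = F.interpolate(mask, size=loss.shape[2:], mode="area")
masked = loss * mask              # broadcasts over the 4 latent channels
print(masked.shape)               # torch.Size([2, 4, 72, 120])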

kohya-ss commented 4 months ago

Thank you for reporting the issue. It was due to confusion between ndarray and Tensor; sorry for the lack of testing. It should now work in all combinations: with and without cache, with and without disk cache, and with and without flip.