huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

subprocess.CalledProcessError when running the example notebook for DreamBooth + LoRA #5455

Closed nvnarayna closed 1 year ago

nvnarayna commented 1 year ago

Describe the bug

A subprocess.CalledProcessError occurs when running the example notebook https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/SDXL_DreamBooth_LoRA_.ipynb on the Colab free tier with a T4 runtime.

Reproduction

Run the notebook up to the training cell: https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/SDXL_DreamBooth_LoRA_.ipynb

Logs

10/19/2023 14:54:26 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'dynamic_thresholding_ratio', 'thresholding', 'variance_type'} was not found in config. Values will be initialized to default values.
{'dropout', 'attention_type'} was not found in config. Values will be initialized to default values.
2023-10-19 14:56:06.660628: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
10/19/2023 14:56:09 - INFO - __main__ - ***** Running training *****
10/19/2023 14:56:09 - INFO - __main__ -   Num examples = 5
10/19/2023 14:56:09 - INFO - __main__ -   Num batches each epoch = 3
10/19/2023 14:56:09 - INFO - __main__ -   Num Epochs = 250
10/19/2023 14:56:09 - INFO - __main__ -   Instantaneous batch size per device = 2
10/19/2023 14:56:09 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
10/19/2023 14:56:09 - INFO - __main__ -   Gradient Accumulation steps = 2
10/19/2023 14:56:09 - INFO - __main__ -   Total optimization steps = 500
Steps:   0% 0/500 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/train_dreambooth_lora_sdxl.py", line 1366, in <module>
    main(args)
  File "/content/train_dreambooth_lora_sdxl.py", line 1106, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 636, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 624, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py", line 1036, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 1138, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 322, in forward
    hidden_states = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 458, in checkpoint
    ret = function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 239, in forward
    attn_output = self.attn2(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 428, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 953, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 223, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 326, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 42, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 348, in _memory_efficient_attention_forward_requires_grad
    inp.validate_inputs()
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/common.py", line 121, in validate_inputs
    raise ValueError(
ValueError: Query/Key/Value should either all have the same dtype, or (in the quantized case) Key/Value should have dtype torch.int32
  query.dtype: torch.float32
  key.dtype  : torch.float16
  value.dtype: torch.float16
Steps:   0% 0/500 [00:15<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--instance_data_dir=dog', '--output_dir=lora-trained-xl-colab', '--mixed_precision=fp16', '--instance_prompt=a photo of sks dog', '--resolution=1024', '--train_batch_size=2', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--learning_rate=1e-4', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--enable_xformers_memory_efficient_attention', '--mixed_precision=fp16', '--use_8bit_adam', '--enable_xformers_memory_efficient_attention', '--max_train_steps=500', '--checkpointing_steps=717', '--seed=0', '--push_to_hub']' returned non-zero exit status 1.
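
The immediate error is xformers' input validation: `memory_efficient_attention` requires query, key, and value to all share a dtype, and here the query is fp32 while key/value are fp16. A minimal sketch of the failing condition (shapes are made up for illustration; requires a CUDA GPU with xformers installed):

```python
import torch
import xformers.ops as xops

# Illustrative (batch, seq_len, heads, head_dim) shapes; only the dtypes matter here.
q = torch.randn(1, 77, 8, 64, device="cuda", dtype=torch.float32)  # fp32 query
k = torch.randn(1, 77, 8, 64, device="cuda", dtype=torch.float16)  # fp16 key
v = torch.randn(1, 77, 8, 64, device="cuda", dtype=torch.float16)  # fp16 value

# This call raises the same ValueError as in the log above:
#   Query/Key/Value should either all have the same dtype ...
# xops.memory_efficient_attention(q, k, v)

# Casting the query to match the key/value dtype avoids the error:
out = xops.memory_efficient_attention(q.half(), k, v)
```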

System Info

Google Colaboratory (free tier, T4 GPU)

Who can help?

@sayakpaul @patrickvonplaten

sayakpaul commented 1 year ago

See https://github.com/huggingface/diffusers/issues/5368#issuecomment-1763805970

nvnarayna commented 1 year ago

That fixed the issue but gave rise to a new one:

Loading pipeline components...:  14% 1/7 [00:03<00:18,  3.05s/it]Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
{'attention_type', 'dropout'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--instance_data_dir=dog', '--output_dir=lora-trained-xl-colab', '--mixed_precision=fp16', '--instance_prompt=a photo of sks dog', '--resolution=1024', '--train_batch_size=2', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--learning_rate=1e-4', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--mixed_precision=fp16', '--use_8bit_adam', '--max_train_steps=500', '--checkpointing_steps=717', '--seed=0', '--push_to_hub']' died with <Signals.SIGKILL: 9>.

However, I now have a file named "pytorch_lora_weights.safetensors". When I try to load it from the Hugging Face Hub, I get: OSError: rando2625/lora-trained-xl-colab does not appear to have a file named pytorch_lora_weights.bin.

sayakpaul commented 1 year ago

The stack trace doesn't point to anything wrong in the script, though.

> However, I now have a file named "pytorch_lora_weights.safetensors". When I try to load it from the Hugging Face Hub, I get: OSError: rando2625/lora-trained-xl-colab does not appear to have a file named pytorch_lora_weights.bin.

How are you running inference?
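
For reference, loading DreamBooth LoRA weights into the SDXL pipeline usually looks like the sketch below (repo id taken from the error message above; it assumes the weights file actually exists in the repo):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the base SDXL pipeline in fp16.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# load_lora_weights resolves the LoRA weights file from the Hub repo; the
# OSError above is consistent with the weights never having been pushed
# to the repo (as confirmed later in the thread).
pipe.load_lora_weights("rando2625/lora-trained-xl-colab")

image = pipe("a photo of sks dog").images[0]
```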

nvnarayna commented 1 year ago

Yeah, the problem was that the weights didn't get pushed to the Hugging Face Hub automatically. I pushed them manually and everything works; nothing was wrong with the script. My bad, sorry.
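
In case it helps someone else, the manual push can be done with huggingface_hub (a minimal sketch; the repo id matches the one from the OSError above):

```python
from huggingface_hub import upload_file

# Push the locally saved LoRA weights to the Hub repo the pipeline expects.
upload_file(
    path_or_fileobj="lora-trained-xl-colab/pytorch_lora_weights.safetensors",
    path_in_repo="pytorch_lora_weights.safetensors",
    repo_id="rando2625/lora-trained-xl-colab",
)
```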

Thanks, Sayak!