kohya-ss / sd-scripts

Apache License 2.0

SDXL LoRA Training Memory requirement increased after updates #661

Closed Arron17 closed 1 year ago

Arron17 commented 1 year ago

When using commit - 747af145ed32eb85205dca144a4e49f25032d130

I am able to train on a 3080 10GB Card without issues.

After updating to the latest commit, I get out of memory issues on every try.

I've even tried to lower the image resolution to very small values like 256x256 and I get the same out of memory errors on the GPU.

I believe something has changed between then and now that has caused this regression. One major thing that seems to have changed is that the newer version uses the StableDiffusionXLPipeline, whereas the old commit does not. Could this be part of the issue?

FurkanGozukara commented 1 year ago

How did you even manage to train with 10GB?

I tried on a 12GB card and always get an out of memory error.

What were your settings?

kohya-ss commented 1 year ago

Hi, I've merged PR #645, and I believe the latest version will work on 10GB VRAM with fp16/bf16. However, please disable sample generation during training when using fp16; it takes a lot of VRAM.

In addition, I think it may even work on 8GB VRAM.
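For reference, a minimal sketch of the kind of launch command this implies (paths and output names are placeholders; the flags themselves are taken from commands appearing later in this thread). Note the absence of --sample_prompts, so no sample images are generated during training:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --pretrained_model_name_or_path="path/to/sd_xl_base_1.0.safetensors" --train_data_dir="path/to/img" --output_dir="path/to/model" --resolution="1024,1024" --network_module=networks.lora --network_dim=8 --network_alpha=1 --learning_rate=0.0004 --lr_scheduler="constant" --train_batch_size=1 --mixed_precision="bf16" --save_precision="bf16" --cache_latents --gradient_checkpointing --xformers --save_model_as=safetensors --output_name="my_lora"

With --mixed_precision="fp16" instead of bf16, keep sample generation disabled as described above.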

FurkanGozukara commented 1 year ago

Hi, I've merged PR #645, and I believe the latest version will work on 10GB VRAM with fp16/bf16. However, please disable sample generation during training when using fp16; it takes a lot of VRAM.

In addition, I think it may even work on 8GB VRAM.

nice

Arron17 commented 1 year ago

Started an attempt with the newest commit; looking good so far. I'll report back if it breaks during training.

Arron17 commented 1 year ago

Seems all good. Looks like the PR has resolved the issue.

FurkanGozukara commented 1 year ago

Seems all good. Looks like the PR has resolved the issue.

What are your settings?

I still get OOM with 12 GB.

FurkanGozukara commented 1 year ago

Hi, I've merged PR #645, and I believe the latest version will work on 10GB VRAM with fp16/bf16. However, please disable sample generation during training when using fp16; it takes a lot of VRAM.

In addition, I think it may even work on 8GB VRAM.

Hello,

I tested many optimizers with the new commit but am still getting OOM.

Any idea why this could be?

I am testing on my second GPU, a completely idle RTX 3060 12 GB.

00:31:52-081849 INFO     Start training LoRA Standard ...
00:31:52-082848 INFO     Valid image folder names found in: F:/kohya sdxl tutorial files\img
00:31:52-083848 INFO     Valid image folder names found in: F:/kohya sdxl tutorial files\reg
00:31:52-084848 INFO     Folder 20_ohwx man: 13 images found
00:31:52-085848 INFO     Folder 20_ohwx man: 260 steps
00:31:52-085848 INFO     Regularisation images are used... Will double the number of steps required...
00:31:52-086848 INFO     Total steps: 260
00:31:52-087847 INFO     Train batch size: 1
00:31:52-087847 INFO     Gradient accumulation steps: 1.0
00:31:52-088848 INFO     Epoch: 10
00:31:52-089848 INFO     Regulatization factor: 2
00:31:52-090848 INFO     max_train_steps (260 / 1 / 1.0 * 10 * 2) = 5200
00:31:52-091849 INFO     stop_text_encoder_training = 0
00:31:52-092848 INFO     lr_warmup_steps = 0
00:31:52-092848 INFO     Saving training config to F:/kohya sdxl tutorial files\model\tutorial_video_20230720-003152.json...
00:31:52-095848 INFO     accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048
                         --pretrained_model_name_or_path="F:/0 models/sd_xl_base_0.9.safetensors" --train_data_dir="F:/kohya sdxl tutorial files\img" --reg_data_dir="F:/kohya sdxl tutorial
                         files\reg" --resolution="1024,1024" --output_dir="F:/kohya sdxl tutorial files\model" --logging_dir="F:/kohya sdxl tutorial files\log" --network_alpha="1"
                         --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="tutorial_video"
                         --lr_scheduler_num_cycles="10" --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5200"
                         --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False
                         warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn --xformers --bucket_no_upscale

Error below:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\97 kohya\kohya_ss\sdxl_train_network.py:174 in <module>                                       │
│                                                                                                  │
│   171 │   args = train_util.read_config_from_file(args, parser)                                  │
│   172 │                                                                                          │
│   173 │   trainer = SdxlNetworkTrainer()                                                         │
│ ❱ 174 │   trainer.train(args)                                                                    │
│   175                                                                                            │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\train_network.py:735 in train                                               │
│                                                                                                  │
│   732 │   │   │   │   │   │   │   latents = batch["latents"].to(accelerator.device)              │
│   733 │   │   │   │   │   │   else:                                                              │
│   734 │   │   │   │   │   │   │   # latentに変換                                                 │
│ ❱ 735 │   │   │   │   │   │   │   latents = vae.encode(batch["images"].to(dtype=vae_dtype)).la   │
│   736 │   │   │   │   │   │   │                                                                  │
│   737 │   │   │   │   │   │   │   # NaNが含まれていれば警告を表示し0に置き換える                 │
│   738 │   │   │   │   │   │   │   if torch.any(torch.isnan(latents)):                            │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\utils\accelerate_utils.py:46 in wrapper    │
│                                                                                                  │
│   43 │   def wrapper(self, *args, **kwargs):                                                     │
│   44 │   │   if hasattr(self, "_hf_hook") and hasattr(self._hf_hook, "pre_forward"):             │
│   45 │   │   │   self._hf_hook.pre_forward(self)                                                 │
│ ❱ 46 │   │   return method(self, *args, **kwargs)                                                │
│   47 │                                                                                           │
│   48 │   return wrapper                                                                          │
│   49                                                                                             │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\models\autoencoder_kl.py:236 in encode     │
│                                                                                                  │
│   233 │   │   │   encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]             │
│   234 │   │   │   h = torch.cat(encoded_slices)                                                  │
│   235 │   │   else:                                                                              │
│ ❱ 236 │   │   │   h = self.encoder(x)                                                            │
│   237 │   │                                                                                      │
│   238 │   │   moments = self.quant_conv(h)                                                       │
│   239 │   │   posterior = DiagonalGaussianDistribution(moments)                                  │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl        │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\models\vae.py:139 in forward               │
│                                                                                                  │
│   136 │   │   else:                                                                              │
│   137 │   │   │   # down                                                                         │
│   138 │   │   │   for down_block in self.down_blocks:                                            │
│ ❱ 139 │   │   │   │   sample = down_block(sample)                                                │
│   140 │   │   │                                                                                  │
│   141 │   │   │   # middle                                                                       │
│   142 │   │   │   sample = self.mid_block(sample)                                                │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl        │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\models\unet_2d_blocks.py:1150 in forward   │
│                                                                                                  │
│   1147 │                                                                                         │
│   1148 │   def forward(self, hidden_states):                                                     │
│   1149 │   │   for resnet in self.resnets:                                                       │
│ ❱ 1150 │   │   │   hidden_states = resnet(hidden_states, temb=None)                              │
│   1151 │   │                                                                                     │
│   1152 │   │   if self.downsamplers is not None:                                                 │
│   1153 │   │   │   for downsampler in self.downsamplers:                                         │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl        │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\models\resnet.py:598 in forward            │
│                                                                                                  │
│   595 │   │   else:                                                                              │
│   596 │   │   │   hidden_states = self.norm1(hidden_states)                                      │
│   597 │   │                                                                                      │
│ ❱ 598 │   │   hidden_states = self.nonlinearity(hidden_states)                                   │
│   599 │   │                                                                                      │
│   600 │   │   if self.upsample is not None:                                                      │
│   601 │   │   │   # upsample_nearest_nhwc fails with large batch sizes. see https://github.com   │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl        │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\activation.py:396 in forward        │
│                                                                                                  │
│    393 │   │   self.inplace = inplace                                                            │
│    394 │                                                                                         │
│    395 │   def forward(self, input: Tensor) -> Tensor:                                           │
│ ❱  396 │   │   return F.silu(input, inplace=self.inplace)                                        │
│    397 │                                                                                         │
│    398 │   def extra_repr(self) -> str:                                                          │
│    399 │   │   inplace_str = 'inplace=True' if self.inplace else ''                              │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\functional.py:2059 in silu                  │
│                                                                                                  │
│   2056 │   │   return handle_torch_function(silu, (input,), input, inplace=inplace)              │
│   2057 │   if inplace:                                                                           │
│   2058 │   │   return torch._C._nn.silu_(input)                                                  │
│ ❱ 2059 │   return torch._C._nn.silu(input)                                                       │
│   2060                                                                                           │
│   2061                                                                                           │
│   2062 def mish(input: Tensor, inplace: bool = False) -> Tensor:                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 12.00 GiB total capacity; 11.01 GiB already allocated; 0 bytes free; 11.24 GiB reserved in total by PyTorch) If
reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|                                                                                                                                                            | 0/5200 [00:24<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Python3108\lib\runpy.py:196 in _run_module_as_main                                            │
│                                                                                                  │
│   193 │   main_globals = sys.modules["__main__"].__dict__                                        │
│   194 │   if alter_argv:                                                                         │
│   195 │   │   sys.argv[0] = mod_spec.origin                                                      │
│ ❱ 196 │   return _run_code(code, main_globals, None,                                             │
│   197 │   │   │   │   │    "__main__", mod_spec)                                                 │
│   198                                                                                            │
│   199 def run_module(mod_name, init_globals=None,                                                │
│                                                                                                  │
│ C:\Python3108\lib\runpy.py:86 in _run_code                                                       │
│                                                                                                  │
│    83 │   │   │   │   │      __loader__ = loader,                                                │
│    84 │   │   │   │   │      __package__ = pkg_name,                                             │
│    85 │   │   │   │   │      __spec__ = mod_spec)                                                │
│ ❱  86 │   exec(code, run_globals)                                                                │
│    87 │   return run_globals                                                                     │
│    88                                                                                            │
│    89 def _run_module_code(code, init_globals=None,                                              │
│                                                                                                  │
│ in <module>:7                                                                                    │
│                                                                                                  │
│   4 from accelerate.commands.accelerate_cli import main                                          │
│   5 if __name__ == '__main__':                                                                   │
│   6 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 7 │   sys.exit(main())                                                                         │
│   8                                                                                              │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py:45 in main     │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:918 in launch_command  │
│                                                                                                  │
│   915 │   elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA   │
│   916 │   │   sagemaker_launcher(defaults, args)                                                 │
│   917 │   else:                                                                                  │
│ ❱ 918 │   │   simple_launcher(args)                                                              │
│   919                                                                                            │
│   920                                                                                            │
│   921 def main():                                                                                │
│                                                                                                  │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:580 in simple_launcher │
│                                                                                                  │
│   577 │   process.wait()                                                                         │
│   578 │   if process.returncode != 0:                                                            │
│   579 │   │   if not args.quiet:                                                                 │
│ ❱ 580 │   │   │   raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)    │
│   581 │   │   else:                                                                              │
│   582 │   │   │   sys.exit(1)                                                                    │
│   583                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['D:\\97 kohya\\kohya_ss\\venv\\Scripts\\python.exe', './sdxl_train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048',
'--pretrained_model_name_or_path=F:/0 models/sd_xl_base_0.9.safetensors', '--train_data_dir=F:/kohya sdxl tutorial files\\img', '--reg_data_dir=F:/kohya sdxl tutorial files\\reg',
'--resolution=1024,1024', '--output_dir=F:/kohya sdxl tutorial files\\model', '--logging_dir=F:/kohya sdxl tutorial files\\log', '--network_alpha=1', '--save_model_as=safetensors',
'--network_module=networks.lora', '--text_encoder_lr=0.0004', '--unet_lr=0.0004', '--network_dim=256', '--output_name=tutorial_video', '--lr_scheduler_num_cycles=10', '--no_half_vae',
'--full_bf16', '--learning_rate=0.0004', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=5200', '--save_every_n_epochs=1', '--mixed_precision=bf16',
'--save_precision=bf16', '--optimizer_type=Adafactor', '--optimizer_args', 'scale_parameter=False', 'relative_step=False', 'warmup_init=False', '--max_data_loader_n_workers=0',
'--bucket_reso_steps=64', '--mem_eff_attn', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
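(As a side note, the allocator hint at the end of the OOM message refers to the PYTORCH_CUDA_ALLOC_CONF environment variable. A sketch of setting it from a Windows cmd prompt before launching, with 512 purely as an illustrative value:

set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" ...

This only mitigates fragmentation; it does not reduce the total memory the training itself needs.)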
kohya-ss commented 1 year ago

Please use the --cache_latents option (and additionally --cache_latents_to_disk). This option makes the VAE unnecessary during training and reduces memory usage.
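As an illustration, that just means adding the two flags to the otherwise unchanged command, e.g.:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" [existing flags] --cache_latents --cache_latents_to_disk

As far as I understand, with --cache_latents_to_disk the encoded latents are written to .npz files next to the training images on the first run, so subsequent runs can skip VAE encoding entirely.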

FurkanGozukara commented 1 year ago

Please use the --cache_latents option (and additionally --cache_latents_to_disk). This option makes the VAE unnecessary during training and reduces memory usage.

Thank you so much for the reply.

Here are 2 more tests I have done. I am testing on an RTX 3060 12 GB with zero memory usage - my second GPU.

This command below gives this error: AssertionError: network for Text Encoder cannot be trained with caching Text Encoder outputs

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048
                         --pretrained_model_name_or_path="F:/0 models/sd_xl_base_0.9.safetensors" --train_data_dir="F:/kohya sdxl tutorial files\img" --reg_data_dir="F:/kohya sdxl tutorial
                         files\reg" --resolution="1024,1024" --output_dir="F:/kohya sdxl tutorial files\model" --logging_dir="F:/kohya sdxl tutorial files\log" --network_alpha="1"
                         --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="tutorial_video"
                         --lr_scheduler_num_cycles="10" --cache_text_encoder_outputs --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1"
                         --max_train_steps="5200" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor"
                         --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn
                         --gradient_checkpointing --xformers --bucket_no_upscale

And this command below is still giving an out-of-VRAM error on the RTX 3060 12 GB (system RAM is 64 GB):

  accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048
                         --pretrained_model_name_or_path="F:/0 models/sd_xl_base_0.9.safetensors" --train_data_dir="F:/kohya sdxl tutorial files\img" --reg_data_dir="F:/kohya sdxl tutorial
                         files\reg" --resolution="1024,1024" --output_dir="F:/kohya sdxl tutorial files\model" --logging_dir="F:/kohya sdxl tutorial files\log" --network_alpha="1"
                         --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="tutorial_video"
                         --lr_scheduler_num_cycles="10" --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5200"
                         --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args
                         scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn --gradient_checkpointing --xformers
                         --bucket_no_upscale
kohya-ss commented 1 year ago

Please specify --network_train_unet_only if you are caching the text encoder outputs.

For the second command, if you don't use the --cache_text_encoder_outputs option, the Text Encoders stay in VRAM and use a lot of it. So please add that option (and also add --network_train_unet_only).
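For illustration, the combination being described is simply (other flags unchanged, assuming latents are also cached as above):

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" [existing flags] --cache_latents --cache_latents_to_disk --cache_text_encoder_outputs --network_train_unet_only

With these, only the U-Net LoRA weights receive gradients, so the VAE and both text encoders can effectively stay out of the training-time VRAM budget.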

evanheckert commented 1 year ago

For what it's worth, even with --cache_text_encoder_outputs and --network_train_unet_only I still run out of memory on an 8GB RTX 3070.

accelerate launch
 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
 --enable_bucket
 --min_bucket_reso=256
 --max_bucket_reso=2048
 --pretrained_model_name_or_path="C:/snip/sd_xl_base_1.0.safetensors"
 --train_data_dir="snip"
 --reg_data_dir="snip"
 --resolution="1024,1024"
 --output_dir="snip"
 --logging_dir="snip"
 --network_alpha="1"
 --save_model_as=safetensors
 --network_module=networks.lora
 --text_encoder_lr=0.0004
 --unet_lr=0.0004
 --network_dim=128
 --output_name="subjectXL"
 --lr_scheduler_num_cycles="10"
 --cache_text_encoder_outputs
 --no_half_vae
 --full_bf16
 --learning_rate="0.0004"
 --lr_scheduler="constant"
 --train_batch_size="1"
 --max_train_steps="11400"
 --save_every_n_epochs="1"
 --mixed_precision="bf16"
 --save_precision="bf16"
 --cache_latents
 --cache_latents_to_disk
 --optimizer_type="Adafactor"
 --optimizer_args scale_parameter=False relative_step=False warmup_init=False
 --max_data_loader_n_workers="0"
 --bucket_reso_steps=64
 --mem_eff_attn
 --gradient_checkpointing
 --xformers
 --bucket_no_upscale
 --network_train_unet_only
kohya-ss commented 1 year ago

128 for network_dim seems too large. 4 or 8 will work.
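For illustration, in the command above that would mean replacing --network_dim=128 with something like:

--network_dim=8 --network_alpha=1

(--network_alpha=1 is what the quoted command already uses.) Lowering the dim shrinks the LoRA parameters, and the corresponding optimizer state, roughly proportionally, which is where the VRAM saving comes from.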

FurkanGozukara commented 1 year ago

For what it's worth, even with --cache_text_encoder_outputs and --network_train_unet_only I still run out of memory on an 8GB RTX 3070.

accelerate launch
 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
 --enable_bucket
 --min_bucket_reso=256
 --max_bucket_reso=2048
 --pretrained_model_name_or_path="C:/snip/sd_xl_base_1.0.safetensors"
 --train_data_dir="snip"
 --reg_data_dir="snip"
 --resolution="1024,1024"
 --output_dir="snip"
 --logging_dir="snip"
 --network_alpha="1"
 --save_model_as=safetensors
 --network_module=networks.lora
 --text_encoder_lr=0.0004
 --unet_lr=0.0004
 --network_dim=128
 --output_name="subjectXL"
 --lr_scheduler_num_cycles="10"
 --cache_text_encoder_outputs
 --no_half_vae
 --full_bf16
 --learning_rate="0.0004"
 --lr_scheduler="constant"
 --train_batch_size="1"
 --max_train_steps="11400"
 --save_every_n_epochs="1"
 --mixed_precision="bf16"
 --save_precision="bf16"
 --cache_latents
 --cache_latents_to_disk
 --optimizer_type="Adafactor"
 --optimizer_args scale_parameter=False relative_step=False warmup_init=False
 --max_data_loader_n_workers="0"
 --bucket_reso_steps=64
 --mem_eff_attn
 --gradient_checkpointing
 --xformers
 --bucket_no_upscale
 --network_train_unet_only

Someone messaged me and this config worked for 12 GB.

Hopefully I will do a test ASAP and let you guys know with my RTX 3060.

Thomface commented 1 year ago

Training works successfully when following the above advice (network_dim 8, U-Net only) on an RTX 3060 Ti (i.e. 8GB VRAM).

launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="C:/ai/models/Stable-diffusion/sd_xl_base_1.0.safetensors" --train_data_dir="C:/ai/lora/Training/img" --reg_data_dir="C:/ai/lora/Training/regularization" --resolution="1024,1024" --output_dir="C:/ai/models/Lora" --logging_dir="C:/ai/lora/Training/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=8 --output_name="sdxlcath" --lr_scheduler_num_cycles="1" --cache_text_encoder_outputs --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5100" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn --gradient_checkpointing --xformers --bucket_no_upscale --network_train_unet_only

2kpr commented 1 year ago

Just wanted to report that training with the text encoder is working on a 3080 Ti 12GB GPU.

If I disable text encoder training I can raise the network_dim to 256, but with text encoder training enabled I had to lower the network_dim to 32. I'm just happy I now have the option to train with or without the text encoder on my 12GB GPU :)

import subprocess

subprocess.run([
    "accelerate",
    "launch",
    "--num_cpu_threads_per_process=8",
    "./sdxl_train_network.py",
    "--enable_bucket",
    "--min_bucket_reso=256",
    "--max_bucket_reso=2048",
    "--pretrained_model_name_or_path=models/sd_xl_base_1.0.safetensors",
    "--train_data_dir=traintest",
    "--resolution=1024,1024",
    "--output_dir=traintest",
    "--logging_dir=traintest",
    "--network_alpha=1",
    "--save_model_as=safetensors",
    "--network_module=networks.lora",
    "--text_encoder_lr=0.0004",
    "--unet_lr=0.0004",
    "--network_dim=32",
    "--output_name=traintest",
    "--lr_scheduler_num_cycles=10",
    "--no_half_vae",
    "--full_bf16",
    "--learning_rate=0.0004",
    "--lr_scheduler=constant",
    "--train_batch_size=1",
    "--max_train_steps=1000",
    "--save_every_n_epochs=1",
    "--mixed_precision=bf16",
    "--save_precision=bf16",
    "--cache_latents",
    "--cache_latents_to_disk",
    "--optimizer_type=Adafactor",
    "--optimizer_args",
    "scale_parameter=False",
    "relative_step=False",
    "warmup_init=False",
    "--max_data_loader_n_workers=0",
    "--bucket_reso_steps=64",
    "--mem_eff_attn",
    "--gradient_checkpointing",
    "--xformers",
    "--bucket_no_upscale"
])
Zeyis commented 1 year ago

For what it's worth, even with --cache_text_encoder_outputs and --network_train_unet_only I still run out of memory on an 8GB RTX 3070.

accelerate launch
 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
 --enable_bucket
 --min_bucket_reso=256
 --max_bucket_reso=2048
 --pretrained_model_name_or_path="C:/snip/sd_xl_base_1.0.safetensors"
 --train_data_dir="snip"
 --reg_data_dir="snip"
 --resolution="1024,1024"
 --output_dir="snip"
 --logging_dir="snip"
 --network_alpha="1"
 --save_model_as=safetensors
 --network_module=networks.lora
 --text_encoder_lr=0.0004
 --unet_lr=0.0004
 --network_dim=128
 --output_name="subjectXL"
 --lr_scheduler_num_cycles="10"
 --cache_text_encoder_outputs
 --no_half_vae
 --full_bf16
 --learning_rate="0.0004"
 --lr_scheduler="constant"
 --train_batch_size="1"
 --max_train_steps="11400"
 --save_every_n_epochs="1"
 --mixed_precision="bf16"
 --save_precision="bf16"
 --cache_latents
 --cache_latents_to_disk
 --optimizer_type="Adafactor"
 --optimizer_args scale_parameter=False relative_step=False warmup_init=False
 --max_data_loader_n_workers="0"
 --bucket_reso_steps=64
 --mem_eff_attn
 --gradient_checkpointing
 --xformers
 --bucket_no_upscale
 --network_train_unet_only

Where do I put this code? I am a noob.

ghost commented 1 year ago

For what it's worth, even with --cache_text_encoder_outputs and --network_train_unet_only I still run out of memory on an 8GB RTX 3070.

accelerate launch
 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
 --enable_bucket
 --min_bucket_reso=256
 --max_bucket_reso=2048
 --pretrained_model_name_or_path="C:/snip/sd_xl_base_1.0.safetensors"
 --train_data_dir="snip"
 --reg_data_dir="snip"
 --resolution="1024,1024"
 --output_dir="snip"
 --logging_dir="snip"
 --network_alpha="1"
 --save_model_as=safetensors
 --network_module=networks.lora
 --text_encoder_lr=0.0004
 --unet_lr=0.0004
 --network_dim=128
 --output_name="subjectXL"
 --lr_scheduler_num_cycles="10"
 --cache_text_encoder_outputs
 --no_half_vae
 --full_bf16
 --learning_rate="0.0004"
 --lr_scheduler="constant"
 --train_batch_size="1"
 --max_train_steps="11400"
 --save_every_n_epochs="1"
 --mixed_precision="bf16"
 --save_precision="bf16"
 --cache_latents
 --cache_latents_to_disk
 --optimizer_type="Adafactor"
 --optimizer_args scale_parameter=False relative_step=False warmup_init=False
 --max_data_loader_n_workers="0"
 --bucket_reso_steps=64
 --mem_eff_attn
 --gradient_checkpointing
 --xformers
 --bucket_no_upscale
 --network_train_unet_only

Where do I put this code? I am a noob.

You must open a cmd window, go to the venv/Scripts directory with the cd command and launch activate.bat, then use cd.. twice to go back to the main directory, and then you can copy-paste the command.
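A sketch of those steps as cmd commands, assuming the repo lives in D:\97 kohya\kohya_ss as in the logs above:

cd /d "D:\97 kohya\kohya_ss"
cd venv\Scripts
activate.bat
cd ..\..
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" [paste the remaining flags here]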

FurkanGozukara commented 1 year ago

For what it's worth, even with --cache_text_encoder_outputs and --network_train_unet_only I still run out of memory on an 8GB RTX 3070.

accelerate launch
 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
 --enable_bucket
 --min_bucket_reso=256
 --max_bucket_reso=2048
 --pretrained_model_name_or_path="C:/snip/sd_xl_base_1.0.safetensors"
 --train_data_dir="snip"
 --reg_data_dir="snip"
 --resolution="1024,1024"
 --output_dir="snip"
 --logging_dir="snip"
 --network_alpha="1"
 --save_model_as=safetensors
 --network_module=networks.lora
 --text_encoder_lr=0.0004
 --unet_lr=0.0004
 --network_dim=128
 --output_name="subjectXL"
 --lr_scheduler_num_cycles="10"
 --cache_text_encoder_outputs
 --no_half_vae
 --full_bf16
 --learning_rate="0.0004"
 --lr_scheduler="constant"
 --train_batch_size="1"
 --max_train_steps="11400"
 --save_every_n_epochs="1"
 --mixed_precision="bf16"
 --save_precision="bf16"
 --cache_latents
 --cache_latents_to_disk
 --optimizer_type="Adafactor"
 --optimizer_args scale_parameter=False relative_step=False warmup_init=False
 --max_data_loader_n_workers="0"
 --bucket_reso_steps=64
 --mem_eff_attn
 --gradient_checkpointing
 --xformers
 --bucket_no_upscale
 --network_train_unet_only

Where do I put this code? I am a noob.

I have shown this in my tutorial if you can't make it work: https://youtu.be/AY6DMBCIZ3A

Watch at 20:46.

nonetrix commented 1 year ago

Training works successfully when following the above advice (network_dim 8, U-Net only) on an RTX 3060 Ti (i.e. 8GB VRAM).

launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="C:/ai/models/Stable-diffusion/sd_xl_base_1.0.safetensors" --train_data_dir="C:/ai/lora/Training/img" --reg_data_dir="C:/ai/lora/Training/regularization" --resolution="1024,1024" --output_dir="C:/ai/models/Lora" --logging_dir="C:/ai/lora/Training/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=8 --output_name="sdxlcath" --lr_scheduler_num_cycles="1" --cache_text_encoder_outputs --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5100" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn --gradient_checkpointing --xformers --bucket_no_upscale --network_train_unet_only

Is it normal for the final result to not resemble the training data at all, like this? [image]

For reference, I am training on this character image

It came out looking like a generic bunny girl, ignoring the design completely. Is this because it's U-Net only? I only let it run for 1500 steps, but that should usually be more than enough - sometimes actually too much. If so, I think it defeats the entire point. Maybe I should try an embedding at this point? Maybe we would have better luck if this were on PyTorch 2.0 instead of what appears to be 1.9.0; to my understanding it's much more optimized than even using xformers.

nonetrix commented 1 year ago

Stability AI said an 8GB LoRA was doable, but I guess they never technically said how much it takes.

FurkanGozukara commented 1 year ago

Stability AI said an 8GB LoRA was doable, but I guess they never technically said how much it takes.

It can only be doable with:

train only the U-Net, train at 768x768

and even then it is a maybe.

Currently a decent LoRA requires 11.5 GB VRAM.

I did a lot of testing: https://youtu.be/sBFGitIvD2A

miguelgargallo commented 1 year ago

What may I be doing wrong?

[screenshots: RTX data, graphics, RAM memory, RAM]

Config log output

Here is my config

21:47:02-697659 INFO     The running process has been terminated.
21:47:05-148626 INFO     Start training LoRA Standard ...
21:47:05-150631 INFO     Checking for duplicate image filenames in training data directory...
21:47:05-152630 INFO     Valid image folder names found in:
                         Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\img
21:47:05-153632 INFO     Valid image folder names found in:
                         Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\reg
21:47:05-154630 INFO     Folder 40_pearlmadagascar rock: 66 Detroitbecomehuman found
21:47:05-155630 INFO     Folder 40_pearlmadagascar rock: 2640 steps
21:47:05-156630 WARNING  Regularisation Detroitbecomehuman are used... Will double the number of steps required...
21:47:05-157630 INFO     Total steps: 2640
21:47:05-157630 INFO     Train batch size: 1
21:47:05-158630 INFO     Gradient accumulation steps: 1
21:47:05-158630 INFO     Epoch: 1
21:47:05-159630 INFO     Regulatization factor: 2
21:47:05-160630 INFO     max_train_steps (2640 / 1 / 1 * 1 * 2) = 5280
21:47:05-161630 INFO     stop_text_encoder_training = 0
21:47:05-162630 INFO     lr_warmup_steps = 528
21:47:05-163630 INFO     Can't use LR warmup with LR Scheduler constant... ignoring...
21:47:05-163630 INFO     Saving training config to
                         Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\model\pearlmadagascar_20230816-21470
                         5.json...
21:47:05-165632 INFO     accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
                         --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048
                         --pretrained_model_name_or_path="Z:/Howarts/Pikachu/Ergonomics/PeterPan/sd_xl_base_1.0.safetensors"
                         --train_data_dir="Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\img"
                         --reg_data_dir="Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\reg"
                         --resolution="512,512"
                         --output_dir="Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\model"
                         --logging_dir="Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\log"
                         --network_alpha="16" --save_model_as=safetensors
                         --network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003
                         --network_dim=32 --output_name="pearlmadagascar" --lr_scheduler_num_cycles="1"
                         --no_half_vae --learning_rate="0.0003" --lr_scheduler="constant"
                         --train_batch_size="1" --max_train_steps="5280" --save_every_n_epochs="1"
                         --mixed_precision="bf16" --save_precision="bf16" --cache_latents
                         --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args
                         scale_parameter=False relative_step=False warmup_init=False
                         --max_data_loader_n_workers="0" --bucket_reso_steps=64
                         --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0

If anyone can help me modify my .json, here is the saved file: PylarAI5.json GitHub repo

My JSON file:

{
  "LoRA_type": "Standard",
  "adaptive_noise_scale": 0,
  "additional_parameters": "",
  "block_alphas": "",
  "block_dims": "",
  "block_lr_zero_threshold": "",
  "bucket_no_upscale": true,
  "bucket_reso_steps": 64,
  "cache_latents": true,
  "cache_latents_to_disk": true,
  "caption_dropout_every_n_epochs": 0.0,
  "caption_dropout_rate": 0,
  "caption_extension": "",
  "clip_skip": "1",
  "color_aug": false,
  "conv_alpha": 1,
  "conv_block_alphas": "",
  "conv_block_dims": "",
  "conv_dim": 1,
  "decompose_both": false,
  "dim_from_weights": false,
  "down_lr_weight": "",
  "enable_bucket": true,
  "epoch": 1,
  "factor": -1,
  "flip_aug": false,
  "full_bf16": false,
  "full_fp16": false,
  "gradient_accumulation_steps": "1",
  "gradient_checkpointing": true,
  "keep_tokens": "0",
  "learning_rate": 0.0003,
  "logging_dir": "Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\\log",
  "lora_network_weights": "",
  "lr_scheduler": "constant",
  "lr_scheduler_num_cycles": "",
  "lr_scheduler_power": "",
  "lr_warmup": 10,
  "max_bucket_reso": 2048,
  "max_data_loader_n_workers": "0",
  "max_resolution": "512,512",
  "max_timestep": 1000,
  "max_token_length": "75",
  "max_train_epochs": "",
  "mem_eff_attn": false,
  "mid_lr_weight": "",
  "min_bucket_reso": 256,
  "min_snr_gamma": 0,
  "min_timestep": 0,
  "mixed_precision": "bf16",
  "model_list": "custom",
  "module_dropout": 0,
  "multires_noise_discount": 0,
  "multires_noise_iterations": 0,
  "network_alpha": 16,
  "network_dim": 32,
  "network_dropout": 0,
  "no_token_padding": false,
  "noise_offset": 0,
  "noise_offset_type": "Original",
  "num_cpu_threads_per_process": 2,
  "optimizer": "Adafactor",
  "optimizer_args": "scale_parameter=False relative_step=False warmup_init=False",
  "output_dir": "Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\\model",
  "output_name": "pearlmadagascar",
  "persistent_data_loader_workers": false,
  "pretrained_model_name_or_path": "D:/Howarts/Pikachu/Ergonomics/PeterPan/sd_xl_base_1.0.safetensors",
  "prior_loss_weight": 1.0,
  "random_crop": false,
  "rank_dropout": 0,
  "reg_data_dir": "Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\\reg",
  "resume": "",
  "sample_every_n_epochs": 0,
  "sample_every_n_steps": 0,
  "sample_prompts": "",
  "sample_sampler": "euler_a",
  "save_every_n_epochs": 1,
  "save_every_n_steps": 0,
  "save_last_n_steps": 0,
  "save_last_n_steps_state": 0,
  "save_model_as": "safetensors",
  "save_precision": "bf16",
  "save_state": false,
  "scale_v_pred_loss_like_noise_pred": false,
  "scale_weight_norms": 0,
  "sdxl": true,
  "sdxl_cache_text_encoder_outputs": false,
  "sdxl_no_half_vae": true,
  "seed": "",
  "shuffle_caption": false,
  "stop_text_encoder_training": 0,
  "text_encoder_lr": 0.0003,
  "train_batch_size": 1,
  "train_data_dir": "Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\\img",
  "train_on_input": true,
  "training_comment": "",
  "unet_lr": 0.0003,
  "unit": 1,
  "up_lr_weight": "",
  "use_cp": false,
  "use_wandb": false,
  "v2": false,
  "v_parameterization": false,
  "vae_batch_size": 0,
  "wandb_api_key": "",
  "weighted_captions": false,
  "xformers": "xformers"
}
nonetrix commented 1 year ago

Stability AI said an 8GB LoRA was doable, but I guess they never technically said how much it takes.

It can only be doable with:

train only the U-Net, train at 768x768

and even then it is a maybe.

Currently a decent LoRA requires 11.5 GB VRAM.

I did a lot of testing: https://youtu.be/sBFGitIvD2A

My point is that U-Net-only training doesn't seem to provide useful results, completely defeating the point of a LoRA model. I would likely get better results with an embedding or something similar.

miguelgargallo commented 1 year ago

For what it's worth, even with --cache_text_encoder_outputs and --network_train_unet_only I still run out of memory on an 8GB RTX 3070.

accelerate launch
 --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
 --enable_bucket
 --min_bucket_reso=256
 --max_bucket_reso=2048
 --pretrained_model_name_or_path="C:/snip/sd_xl_base_1.0.safetensors"
 --train_data_dir="snip"
 --reg_data_dir="snip"
 --resolution="1024,1024"
 --output_dir="snip"
 --logging_dir="snip"
 --network_alpha="1"
 --save_model_as=safetensors
 --network_module=networks.lora
 --text_encoder_lr=0.0004
 --unet_lr=0.0004
 --network_dim=128
 --output_name="subjectXL"
 --lr_scheduler_num_cycles="10"
 --cache_text_encoder_outputs
 --no_half_vae
 --full_bf16
 --learning_rate="0.0004"
 --lr_scheduler="constant"
 --train_batch_size="1"
 --max_train_steps="11400"
 --save_every_n_epochs="1"
 --mixed_precision="bf16"
 --save_precision="bf16"
 --cache_latents
 --cache_latents_to_disk
 --optimizer_type="Adafactor"
 --optimizer_args scale_parameter=False relative_step=False warmup_init=False
 --max_data_loader_n_workers="0"
 --bucket_reso_steps=64
 --mem_eff_attn
 --gradient_checkpointing
 --xformers
 --bucket_no_upscale
 --network_train_unet_only

Where do I put this code? I am a noob.

I have shown this in my tutorial if you can't make it work: https://youtu.be/AY6DMBCIZ3A

Watch at 20:46.

https://youtu.be/AY6DMBCIZ3A?t=1246 - the exact minute on that link, 20:46

FurkanGozukara commented 1 year ago

Stability AI said an 8GB LoRA was doable, but I guess they never technically said how much it takes.

It can only be doable with: train only the U-Net, train at 768x768, and even then it is a maybe. Currently a decent LoRA requires 11.5 GB VRAM. I did a lot of testing: https://youtu.be/sBFGitIvD2A

My point is that U-Net-only training doesn't seem to provide useful results, completely defeating the point of a LoRA model. I would likely get better results with an embedding or something similar.

I will hopefully test and we will see.

Actually, I tested VRAM usage and interestingly it was the same with the train-U-Net-only command :D

miguelgargallo commented 1 year ago

Is there any possibility you could take a look at my config, @FurkanGozukara, to see what could be going wrong?

https://github.com/kohya-ss/sd-scripts/issues/661#issuecomment-1681198483

FurkanGozukara commented 1 year ago

What is your issue?

By the way, don't do 512,512 on SDXL.

Do 1024x1024.

My tutorial works great with an RTX 3060 - 12 GB VRAM.

miguelgargallo commented 1 year ago

What is your issue?

By the way, don't do 512,512 on SDXL.

Do 1024x1024.

My tutorial works great with an RTX 3060 - 12 GB VRAM.

The issue is that by following https://github.com/kohya-ss/sd-scripts/issues/661#issuecomment-1655304053 it takes me 30 hours

FurkanGozukara commented 1 year ago

What is your issue? By the way, don't do 512,512 on SDXL; do 1024x1024. My tutorial works great with an RTX 3060 - 12 GB VRAM.

The issue is that by following the #661 (comment) it takes me 30 hours

That means you are having a VRAM bottleneck.

How much VRAM is being used when you are not training?

miguelgargallo commented 1 year ago

What is your issue? By the way, don't do 512,512 on SDXL; do 1024x1024. My tutorial works great with an RTX 3060 - 12 GB VRAM.

The issue is that by following the #661 (comment) it takes me 30 hours

That means you are having a VRAM bottleneck.

How much VRAM is being used when you are not training?

As far as I know, 12GB, but maybe I am doing something wrong, since it's my first time with LoRA.

[screenshots]

FurkanGozukara commented 1 year ago

What is your issue? By the way, don't do 512,512 on SDXL; do 1024x1024. My tutorial works great with an RTX 3060 - 12 GB VRAM.

The issue is that by following the #661 (comment) it takes me 30 hours

That means you are having a VRAM bottleneck. How much VRAM is being used when you are not training?

As far as I know, 12GB, but maybe I am doing something wrong, since it's my first time with LoRA.

[screenshots]

No - check it from the task bar.

After you restart your computer, how much VRAM does your machine use?

I have shown all this info in the tutorial:

https://youtu.be/sBFGitIvD2A

miguelgargallo commented 1 year ago

What is your issue? By the way, don't do 512,512 on SDXL; do 1024x1024. My tutorial works great with an RTX 3060 - 12 GB VRAM.

The issue is that by following the #661 (comment) it takes me 30 hours

That means you are having a VRAM bottleneck. How much VRAM is being used when you are not training?

As far as I know, 12GB, but maybe I am doing something wrong, since it's my first time with LoRA. [screenshots]

No - check it from the task bar.

After you restart your computer, how much VRAM does your machine use?

I have shown all this info in the tutorial:

https://youtu.be/sBFGitIvD2A

Not much, 0.6 GB.

[screenshot]

FurkanGozukara commented 1 year ago

@miguelgargallo if you can reduce it to 0.5 GB it should work very well.

It can possibly work very well with 0.6 GB too.

You should really try to reduce your VRAM usage while the computer is idle:

lower your monitor resolution, turn off software.

I have shown it all in the video; use the correct settings and the latest libraries.

miguelgargallo commented 1 year ago

I will watch your video twice tomorrow morning! Thanks @FurkanGozukara for your time! I appreciate it! (I will make some tries now)

miguelgargallo commented 1 year ago

Simply thanks @FurkanGozukara <3, simply thanks.

Here is my setup: My Setup, in case someone needs to invest in a workstation for this and many other uses.

miguelgargallo commented 1 year ago

Let's try it now. [screenshot]

FurkanGozukara commented 1 year ago

Simply thanks @FurkanGozukara <3, simply thanks.

Here is my setup: My Setup, in case someone needs to invest in a workstation for this and many other uses.

Looks like it worked nicely.

miguelgargallo commented 1 year ago

Yes, I am very thankful. I have one question: I was training a model late last night, and it did not save anywhere - or is it somewhere? It says /log.

FurkanGozukara commented 1 year ago

Yes, I am very thankful. I have one question: I was training a model late last night, and it did not save anywhere - or is it somewhere? It says /log.

Weird, I never had such a problem. I can't check it right now either.

miguelgargallo commented 1 year ago

Two last questions, @FurkanGozukara: with My Setup, can I add one more RTX 4070 OC without my 1000W PSU exploding 🤣? And one last thing: is it possible to combine two different "Miguels" - two datasets of me, my childhood self and me now - in the same image in the future? Is that a dual LoRA? 🚬 I mean, should I photoshop 12-15 average images and train, or can I train my album from age 16 and the current one separately?

evanheckert commented 1 year ago

So am I reading this thread correctly that a 12GB card will work when training on 768x768, but just barely, so there's no hope for an 8GB card?

FurkanGozukara commented 1 year ago

So am I reading this thread correctly that a 12GB card will work when training on 768x768, but just barely, so there's no hope for an 8GB card?

12 GB can do decent training; I made a video: https://youtu.be/sBFGitIvD2A

For 8 GB you can test 768x768 and train only the U-Net, but it still may not be sufficient.

Probably SD 1.5 is better.

nonetrix commented 1 year ago

Well, I'm debating getting a GPU with 16GB of VRAM, but unfortunately for my price range it's going to have to be AMD, so it might be a bit of extra hassle. But that's nothing I'm not already used to; I already use Linux and am confident in my abilities, so hopefully it's okay.

Bonoolu commented 1 year ago

I tried the commands in this thread, but I always get this error when I try to use the model with stable diffusion:

stable-diffusion-webui/extensions/sd-webui-additional-networks/scripts/lora_compvis.py", line 302, in convert_diffusers_name_to_compvis
        assert cv_name is not None, f"conversion failed: {du_name}. the model may not be trained by `sd-scripts`."
               ^^^^^^^^^^^^^^^^^^^
    AssertionError: conversion failed: lora_unet_input_blocks_4_1_proj_in. the model may not be trained by `sd-scripts`.

Any idea why? I just pulled the latest commits from the sdxl branch of this repo, of automatic1111, and of the additional networks extension.

Here is my full command for reference; I slightly altered it:

accelerate launch \
 --num_cpu_threads_per_process=2 "./sdxl_train_network.py" \
 --min_bucket_reso=256 \
 --max_bucket_reso=2048 \
 --pretrained_model_name_or_path="/home/bonoolu/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors" \
 --train_data_dir="/home/bonoolu/LORA/MyModelName/image" \
 --reg_data_dir="" \
 --resolution="1024,1024" \
 --output_dir="/home/bonoolu/LORA/MyModelName/model" \
 --logging_dir="/home/bonoolu/LORA/MyModelName/log" \
 --network_alpha="1" \
 --save_model_as=safetensors \
 --network_module=networks.lora \
 --text_encoder_lr=0.0004 \
 --unet_lr=0.0004 \
 --network_dim=128 \
 --output_name="MyModelName_sdxl10" \
 --lr_scheduler_num_cycles="10" \
 --cache_text_encoder_outputs \
 --no_half_vae \
 --full_bf16 \
 --learning_rate="0.0004" \
 --lr_scheduler="constant" \
 --train_batch_size="1" \
 --max_train_steps="190" \
 --save_every_n_epochs="1" \
 --mixed_precision="bf16" \
 --save_precision="bf16" \
 --cache_latents \
 --cache_latents_to_disk \
 --optimizer_type="Adafactor" \
 --optimizer_args scale_parameter=False relative_step=False warmup_init=False \
 --max_data_loader_n_workers="0" \
 --bucket_reso_steps=64 \
 --mem_eff_attn \
 --gradient_checkpointing \
 --xformers \
 --bucket_no_upscale \
 --network_train_unet_only

What exactly is lora_unet_input_blocks_4_1_proj_in?

TeKett commented 1 year ago

I would love to know how you people are getting such low VRAM usage. Even with a network dimension of 1 and alpha 1, at 512,512 resolution, I'm using 12GB of VRAM training an XL LoRA, and fine-tuning is using 40GB.

FurkanGozukara commented 1 year ago

I would love to know how you people are getting such low VRAM usage. Even with a network dimension of 1 and alpha 1, at 512,512 resolution, I'm using 12GB of VRAM training an XL LoRA, and fine-tuning is using 40GB.

I am doing full fine-tuning with DreamBooth with the best settings, including the text encoder, under 20 GB: https://youtu.be/EEV8RPohsbw

nonetrix commented 1 year ago

I upgraded my GPU to a card with 16GB of VRAM; unfortunately, however, it's AMD. It has worked well for everything AI with enough tinkering on Linux (TL;DR: use ComfyUI, and llama.cpp or koboldcpp with ROCm build flags, and everything mostly just works), surprisingly, despite what everyone says, but I haven't tried LoRA training yet, and definitely not SDXL. Has anyone had luck?

Enferlain commented 1 year ago

@nonetrix I only tried https://github.com/derrian-distro/LoRA_Easy_Training_Scripts

It worked a month or two ago when I just ran a test, but I tried again today for a normal training run and it's either super slow (13 hours for 2k steps) or OOMs before the first step, so it's not looking too great.

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 1.27 GiB. GPU 0 has a total capacty of 15.98 GiB of which 906.00 MiB is free. Of the allocated memory 14.71 GiB is allocated by PyTorch, and 64.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

FurkanGozukara commented 1 year ago

@nonetrix I only tried https://github.com/derrian-distro/LoRA_Easy_Training_Scripts

It worked a month or two ago when I just ran a test, but I tried again today for a normal training run and it's either super slow (13 hours for 2k steps) or OOMs before the first step, so it's not looking too great.

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 1.27 GiB. GPU 0 has a total capacty of 15.98 GiB of which 906.00 MiB is free. Of the allocated memory 14.71 GiB is allocated by PyTorch, and 64.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

Now you need to enable full fp16 or full bf16.
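For illustration, those correspond to the --full_fp16 / --full_bf16 flags that already appear in the commands earlier in this thread, e.g.:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" [existing flags] --mixed_precision="bf16" --full_bf16

(--full_bf16 pairs with --mixed_precision="bf16", and --full_fp16 with --mixed_precision="fp16"; as far as I understand, gradients are then also kept in the reduced precision, which is where the extra VRAM saving comes from.)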