Closed Arron17 closed 1 year ago
How did you even manage to train with 10GB?
I tried on a 12GB card and always get an out-of-memory error.
What were your settings?
Hi, I've merged PR #645, and I believe the latest version will work on 10GB VRAM with fp16/bf16. However, please disable sample generation during training when using fp16. It takes a lot of VRAM.
In addition, I think it may also work on 8GB VRAM.
nice
Started an attempt with the newest commit, looking good so far, I'll report back if it breaks during.
Seems all good. Looks like the PR has resolved the issue
what are your settings?
i still get OOM with 12 gb
Hello,
I tested many optimizers with the new commit but I am still getting OOM.
Any idea why that could be?
I am testing on my second GPU, an RTX 3060 12 GB, which is 100% empty.
00:31:52-081849 INFO Start training LoRA Standard ...
00:31:52-082848 INFO Valid image folder names found in: F:/kohya sdxl tutorial files\img
00:31:52-083848 INFO Valid image folder names found in: F:/kohya sdxl tutorial files\reg
00:31:52-084848 INFO Folder 20_ohwx man: 13 images found
00:31:52-085848 INFO Folder 20_ohwx man: 260 steps
00:31:52-085848 INFO Regularisation images are used... Will double the number of steps required...
00:31:52-086848 INFO Total steps: 260
00:31:52-087847 INFO Train batch size: 1
00:31:52-087847 INFO Gradient accumulation steps: 1.0
00:31:52-088848 INFO Epoch: 10
00:31:52-089848 INFO Regulatization factor: 2
00:31:52-090848 INFO max_train_steps (260 / 1 / 1.0 * 10 * 2) = 5200
00:31:52-091849 INFO stop_text_encoder_training = 0
00:31:52-092848 INFO lr_warmup_steps = 0
00:31:52-092848 INFO Saving training config to F:/kohya sdxl tutorial files\model\tutorial_video_20230720-003152.json...
00:31:52-095848 INFO accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048
--pretrained_model_name_or_path="F:/0 models/sd_xl_base_0.9.safetensors" --train_data_dir="F:/kohya sdxl tutorial files\img" --reg_data_dir="F:/kohya sdxl tutorial
files\reg" --resolution="1024,1024" --output_dir="F:/kohya sdxl tutorial files\model" --logging_dir="F:/kohya sdxl tutorial files\log" --network_alpha="1"
--save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="tutorial_video"
--lr_scheduler_num_cycles="10" --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5200"
--save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False
warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn --xformers --bucket_no_upscale
error below
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\97 kohya\kohya_ss\sdxl_train_network.py:174 in <module> │
│ │
│ 171 │ args = train_util.read_config_from_file(args, parser) │
│ 172 │ │
│ 173 │ trainer = SdxlNetworkTrainer() │
│ ❱ 174 │ trainer.train(args) │
│ 175 │
│ │
│ D:\97 kohya\kohya_ss\train_network.py:735 in train │
│ │
│ 732 │ │ │ │ │ │ │ latents = batch["latents"].to(accelerator.device) │
│ 733 │ │ │ │ │ │ else: │
│ 734 │ │ │ │ │ │ │ # latentに変換 │
│ ❱ 735 │ │ │ │ │ │ │ latents = vae.encode(batch["images"].to(dtype=vae_dtype)).la │
│ 736 │ │ │ │ │ │ │ │
│ 737 │ │ │ │ │ │ │ # NaNが含まれていれば警告を表示し0に置き換える │
│ 738 │ │ │ │ │ │ │ if torch.any(torch.isnan(latents)): │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\utils\accelerate_utils.py:46 in wrapper │
│ │
│ 43 │ def wrapper(self, *args, **kwargs): │
│ 44 │ │ if hasattr(self, "_hf_hook") and hasattr(self._hf_hook, "pre_forward"): │
│ 45 │ │ │ self._hf_hook.pre_forward(self) │
│ ❱ 46 │ │ return method(self, *args, **kwargs) │
│ 47 │ │
│ 48 │ return wrapper │
│ 49 │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\models\autoencoder_kl.py:236 in encode │
│ │
│ 233 │ │ │ encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)] │
│ 234 │ │ │ h = torch.cat(encoded_slices) │
│ 235 │ │ else: │
│ ❱ 236 │ │ │ h = self.encoder(x) │
│ 237 │ │ │
│ 238 │ │ moments = self.quant_conv(h) │
│ 239 │ │ posterior = DiagonalGaussianDistribution(moments) │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\models\vae.py:139 in forward │
│ │
│ 136 │ │ else: │
│ 137 │ │ │ # down │
│ 138 │ │ │ for down_block in self.down_blocks: │
│ ❱ 139 │ │ │ │ sample = down_block(sample) │
│ 140 │ │ │ │
│ 141 │ │ │ # middle │
│ 142 │ │ │ sample = self.mid_block(sample) │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\models\unet_2d_blocks.py:1150 in forward │
│ │
│ 1147 │ │
│ 1148 │ def forward(self, hidden_states): │
│ 1149 │ │ for resnet in self.resnets: │
│ ❱ 1150 │ │ │ hidden_states = resnet(hidden_states, temb=None) │
│ 1151 │ │ │
│ 1152 │ │ if self.downsamplers is not None: │
│ 1153 │ │ │ for downsampler in self.downsamplers: │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\diffusers\models\resnet.py:598 in forward │
│ │
│ 595 │ │ else: │
│ 596 │ │ │ hidden_states = self.norm1(hidden_states) │
│ 597 │ │ │
│ ❱ 598 │ │ hidden_states = self.nonlinearity(hidden_states) │
│ 599 │ │ │
│ 600 │ │ if self.upsample is not None: │
│ 601 │ │ │ # upsample_nearest_nhwc fails with large batch sizes. see https://github.com │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\modules\activation.py:396 in forward │
│ │
│ 393 │ │ self.inplace = inplace │
│ 394 │ │
│ 395 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 396 │ │ return F.silu(input, inplace=self.inplace) │
│ 397 │ │
│ 398 │ def extra_repr(self) -> str: │
│ 399 │ │ inplace_str = 'inplace=True' if self.inplace else '' │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\torch\nn\functional.py:2059 in silu │
│ │
│ 2056 │ │ return handle_torch_function(silu, (input,), input, inplace=inplace) │
│ 2057 │ if inplace: │
│ 2058 │ │ return torch._C._nn.silu_(input) │
│ ❱ 2059 │ return torch._C._nn.silu(input) │
│ 2060 │
│ 2061 │
│ 2062 def mish(input: Tensor, inplace: bool = False) -> Tensor: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 12.00 GiB total capacity; 11.01 GiB already allocated; 0 bytes free; 11.24 GiB reserved in total by PyTorch) If
reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps: 0%| | 0/5200 [00:24<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Python3108\lib\runpy.py:196 in _run_module_as_main │
│ │
│ 193 │ main_globals = sys.modules["__main__"].__dict__ │
│ 194 │ if alter_argv: │
│ 195 │ │ sys.argv[0] = mod_spec.origin │
│ ❱ 196 │ return _run_code(code, main_globals, None, │
│ 197 │ │ │ │ │ "__main__", mod_spec) │
│ 198 │
│ 199 def run_module(mod_name, init_globals=None, │
│ │
│ C:\Python3108\lib\runpy.py:86 in _run_code │
│ │
│ 83 │ │ │ │ │ __loader__ = loader, │
│ 84 │ │ │ │ │ __package__ = pkg_name, │
│ 85 │ │ │ │ │ __spec__ = mod_spec) │
│ ❱ 86 │ exec(code, run_globals) │
│ 87 │ return run_globals │
│ 88 │
│ 89 def _run_module_code(code, init_globals=None, │
│ │
│ in <module>:7 │
│ │
│ 4 from accelerate.commands.accelerate_cli import main │
│ 5 if __name__ == '__main__': │
│ 6 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 7 │ sys.exit(main()) │
│ 8 │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py:45 in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:918 in launch_command │
│ │
│ 915 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │
│ 916 │ │ sagemaker_launcher(defaults, args) │
│ 917 │ else: │
│ ❱ 918 │ │ simple_launcher(args) │
│ 919 │
│ 920 │
│ 921 def main(): │
│ │
│ D:\97 kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:580 in simple_launcher │
│ │
│ 577 │ process.wait() │
│ 578 │ if process.returncode != 0: │
│ 579 │ │ if not args.quiet: │
│ ❱ 580 │ │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │
│ 581 │ │ else: │
│ 582 │ │ │ sys.exit(1) │
│ 583 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['D:\\97 kohya\\kohya_ss\\venv\\Scripts\\python.exe', './sdxl_train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048',
'--pretrained_model_name_or_path=F:/0 models/sd_xl_base_0.9.safetensors', '--train_data_dir=F:/kohya sdxl tutorial files\\img', '--reg_data_dir=F:/kohya sdxl tutorial files\\reg',
'--resolution=1024,1024', '--output_dir=F:/kohya sdxl tutorial files\\model', '--logging_dir=F:/kohya sdxl tutorial files\\log', '--network_alpha=1', '--save_model_as=safetensors',
'--network_module=networks.lora', '--text_encoder_lr=0.0004', '--unet_lr=0.0004', '--network_dim=256', '--output_name=tutorial_video', '--lr_scheduler_num_cycles=10', '--no_half_vae',
'--full_bf16', '--learning_rate=0.0004', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=5200', '--save_every_n_epochs=1', '--mixed_precision=bf16',
'--save_precision=bf16', '--optimizer_type=Adafactor', '--optimizer_args', 'scale_parameter=False', 'relative_step=False', 'warmup_init=False', '--max_data_loader_n_workers=0',
'--bucket_reso_steps=64', '--mem_eff_attn', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
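A note on reading the OutOfMemoryError above: the message suggests max_split_size_mb only when reserved memory is much larger than allocated memory. A minimal sketch that pulls the figures out of the message and compares them (the helper name parse_gib is made up for this illustration):

```python
import re

# The OOM message from the log above, shortened to the part with the numbers.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 12.00 GiB total capacity; "
    "11.01 GiB already allocated; 0 bytes free; 11.24 GiB reserved in total by PyTorch)"
)


def parse_gib(message: str, label: str) -> float:
    """Return the GiB figure that directly precedes `label` in the message."""
    match = re.search(r"([\d.]+) GiB " + re.escape(label), message)
    if match is None:
        raise ValueError(f"'{label}' not found in message")
    return float(match.group(1))


allocated = parse_gib(OOM_MESSAGE, "already allocated")
reserved = parse_gib(OOM_MESSAGE, "reserved in total")
total = parse_gib(OOM_MESSAGE, "total capacity")

# Reserved-but-unallocated memory is the part that fragmentation tuning could win back.
slack = round(reserved - allocated, 2)
print(f"allocated {allocated} of {total} GiB, allocator slack {slack} GiB")
```

Here the slack is only about 0.23 GiB, less than the 512 MiB being requested, so fragmentation settings alone would probably not be enough and the run itself has to use less memory.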
Please use the --cache_latents option (and additionally --cache_latents_to_disk). This option makes the VAE unnecessary during training and reduces memory usage.
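To illustrate why caching latents takes the VAE out of the training loop, here is a rough sketch of the idea only. It is not how sd-scripts implements it: fake_encode is a placeholder standing in for the VAE encoder, and make_cached_encoder is a hypothetical helper.

```python
from typing import Callable, Dict, List


def make_cached_encoder(encode: Callable[[str], List[float]]):
    """Wrap `encode` so each image path is encoded at most once."""
    cache: Dict[str, List[float]] = {}
    calls = {"count": 0}

    def cached(path: str) -> List[float]:
        if path not in cache:
            calls["count"] += 1          # the expensive encoder only runs on a cache miss
            cache[path] = encode(path)
        return cache[path]

    return cached, calls


def fake_encode(path: str) -> List[float]:
    """Placeholder for the VAE encoder; NOT the real model."""
    return [float(len(path))]


cached_encode, calls = make_cached_encoder(fake_encode)
images = ["a.png", "b.png"]
for _epoch in range(10):
    for image in images:
        cached_encode(image)

print(calls["count"])  # 2: one encode per image, regardless of the number of epochs
```

Once every image has been encoded, the encoder is never called again, which is why it no longer has to sit in VRAM while training.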
Thank you so much for the reply.
Here are 2 more tests I have done. I am testing on my second GPU, an RTX 3060 12 GB with 0 memory usage.
The command below gives this error: AssertionError: network for Text Encoder cannot be trained with caching Text Encoder outputs
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048
--pretrained_model_name_or_path="F:/0 models/sd_xl_base_0.9.safetensors" --train_data_dir="F:/kohya sdxl tutorial files\img" --reg_data_dir="F:/kohya sdxl tutorial
files\reg" --resolution="1024,1024" --output_dir="F:/kohya sdxl tutorial files\model" --logging_dir="F:/kohya sdxl tutorial files\log" --network_alpha="1"
--save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="tutorial_video"
--lr_scheduler_num_cycles="10" --cache_text_encoder_outputs --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1"
--max_train_steps="5200" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor"
--optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn
--gradient_checkpointing --xformers --bucket_no_upscale
And the command below still gives an out-of-VRAM error on the RTX 3060 12 GB (system RAM is 64 GB):
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048
--pretrained_model_name_or_path="F:/0 models/sd_xl_base_0.9.safetensors" --train_data_dir="F:/kohya sdxl tutorial files\img" --reg_data_dir="F:/kohya sdxl tutorial
files\reg" --resolution="1024,1024" --output_dir="F:/kohya sdxl tutorial files\model" --logging_dir="F:/kohya sdxl tutorial files\log" --network_alpha="1"
--save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="tutorial_video"
--lr_scheduler_num_cycles="10" --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5200"
--save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args
scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn --gradient_checkpointing --xformers
--bucket_no_upscale
Please specify --network_train_unet_only if you are caching the text encoder outputs.
For the second command, if you don't use the option --cache_text_encoder_outputs, the Text Encoders stay in VRAM and use a lot of it. So please add that option (and also add --network_train_unet_only).
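The rule above can be checked before launching. A small sketch, assuming the command is available as a list of strings; check_text_encoder_flags is a hypothetical helper, not part of sd-scripts:

```python
from typing import List


def check_text_encoder_flags(args: List[str]) -> List[str]:
    """Return human-readable problems with the text-encoder flags in `args`."""
    problems = []
    caches_te = "--cache_text_encoder_outputs" in args
    unet_only = "--network_train_unet_only" in args
    # Cached text encoder outputs cannot be combined with training the text encoder.
    if caches_te and not unet_only:
        problems.append("--cache_text_encoder_outputs needs --network_train_unet_only")
    return problems


first_command = ["--cache_text_encoder_outputs", "--cache_latents", "--gradient_checkpointing"]
print(check_text_encoder_flags(first_command))
print(check_text_encoder_flags(first_command + ["--network_train_unet_only"]))
```

The first call reports the same conflict as the AssertionError quoted earlier; the second returns an empty list.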
For what it's worth, even with --cache_text_encoder_outputs and --network_train_unet_only I still run out of memory on an 8GB RTX 3070.
accelerate launch
--num_cpu_threads_per_process=2 "./sdxl_train_network.py"
--enable_bucket
--min_bucket_reso=256
--max_bucket_reso=2048
--pretrained_model_name_or_path="C:/snip/sd_xl_base_1.0.safetensors"
--train_data_dir="snip"
--reg_data_dir="snip"
--resolution="1024,1024"
--output_dir="snip"
--logging_dir="snip"
--network_alpha="1"
--save_model_as=safetensors
--network_module=networks.lora
--text_encoder_lr=0.0004
--unet_lr=0.0004
--network_dim=128
--output_name="subjectXL"
--lr_scheduler_num_cycles="10"
--cache_text_encoder_outputs
--no_half_vae
--full_bf16
--learning_rate="0.0004"
--lr_scheduler="constant"
--train_batch_size="1"
--max_train_steps="11400"
--save_every_n_epochs="1"
--mixed_precision="bf16"
--save_precision="bf16"
--cache_latents
--cache_latents_to_disk
--optimizer_type="Adafactor"
--optimizer_args scale_parameter=False relative_step=False warmup_init=False
--max_data_loader_n_workers="0"
--bucket_reso_steps=64
--mem_eff_attn
--gradient_checkpointing
--xformers
--bucket_no_upscale
--network_train_unet_only
128 for network_dim seems too large. 4 or 8 will work.
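As a rough illustration of why network_dim matters: network_dim is the LoRA rank, and the trainable weights that LoRA adds to a linear layer grow linearly with it. The 1280 x 1280 layer size below is a made-up example, not a measurement of the SDXL U-Net, and activation memory is not covered.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights LoRA adds to one linear layer: a (rank x d_in) and a (d_out x rank) matrix."""
    return rank * (d_in + d_out)


# Hypothetical 1280 x 1280 projection; real layer sizes vary across the model.
for rank in (8, 128, 256):
    print(rank, lora_extra_params(1280, 1280, rank))
```

Going from rank 8 to rank 128 multiplies the trainable weights of each adapted layer by 16, and gradients and optimizer state grow along with them.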
someone messaged me and this config worked for 12 gb
hopefully i will do a test asap and let you guys know with my rtx 3060
Training works successfully when following the advice above (network_dim 8, U-Net only) on an RTX 3060 Ti (i.e. 8GB VRAM).
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="C:/ai/models/Stable-diffusion/sd_xl_base_1.0.safetensors" --train_data_dir="C:/ai/lora/Training/img" --reg_data_dir="C:/ai/lora/Training/regularization" --resolution="1024,1024" --output_dir="C:/ai/models/Lora" --logging_dir="C:/ai/lora/Training/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=8 --output_name="sdxlcath" --lr_scheduler_num_cycles="1" --cache_text_encoder_outputs --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5100" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn --gradient_checkpointing --xformers --bucket_no_upscale --network_train_unet_only
Just wanted to report training with the text encoder working on a 3080 Ti 12GB GPU.
If I disable text encoder training I can raise network_dim to 256, but if I enable it I have to lower network_dim to 32. I'm just happy I now have the option to train with or without the text encoder on my 12GB GPU :)
import subprocess

subprocess.run([
"accelerate",
"launch",
"--num_cpu_threads_per_process=8",
"./sdxl_train_network.py",
"--enable_bucket",
"--min_bucket_reso=256",
"--max_bucket_reso=2048",
"--pretrained_model_name_or_path=models/sd_xl_base_1.0.safetensors",
"--train_data_dir=traintest",
"--resolution=1024,1024",
"--output_dir=traintest",
"--logging_dir=traintest",
"--network_alpha=1",
"--save_model_as=safetensors",
"--network_module=networks.lora",
"--text_encoder_lr=0.0004",
"--unet_lr=0.0004",
"--network_dim=32",
"--output_name=traintest",
"--lr_scheduler_num_cycles=10",
"--no_half_vae",
"--full_bf16",
"--learning_rate=0.0004",
"--lr_scheduler=constant",
"--train_batch_size=1",
"--max_train_steps=1000",
"--save_every_n_epochs=1",
"--mixed_precision=bf16",
"--save_precision=bf16",
"--cache_latents",
"--cache_latents_to_disk",
"--optimizer_type=Adafactor",
"--optimizer_args",
"scale_parameter=False",
"relative_step=False",
"warmup_init=False",
"--max_data_loader_n_workers=0",
"--bucket_reso_steps=64",
"--mem_eff_attn",
"--gradient_checkpointing",
"--xformers",
"--bucket_no_upscale"
])
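If you switch between the two setups often, the argument list can be generated instead of edited by hand. A sketch under assumptions: to_cli_args is a hypothetical helper, and the 32 / 256 values are only what worked on the 3080 Ti 12GB reported above, not general limits.

```python
from typing import Dict, List, Union

Option = Union[bool, int, float, str]


def to_cli_args(options: Dict[str, Option]) -> List[str]:
    """Turn {"name": value} into ["--name=value"]; True becomes a bare flag, False is dropped."""
    args: List[str] = []
    for name, value in options.items():
        if value is True:
            args.append(f"--{name}")
        elif value is False:
            continue
        else:
            args.append(f"--{name}={value}")
    return args


train_text_encoder = True
options: Dict[str, Option] = {
    "network_dim": 32 if train_text_encoder else 256,
    "cache_latents": True,
    "gradient_checkpointing": True,
    # Text encoder outputs can only be cached when the text encoder is not trained.
    "network_train_unet_only": not train_text_encoder,
    "cache_text_encoder_outputs": not train_text_encoder,
}
print(to_cli_args(options))
```

Options that take several space-separated values, such as --optimizer_args, are not handled by this helper and would still have to be appended by hand.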
where do I put this code? i am noob
Open a cmd prompt, go to the venv/Scripts directory with the cd command and run activate.bat. Then use cd .. twice to go back to the main directory, and you can copy and paste the command there.
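Activating the venv mainly puts its Scripts directory on PATH, so an alternative is to call the executable inside the venv by its full path. A small sketch that builds that path; venv_executable is a hypothetical helper, and the repository location is taken from the traceback earlier in the thread:

```python
from pathlib import PurePosixPath, PureWindowsPath


def venv_executable(repo_dir: str, name: str, windows: bool) -> str:
    """Path of an executable installed inside `<repo_dir>/venv`."""
    if windows:
        # Windows venvs keep their executables in "Scripts".
        return str(PureWindowsPath(repo_dir) / "venv" / "Scripts" / f"{name}.exe")
    # POSIX venvs keep them in "bin".
    return str(PurePosixPath(repo_dir) / "venv" / "bin" / name)


print(venv_executable(r"D:\97 kohya\kohya_ss", "accelerate", windows=True))
```

The resulting path can replace the plain "accelerate" at the start of the command; the command still has to be run from the main directory so that "./sdxl_train_network.py" resolves.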
I have shown this in my tutorial, if you can't make it work: https://youtu.be/AY6DMBCIZ3A
Watch from 20:46.
Is it normal for the final result to not resemble the training data at all like this?
For reference, I am training on this character
It came out just looking like a generic bunny girl, ignoring the design completely. Is this because it's U-Net only? I only let it run for 1500 steps, but I think that is usually more than enough, sometimes even too much. If so, I think it defeats the entire point. Maybe I should try an embedding at this point? Maybe we would have better luck if this ran on PyTorch 2.0 instead of what appears to be 1.9.0; to my understanding it's much more optimized than even using xformers.
Stability AI said 8GBs Lora was doable, but I guess they never said how much technically
It is only doable if you train only the U-Net and train at 768x768, and even then it is a maybe.
Currently a decent LoRA requires 11.5 GB VRAM.
I did a lot of testing: https://youtu.be/sBFGitIvD2A
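To put a number on the 768x768 suggestion: a quick calculation of how much smaller one latent becomes, assuming the usual 8x VAE downscale and 4 latent channels. This only covers the per-image tensors; the model weights take the same space at any resolution.

```python
def latent_elements(width: int, height: int, downscale: int = 8, channels: int = 4) -> int:
    """Number of values in one latent, assuming an 8x VAE downscale and 4 latent channels."""
    return channels * (width // downscale) * (height // downscale)


full = latent_elements(1024, 1024)
reduced = latent_elements(768, 768)
print(full, reduced, reduced / full)
```

At 768x768 each latent holds 56.25% of the values it holds at 1024x1024, so resolution-dependent memory shrinks by roughly that factor or more, while fixed costs stay the same.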
I mean
Here is my config
1:47:02-697659 INFO The running process has been terminated.
21:47:05-148626 INFO Start training LoRA Standard ...
21:47:05-150631 INFO Checking for duplicate image filenames in training data directory...
21:47:05-152630 INFO Valid image folder names found in:
Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\img
21:47:05-153632 INFO Valid image folder names found in:
Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\reg
21:47:05-154630 INFO Folder 40_pearlmadagascar rock: 66 Detroitbecomehuman found
21:47:05-155630 INFO Folder 40_pearlmadagascar rock: 2640 steps
21:47:05-156630 WARNING Regularisation Detroitbecomehuman are used... Will double the number of steps required...
21:47:05-157630 INFO Total steps: 2640
21:47:05-157630 INFO Train batch size: 1
21:47:05-158630 INFO Gradient accumulation steps: 1
21:47:05-158630 INFO Epoch: 1
21:47:05-159630 INFO Regulatization factor: 2
21:47:05-160630 INFO max_train_steps (2640 / 1 / 1 * 1 * 2) = 5280
21:47:05-161630 INFO stop_text_encoder_training = 0
21:47:05-162630 INFO lr_warmup_steps = 528
21:47:05-163630 INFO Can't use LR warmup with LR Scheduler constant... ignoring...
21:47:05-163630 INFO Saving training config to Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\model\pearlmadagascar_20230816-214705.json...
21:47:05-165632 INFO accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py"
--enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048
--pretrained_model_name_or_path="Z:/Howarts/Pikachu/Ergonomics/PeterPan/sd_xl_base_1.0.safetensors"
--train_data_dir="Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\img"
--reg_data_dir="Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\reg"
--resolution="512,512"
--output_dir="Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\model"
--logging_dir="Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\log"
--network_alpha="16" --save_model_as=safetensors
--network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003
--network_dim=32 --output_name="pearlmadagascar" --lr_scheduler_num_cycles="1"
--no_half_vae --learning_rate="0.0003" --lr_scheduler="constant"
--train_batch_size="1" --max_train_steps="5280" --save_every_n_epochs="1"
--mixed_precision="bf16" --save_precision="bf16" --cache_latents
--cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args
scale_parameter=False relative_step=False warmup_init=False
--max_data_loader_n_workers="0" --bucket_reso_steps=64
--gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0
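For anyone wondering where the max_train_steps figure in these logs comes from, the arithmetic can be reproduced as below. This is a sketch that follows the formula printed in the log lines; the function name is made up, and the GUI may round differently when the batch size does not divide the step count evenly.

```python
def max_train_steps(images: int, repeats: int, batch_size: int,
                    grad_accum: int, epochs: int, uses_reg_images: bool) -> int:
    """Reproduce the step arithmetic printed in the GUI log lines."""
    steps_per_epoch = images * repeats          # e.g. 66 images x 40 repeats = 2640
    reg_factor = 2 if uses_reg_images else 1    # the log doubles the steps when reg images are used
    return int(steps_per_epoch / batch_size / grad_accum * epochs * reg_factor)


print(max_train_steps(66, 40, 1, 1, 1, True))    # 5280, as in the log above
print(max_train_steps(13, 20, 1, 1, 10, True))   # 5200, as in the first log of this thread
```

The repeat count is the number prefix of the image folder name (the 40 in "40_pearlmadagascar rock").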
If anyone can help me modify my .json, here is the saved file: PylarAI5.json (GitHub repo)
{
"LoRA_type": "Standard",
"adaptive_noise_scale": 0,
"additional_parameters": "",
"block_alphas": "",
"block_dims": "",
"block_lr_zero_threshold": "",
"bucket_no_upscale": true,
"bucket_reso_steps": 64,
"cache_latents": true,
"cache_latents_to_disk": true,
"caption_dropout_every_n_epochs": 0.0,
"caption_dropout_rate": 0,
"caption_extension": "",
"clip_skip": "1",
"color_aug": false,
"conv_alpha": 1,
"conv_block_alphas": "",
"conv_block_dims": "",
"conv_dim": 1,
"decompose_both": false,
"dim_from_weights": false,
"down_lr_weight": "",
"enable_bucket": true,
"epoch": 1,
"factor": -1,
"flip_aug": false,
"full_bf16": false,
"full_fp16": false,
"gradient_accumulation_steps": "1",
"gradient_checkpointing": true,
"keep_tokens": "0",
"learning_rate": 0.0003,
"logging_dir": "Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\\log",
"lora_network_weights": "",
"lr_scheduler": "constant",
"lr_scheduler_num_cycles": "",
"lr_scheduler_power": "",
"lr_warmup": 10,
"max_bucket_reso": 2048,
"max_data_loader_n_workers": "0",
"max_resolution": "512,512",
"max_timestep": 1000,
"max_token_length": "75",
"max_train_epochs": "",
"mem_eff_attn": false,
"mid_lr_weight": "",
"min_bucket_reso": 256,
"min_snr_gamma": 0,
"min_timestep": 0,
"mixed_precision": "bf16",
"model_list": "custom",
"module_dropout": 0,
"multires_noise_discount": 0,
"multires_noise_iterations": 0,
"network_alpha": 16,
"network_dim": 32,
"network_dropout": 0,
"no_token_padding": false,
"noise_offset": 0,
"noise_offset_type": "Original",
"num_cpu_threads_per_process": 2,
"optimizer": "Adafactor",
"optimizer_args": "scale_parameter=False relative_step=False warmup_init=False",
"output_dir": "Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\\model",
"output_name": "pearlmadagascar",
"persistent_data_loader_workers": false,
"pretrained_model_name_or_path": "D:/Howarts/Pikachu/Ergonomics/PeterPan/sd_xl_base_1.0.safetensors",
"prior_loss_weight": 1.0,
"random_crop": false,
"rank_dropout": 0,
"reg_data_dir": "Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\\reg",
"resume": "",
"sample_every_n_epochs": 0,
"sample_every_n_steps": 0,
"sample_prompts": "",
"sample_sampler": "euler_a",
"save_every_n_epochs": 1,
"save_every_n_steps": 0,
"save_last_n_steps": 0,
"save_last_n_steps_state": 0,
"save_model_as": "safetensors",
"save_precision": "bf16",
"save_state": false,
"scale_v_pred_loss_like_noise_pred": false,
"scale_weight_norms": 0,
"sdxl": true,
"sdxl_cache_text_encoder_outputs": false,
"sdxl_no_half_vae": true,
"seed": "",
"shuffle_caption": false,
"stop_text_encoder_training": 0,
"text_encoder_lr": 0.0003,
"train_batch_size": 1,
"train_data_dir": "Z:/BlackBerry/Detroitbecomehuman/WallStreetJournal/pearl/output\\img",
"train_on_input": true,
"training_comment": "",
"unet_lr": 0.0003,
"unit": 1,
"up_lr_weight": "",
"use_cp": false,
"use_wandb": false,
"v2": false,
"v_parameterization": false,
"vae_batch_size": 0,
"wandb_api_key": "",
"weighted_captions": false,
"xformers": "xformers"
}
Stability AI said LoRA training on 8 GB was doable, but I guess they never said how doable, technically
it may only be doable if you train only the unet at 768x768, and even that is a maybe
currently a decent lora requires 11.5 gb VRAM
i did a lot of testing : https://youtu.be/sBFGitIvD2A
My point is that unet-only training doesn't seem to provide useful results, completely defeating the point of a LoRA model. I would likely get better results with an embedding or something similar
For what it's worth, even with --cache_text_encoder_outputs and --network_train_unet_only I still run out of memory on an 8GB RTX 3070.
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="C:/snip/sd_xl_base_1.0.safetensors" --train_data_dir="snip" --reg_data_dir="snip" --resolution="1024,1024" --output_dir="snip" --logging_dir="snip" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=128 --output_name="subjectXL" --lr_scheduler_num_cycles="10" --cache_text_encoder_outputs --no_half_vae --full_bf16 --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="11400" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn --gradient_checkpointing --xformers --bucket_no_upscale --network_train_unet_only
where do I put this code? i am noob
i have shown this in my tutorial, if you can't make it work : https://youtu.be/AY6DMBCIZ3A
watch from 20:46, this link jumps to that exact minute : https://youtu.be/AY6DMBCIZ3A?t=1246
i will hopefully test and we will see
actually i tested vram usage and, interestingly, it was the same with the train-unet-only option :D
@FurkanGozukara, could you take a look at my config and see what might be going wrong?
https://github.com/kohya-ss/sd-scripts/issues/661#issuecomment-1681198483
what is your issue?
by the way dont do 512,512 on sdxl
do 1024x1024
my tutorial works great with rtx 3060 - 12 gb vram
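For anyone checking their own config against the advice above, here is a minimal Python sketch. It assumes the config is a JSON file like the one posted earlier in this thread (keys `sdxl`, `max_resolution`, `mixed_precision`, `sample_every_n_epochs`, `sample_every_n_steps`). The function name `check_sdxl_config` is made up for illustration and is not part of sd-scripts.

```python
import json


def check_sdxl_config(cfg):
    """Return warnings for settings that this thread flags as problematic."""
    warnings = []

    if cfg.get("sdxl"):
        # The posted config stores the resolution as a "width,height" string.
        raw = str(cfg.get("max_resolution", "1024,1024"))
        width, height = (int(part) for part in raw.split(","))
        if (width, height) != (1024, 1024):
            warnings.append(
                f"max_resolution is {width}x{height}; the thread suggests 1024x1024 for SDXL"
            )

    # Earlier in the thread, sample generation during fp16 training
    # is said to take a lot of VRAM.
    sampling = int(cfg.get("sample_every_n_epochs") or 0) + int(
        cfg.get("sample_every_n_steps") or 0
    )
    if cfg.get("mixed_precision") == "fp16" and sampling > 0:
        warnings.append("sample generation is enabled with fp16; consider disabling it")

    return warnings


# Example with a trimmed-down version of the config posted above.
example = json.loads(
    '{"sdxl": true, "max_resolution": "512,512", "mixed_precision": "bf16",'
    ' "sample_every_n_epochs": 0, "sample_every_n_steps": 0}'
)
for warning in check_sdxl_config(example):
    print(warning)
```

This only mirrors what people in this thread recommend; it does not predict whether a given card will run out of memory.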
The issue is that when I follow https://github.com/kohya-ss/sd-scripts/issues/661#issuecomment-1655304053, training takes 30 hours
that means you are hitting a vram bottleneck
how much vram is being used when you are not training?
As far as I know, 12GB, but maybe I am doing something wrong, since it's my first time with LoRA
no, check it from the task bar
after you restart your computer, how much vram does your machine use?
i have shown all this info in the tutorial
not much 0.6gb
@miguelgargallo if you can reduce it to 0.5 gb it should work very well
it can work very well with 0.6 gb too possibly
you should really try to reduce your vram usage when the computer is idle
lower your monitor resolution, close other software
i showed all of this in the video; also use the correct settings and the latest libraries
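The idle-VRAM check discussed above can also be done from a script on NVIDIA cards. The sketch below is an assumption-laden alternative to looking at the task bar: it expects `nvidia-smi` to be on the PATH, and the helper names `parse_memory_csv` and `query_vram` are made up for illustration.

```python
import subprocess


def parse_memory_csv(text):
    """Parse lines of 'used, total' (in MiB), one line per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        gpus.append((used, total))
    return gpus


def query_vram():
    """Ask nvidia-smi for used and total memory per GPU, in MiB."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)


# Parsing a made-up sample line, so this runs without a GPU.
# On a real machine, call query_vram() before starting training instead.
for index, (used, total) in enumerate(parse_memory_csv("614, 12288\n")):
    print(f"GPU {index}: {used} MiB used of {total} MiB while idle")
```

Run it right after a reboot, before launching training, to get the idle figure people quote in this thread.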
I will watch your video twice tomorrow morning! Thanks @FurkanGozukara for your time, I appreciate it! (I will make some attempts now)
Simply thanks @FurkanGozukara <3
Here is my setup (My Setup), in case someone needs to invest in a workstation for this and many other uses
Let's try it now,
looks like it worked nicely
yes, I am very thankful. I have one question. I was training a model late last night and did not save it anywhere. Is it stored somewhere? It says /log
weird never had such problem. currently cant check either
Two last questions, @FurkanGozukara. With My Setup, can I add one more RTX 4070 OC without my 1000W box exploding 🤣? And one last thing: is it possible to put two different "Miguels" in the same image, from 2 datasets of myself, me in my childhood and me now? Is that a dual LoRA? 🚬 I mean, Photoshop 12/15 average images and train? Or can I train my album from when I was 16 and my current one separately?
So am I reading this thread correctly that a 12GB card will work when training on 768x768, but just barely, so there's no hope for an 8GB card?
12 gb can do decent training i made a video : https://youtu.be/sBFGitIvD2A
for 8 gb you can test 768 x 768 and train only unet but still may not be sufficient
probably sd 1.5 better
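To put the 768x768 versus 1024x1024 suggestion in numbers, here is a small arithmetic sketch. It assumes the usual 8x spatial downscale of the Stable Diffusion VAE; the function names are made up for illustration. The ratio only describes the latent area the U-Net processes per image. Model weights and optimizer state do not shrink with resolution, so total VRAM use falls by less than this.

```python
def latent_shape(width, height, vae_scale=8):
    """Spatial size of the latent for a given image size (assumes 8x downscale)."""
    return width // vae_scale, height // vae_scale


def relative_latent_area(small, large):
    """Ratio of latent areas between two (width, height) resolutions."""
    small_w, small_h = latent_shape(*small)
    large_w, large_h = latent_shape(*large)
    return (small_w * small_h) / (large_w * large_h)


print(latent_shape(1024, 1024))                          # (128, 128)
print(latent_shape(768, 768))                            # (96, 96)
print(relative_latent_area((768, 768), (1024, 1024)))    # 0.5625
```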
Well, I'm debating getting a GPU with 16 GB of VRAM, but unfortunately for my price range it's going to have to be AMD, so it might be a bit of an extra hassle. But it's nothing I'm not already used to. I already use Linux and am confident in my abilities, so hopefully it's okay
I tried the commands in this thread, but i always get this error when i try to use the model with stable diffusion:
stable-diffusion-webui/extensions/sd-webui-additional-networks/scripts/lora_compvis.py", line 302, in convert_diffusers_name_to_compvis
assert cv_name is not None, f"conversion failed: {du_name}. the model may not be trained by `sd-scripts`."
^^^^^^^^^^^^^^^^^^^
AssertionError: conversion failed: lora_unet_input_blocks_4_1_proj_in. the model may not be trained by `sd-scripts`.
Any idea why? I just pulled the latest commits from the sdxl branch of this repo, of automatic1111, and of the additional networks extension.
Here is my full command for reference, i slightly altered it:
accelerate launch \
--num_cpu_threads_per_process=2 "./sdxl_train_network.py" \
--min_bucket_reso=256 \
--max_bucket_reso=2048 \
--pretrained_model_name_or_path="/home/bonoolu/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors" \
--train_data_dir="/home/bonoolu/LORA/MyModelName/image" \
--reg_data_dir="" \
--resolution="1024,1024" \
--output_dir="/home/bonoolu/LORA/MyModelName/model" \
--logging_dir="/home/bonoolu/LORA/MyModelName/log" \
--network_alpha="1" \
--save_model_as=safetensors \
--network_module=networks.lora \
--text_encoder_lr=0.0004 \
--unet_lr=0.0004 \
--network_dim=128 \
--output_name="MyModelName_sdxl10" \
--lr_scheduler_num_cycles="10" \
--cache_text_encoder_outputs \
--no_half_vae \
--full_bf16 \
--learning_rate="0.0004" \
--lr_scheduler="constant" \
--train_batch_size="1" \
--max_train_steps="190" \
--save_every_n_epochs="1" \
--mixed_precision="bf16" \
--save_precision="bf16" \
--cache_latents \
--cache_latents_to_disk \
--optimizer_type="Adafactor" \
--optimizer_args scale_parameter=False relative_step=False warmup_init=False \
--max_data_loader_n_workers="0" \
--bucket_reso_steps=64 \
--mem_eff_attn \
--gradient_checkpointing \
--xformers \
--bucket_no_upscale \
--network_train_unet_only
What exactly is lora_unet_input_blocks_4_1_proj_in?
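One way to investigate the name in that error is to list the module names stored in the LoRA file and see which naming scheme they follow. The sketch below rests on assumptions: that keys look like `<module>.lora_down.weight`, and that the extension's converter (`convert_diffusers_name_to_compvis`, per the traceback) expects U-Net names starting with `down_blocks`, `mid_block` or `up_blocks`, while this file uses `input_blocks`-style names. The labels `original-style` and `diffusers-style` and the helper names are made up for illustration.

```python
from collections import Counter

ORIGINAL_STYLE = ("input_blocks", "middle_block", "output_blocks")
DIFFUSERS_STYLE = ("down_blocks", "mid_block", "up_blocks")
UNET_PREFIX = "lora_unet_"


def split_key(key):
    """Split 'module.param' into the LoRA module name and the parameter suffix."""
    module, _, param = key.partition(".")
    return module, param


def naming_style(module):
    """Guess which U-Net naming scheme a LoRA module name follows."""
    if not module.startswith(UNET_PREFIX):
        return "non-unet"
    rest = module[len(UNET_PREFIX):]
    if rest.startswith(ORIGINAL_STYLE):
        return "original-style"
    if rest.startswith(DIFFUSERS_STYLE):
        return "diffusers-style"
    return "unknown"


def summarize(keys):
    """Count keys per naming style."""
    return Counter(naming_style(split_key(key)[0]) for key in keys)


# Hypothetical keys; with the safetensors package installed, real keys could be
# read with: safetensors.safe_open(path, framework="pt").keys()
sample_keys = [
    "lora_unet_input_blocks_4_1_proj_in.lora_down.weight",
    "lora_unet_input_blocks_4_1_proj_in.lora_up.weight",
    "lora_unet_down_blocks_0_attentions_0_proj_in.lora_down.weight",
]
print(summarize(sample_keys))
```

If every U-Net key in the file comes out as `original-style`, that would be consistent with the converter not recognising the names, though this is a guess from the traceback and not a confirmed diagnosis.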
I would love to know how you ppl are getting so low vram usage, even on a network dimension of 1 and alpha 1, with 512,512 resolution im using 12gb of vram training XL Lora. and finetuning is using 40gb.
i am doing full fine tuning with dreambooth with best settings including text encoder under 20 gb : https://youtu.be/EEV8RPohsbw
I upgraded my GPU to a card with 16 GB of VRAM; unfortunately, it's AMD. Surprisingly, despite what everyone says, it has worked well for everything AI with enough tinkering on Linux (TL;DR: use ComfyUI, and llama.cpp or KoboldCpp with ROCm build flags, and everything just werks, mostly). But I haven't tried LoRA training yet, and definitely not SDXL. Has anyone had any luck?
@nonetrix I only tried https://github.com/derrian-distro/LoRA_Easy_Training_Scripts
It worked a month or 2 ago when I just ran a test, but I tried again today for a normal training run and it's either super fucking slow (13 hours for 2k steps) or ooms before the first step, so not looking too great.
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 1.27 GiB. GPU 0 has a total capacty of 15.98 GiB of which 906.00 MiB is free. Of the allocated memory 14.71 GiB is allocated by PyTorch, and 64.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
now you need to enable full fp16 or full bf16
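The numbers in the out-of-memory message above can be pulled apart to see how far short the allocation was. This is a rough sketch that assumes the message wording shown in this thread; the regular expressions and the name `parse_oom` are made up for illustration and may not match other PyTorch versions.

```python
import re

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}
SIZE = r"([\d.]+) (MiB|GiB)"


def _find_mib(pattern, message):
    """Return the first size matched by pattern, converted to MiB."""
    match = re.search(pattern, message)
    if match is None:
        raise ValueError(f"pattern not found: {pattern}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message):
    """Extract requested, free and reserved-but-unallocated memory in MiB."""
    requested = _find_mib(rf"Tried to allocate {SIZE}", message)
    free = _find_mib(rf"of which {SIZE} is free", message)
    reserved = _find_mib(rf"{SIZE} is reserved by PyTorch but unallocated", message)
    return {
        "requested": requested,
        "free": free,
        "reserved": reserved,
        "shortfall": requested - free,
    }


message = (
    "HIP out of memory. Tried to allocate 1.27 GiB. GPU 0 has a total capacty of "
    "15.98 GiB of which 906.00 MiB is free. Of the allocated memory 14.71 GiB is "
    "allocated by PyTorch, and 64.44 MiB is reserved by PyTorch but unallocated."
)
for name, mib in parse_oom(message).items():
    print(f"{name}: {mib:.2f} MiB")
```

Here the reserved-but-unallocated amount is much smaller than the shortfall, so the `max_split_size_mb` hint in the message probably would not be enough by itself, which fits the advice to cut memory use instead.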
When using commit - 747af145ed32eb85205dca144a4e49f25032d130
I am able to train on a 3080 10GB Card without issues.
After updating to the latest commit, I get out of memory issues on every try.
I've even tried to lower the image resolution to very small values like 256x256 and I get the same out of memory errors on the GPU.
I believe something changed between those commits that caused this regression. One major change seems to be that the newer version uses the StableDiffusionXLPipeline, whereas the old commit does not. Could this be part of the issue?