kohya-ss / sd-scripts


faster-block-swap works amazingly well, but it has problems for GPUs with 12 GB and below #1764

Closed · FurkanGozukara closed this issue 1 week ago

FurkanGozukara commented 3 weeks ago

First of all, @kohya-ss, this is amazing work.

My config that previously ran at 10.2 seconds/it with 23.1 GB VRAM now runs at 7.08 seconds/it and uses 21.8 GB.

My config that previously ran at 13.8 seconds/it with 15.1 GB is now at 9.06 seconds/it and still uses the same VRAM.

The issue starts after this: it can now swap at most 28 blocks.

Previously the limit was higher, so it could train even with 36 blocks swapped, on GPUs with as little as 6 GB.

With 29 block swap I got the error below; a sketch of the likely cause follows the log.

2024-11-05 23:46:52 INFO     Building CLIP-L                  flux_utils.py:163
                    INFO     Loading state dict from          flux_utils.py:259
                             /home/Ubuntu/Downloads/clip_l.sa                  
                             fetensors                                         
2024-11-05 23:46:53 INFO     Loaded CLIP-L: <All keys matched flux_utils.py:262
                             successfully>                                     
                    INFO     Loading state dict from          flux_utils.py:314
                             /home/Ubuntu/Downloads/t5xxl_fp1                  
                             6.safetensors                                     
2024-11-05 23:47:03 INFO     Loaded T5xxl: <All keys matched  flux_utils.py:317
                             successfully>                                     
2024-11-05 23:47:06 INFO     [Dataset 0]                     train_util.py:2515
                    INFO     caching Text Encoder outputs    train_util.py:1231
                             with caching strategy.                            
                    INFO     checking cache validity...      train_util.py:1242
100%|████████████████████████████████████████| 28/28 [00:00<00:00, 2241.96it/s]
                    INFO     no Text Encoder outputs to      train_util.py:1269
                             cache                                             
                    INFO     cache Text Encoder outputs for   flux_train.py:242
                             sample prompt:                                    
                             /home/Ubuntu/apps/StableSwarmUI/                  
                             Models/diffusion_models/sample/p                  
                             rompt.txt                                         
2024-11-05 23:47:07 INFO     Checking the state dict:          flux_utils.py:43
                             Diffusers or BFL, dev or schnell                  
                    INFO     Building Flux model dev from BFL flux_utils.py:101
                             checkpoint                                        
                    INFO     Loading state dict from          flux_utils.py:118
                             /home/Ubuntu/Downloads/flux1-dev                  
                             .safetensors                                      
                    INFO     Loaded Flux: <All keys matched   flux_utils.py:137
                             successfully>                                     
FLUX: Gradient checkpointing enabled. CPU offload: False
                    INFO     enable block swap:               flux_train.py:297
                             blocks_to_swap=29                                 
FLUX: Block swap enabled. Swapping 29 blocks, double blocks: 14, single blocks: 30.
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
                    INFO     use Adafactor optimizer |       train_util.py:4764
                             {'scale_parameter': False,                        
                             'relative_step': False,                           
                             'warmup_init': False,                             
                             'weight_decay': 0.01}                             
                    WARNING  because max_grad_norm is set,   train_util.py:4792
                             clip_grad_norm is enabled.                        
                             consider set to 0 /                               
                             max_grad_normが設定されているた                   
                             めclip_grad_normが有効になりま                    
                             す。0に設定して無効にしたほうが                   
                             いいかもしれません                                
                    WARNING  constant_with_warmup will be    train_util.py:4796
                             good /                                            
                             スケジューラはconstant_with_war                   
                             mupが良いかもしれません                           
enable full bf16 training.
running training / 学習開始
  num examples / サンプル数: 28
  num batches per epoch / 1epochのバッチ数: 28
  num epochs / epoch数: 200
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 5600
steps:   0%|                                          | 0/5600 [00:00<?, ?it/s]
epoch 1/200
2024-11-05 23:47:33 INFO     epoch is incremented.            train_util.py:715
                             current_epoch: 0, epoch: 1                        
Traceback (most recent call last):
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train.py", line 993, in <module>
    train(args)
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train.py", line 782, in train
    model_pred = flux(
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_models.py", line 1122, in forward
    img = block(img, vec=vec, pe=pe, txt_attention_mask=txt_attention_mask)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_models.py", line 840, in forward
    return checkpoint(self._forward, x, vec, pe, txt_attention_mask, use_reentrant=False)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
    ret = function(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_models.py", line 805, in _forward
    mod, _ = self.modulation(vec)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_models.py", line 639, in forward
    out = self.lin(nn.functional.silu(vec))[:, None, :].chunk(self.multiplier, dim=-1)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_addmm)
steps:   0%|                                          | 0/5600 [00:30<?, ?it/s]
Traceback (most recent call last):
  File "/home/Ubuntu/apps/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/Ubuntu/apps/kohya_ss/venv/bin/python', '/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train.py', '--config_file', '/home/Ubuntu/apps/StableSwarmUI/Models/diffusion_models/config_dreambooth-20241105-234635.toml']' returned non-zero exit status 1.
23:47:48-037437 INFO     Training has ended.        
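
For anyone reading the traceback: the failing call is F.linear inside Modulation.forward, with the input on cuda:0 but the weight still on the CPU. That is the typical symptom of a swapped-out block not being moved back to the GPU before its forward pass runs. Below is a minimal, illustrative sketch of the general block-swap pattern, not kohya-ss's actual implementation; NaiveBlockSwapper and run_block are made-up names.

```python
import torch
import torch.nn as nn


class NaiveBlockSwapper:
    """Illustrative only: keep swapped-out transformer blocks on the CPU and move
    each one to the GPU just before its forward pass, then back out afterwards."""

    def __init__(self, blocks: list[nn.Module], blocks_to_swap: int, device: str = "cuda"):
        self.blocks = blocks
        self.blocks_to_swap = blocks_to_swap
        self.device = torch.device(device)
        # Blocks selected for swapping start on the CPU; the rest stay resident on the GPU.
        for i, block in enumerate(blocks):
            block.to("cpu" if i < blocks_to_swap else self.device)

    def run_block(self, i: int, *args, **kwargs):
        block = self.blocks[i]
        if i < self.blocks_to_swap:
            block.to(self.device)   # weights must reach cuda:0 before F.linear is called
        out = block(*args, **kwargs)
        if i < self.blocks_to_swap:
            block.to("cpu")         # evict again so the next swapped block has room
        return out
```

If the move to the GPU is skipped for some blocks (for example when blocks_to_swap exceeds what the swapping logic actually handles), a linear layer ends up with a CPU weight and a CUDA input and raises exactly the device-mismatch error shown above.
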
kohya-ss commented 3 weeks ago

Fixed an error in the code. It should work up to 33.

FurkanGozukara commented 3 weeks ago

> Fixed an error in the code. It should work up to 33.

You are the man. I will test it, thank you so much.

FurkanGozukara commented 2 weeks ago

@kohya-ss any chance we could push the limit higher?

I tested it and it works great with 33 blocks.

And with CPU offloading enabled, 8 GB GPUs can now also train; it uses 7 GB VRAM (the settings I mean are sketched below).

When can you merge it into the main branch?
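
For reference, the settings I'm talking about look roughly like this in the TOML config passed to flux_train.py (a minimal sketch; blocks_to_swap and the optimizer settings match the startup log above, while cpu_offload_checkpointing is my assumption for the name of the CPU-offload option and may differ on the branch):

```toml
# Excerpt only; merge into the existing dreambooth config, do not replace it.
gradient_checkpointing = true
cpu_offload_checkpointing = true   # assumed key name for the "CPU offload" shown in the log
blocks_to_swap = 33                # the new maximum after the fix
full_bf16 = true
optimizer_type = "Adafactor"
optimizer_args = ["scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01"]
```
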

kohya-ss commented 1 week ago

I think this issue is fixed.

FurkanGozukara commented 1 week ago

Yes, it's working great, thank you so much.