Training on 2x H100 on Ubuntu is no faster than on 1x H100. What are we doing wrong?

FurkanGozukara commented 1 month ago

When training with batch size 4 on one H100, the speed is 1.27 s/it.

When training with batch size 4 on 2x H100, the speed is 2.05 s/it.

So we got almost no speed boost from multi-GPU training.

Is this expected? I am training the SDXL RealVis XL model at 1024 resolution with no bucketing.

We are using the latest bmaltais Kohya GUI on Ubuntu with the multi-GPU configuration below.

@kohya-ss @bmaltais

Below is the training JSON config:

{
  "adaptive_noise_scale": 0,
  "additional_parameters": "--max_grad_norm=0.0 --no_half_vae --train_text_encoder",
  "async_upload": false,
  "bucket_no_upscale": true,
  "bucket_reso_steps": 64,
  "cache_latents": true,
  "cache_latents_to_disk": true,
  "caption_dropout_every_n_epochs": 0,
  "caption_dropout_rate": 0,
  "caption_extension": "",
  "clip_skip": 1,
  "color_aug": false,
  "dataset_config": "",
  "debiased_estimation_loss": false,
  "dynamo_backend": "no",
  "dynamo_mode": "default",
  "dynamo_use_dynamic": false,
  "dynamo_use_fullgraph": false,
  "enable_bucket": false,
  "epoch": 50,
  "extra_accelerate_launch_args": "",
  "flip_aug": false,
  "full_bf16": true,
  "full_fp16": false,
  "gpu_ids": "1,2",
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": false,
  "huber_c": 0.1,
  "huber_schedule": "snr",
  "huggingface_path_in_repo": "",
  "huggingface_repo_id": "",
  "huggingface_repo_type": "",
  "huggingface_repo_visibility": "",
  "huggingface_token": "",
  "ip_noise_gamma": 0,
  "ip_noise_gamma_random_strength": false,
  "keep_tokens": 0,
  "learning_rate": 8e-06,
  "learning_rate_te": 1e-05,
  "learning_rate_te1": 3e-06,
  "learning_rate_te2": 0,
  "log_tracker_config": "",
  "log_tracker_name": "",
  "log_with": "",
  "logging_dir": "",
  "loss_type": "l2",
  "lr_scheduler": "constant",
  "lr_scheduler_args": "",
  "lr_scheduler_num_cycles": 1,
  "lr_scheduler_power": 1,
  "lr_warmup": 0,
  "main_process_port": 0,
  "masked_loss": false,
  "max_bucket_reso": 2048,
  "max_data_loader_n_workers": 0,
  "max_resolution": "1024,1024",
  "max_timestep": 1000,
  "max_token_length": 75,
  "max_train_epochs": 0,
  "max_train_steps": 0,
  "mem_eff_attn": false,
  "metadata_author": "",
  "metadata_description": "",
  "metadata_license": "",
  "metadata_tags": "",
  "metadata_title": "",
  "min_bucket_reso": 256,
  "min_snr_gamma": 0,
  "min_timestep": 0,
  "mixed_precision": "bf16",
  "model_list": "custom",
  "multi_gpu": true,
  "multires_noise_discount": 0,
  "multires_noise_iterations": 0,
  "no_token_padding": false,
  "noise_offset": 0,
  "noise_offset_random_strength": false,
  "noise_offset_type": "Original",
  "num_cpu_threads_per_process": 4,
  "num_machines": 1,
  "num_processes": 2,
  "optimizer": "Adafactor",
  "optimizer_args": "scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01",
  "output_dir": "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion",
  "output_name": "shoes_test_2",
  "persistent_data_loader_workers": false,
  "pretrained_model_name_or_path": "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion/RealVisXL_V4.0.safetensors",
  "prior_loss_weight": 1,
  "random_crop": false,
  "reg_data_dir": "",
  "resume": "",
  "resume_from_huggingface": "",
  "sample_every_n_epochs": 0,
  "sample_every_n_steps": 0,
  "sample_prompts": "",
  "sample_sampler": "euler_a",
  "save_every_n_epochs": 10,
  "save_every_n_steps": 0,
  "save_last_n_steps": 0,
  "save_last_n_steps_state": 0,
  "save_model_as": "safetensors",
  "save_precision": "bf16",
  "save_state": false,
  "save_state_on_train_end": false,
  "save_state_to_huggingface": false,
  "scale_v_pred_loss_like_noise_pred": false,
  "sdxl": true,
  "seed": 0,
  "shuffle_caption": false,
  "stop_text_encoder_training": 0,
  "train_batch_size": 4,
  "train_data_dir": "/home/Ubuntu/Desktop/shoes_train_datasets/test1/img",
  "v2": false,
  "v_parameterization": false,
  "v_pred_like_loss": 0,
  "vae": "stabilityai/sdxl-vae",
  "vae_batch_size": 8,
  "wandb_api_key": "",
  "wandb_run_name": "",
  "weighted_captions": false,
  "xformers": "xformers"
}

TOML file

bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
clip_skip = 1
dynamo_backend = "no"
epoch = 50
full_bf16 = true
gradient_accumulation_steps = 1
huber_c = 0.1
huber_schedule = "snr"
learning_rate = 8e-6
learning_rate_te1 = 3e-6
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_timestep = 1000
max_token_length = 75
max_train_steps = 1175
min_bucket_reso = 256
mixed_precision = "bf16"
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
optimizer_type = "Adafactor"
output_dir = "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion"
output_name = "shoes_test_2"
pretrained_model_name_or_path = "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion/RealVisXL_V4.0.safetensors"
prior_loss_weight = 1
resolution = "1024,1024"
sample_prompts = "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
train_batch_size = 4
train_data_dir = "/home/Ubuntu/Desktop/shoes_train_datasets/test1/img"
vae = "stabilityai/sdxl-vae"
vae_batch_size = 8
xformers = true

bmaltais commented 1 month ago

You could probably provide a copy of the TOML, since this is what sd-scripts ultimately consumes; it should make it easier for @kohya-ss to troubleshoot without being concerned with the GUI config.

Many users have been complaining about issues with multiple GPUs, so I am curious to learn whether I am doing something wrong in the GUI, such as not handling parameters properly or not allowing needed parameters to be entered.

FurkanGozukara commented 1 month ago

> You could probably provide a copy of the TOML, since this is what sd-scripts ultimately consumes; it should make it easier for @kohya-ss to troubleshoot without being concerned with the GUI config.
>
> Many users have been complaining about issues with multiple GPUs, so I am curious to learn whether I am doing something wrong in the GUI, such as not handling parameters properly or not allowing needed parameters to be entered.

Here it is (identical to the "TOML file" section in the first post above).

FurkanGozukara commented 1 month ago

@aria1th @BootsofLagrangian any ideas?

feffy380 commented 1 month ago

AFAIK batch size is per device, so the effective batch size is 4 x 2 = 8, which is why each iteration takes longer (2.05 s vs. 1.27 s). To get the same global batch size you need to divide the per-device batch size by the number of devices. But this is a ridiculously small batch size considering you're using H100s, and most of your time is being wasted on communication overhead between cards. You should be jacking the batch size way up.
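To make that arithmetic concrete, here is a minimal sketch that converts the two timings from the first post into images per second. It assumes, as stated above, that `train_batch_size` is applied per device; the helper name `images_per_second` is made up for this illustration and is not part of sd-scripts.

```python
def images_per_second(per_device_batch: int, num_devices: int, seconds_per_iter: float) -> float:
    """Images processed per second across all devices (global batch / step time)."""
    return per_device_batch * num_devices / seconds_per_iter


# Timings reported in the first post of this thread.
single = images_per_second(per_device_batch=4, num_devices=1, seconds_per_iter=1.27)
dual = images_per_second(per_device_batch=4, num_devices=2, seconds_per_iter=2.05)
speedup = dual / single

print(f"1x H100: {single:.2f} img/s")
print(f"2x H100: {dual:.2f} img/s")
print(f"speedup: {speedup:.2f}x")
```

Under that assumption the two-GPU run is somewhat faster per image, but far from the 2x that ideal scaling would give.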

FurkanGozukara commented 1 month ago

> AFAIK batch size is per device, so the effective batch size is 4 x 2 = 8, which is why each iteration takes longer (2.05 s vs. 1.27 s). To get the same global batch size you need to divide the per-device batch size by the number of devices. But this is a ridiculously small batch size considering you're using H100s, and most of your time is being wasted on communication overhead between cards. You should be jacking the batch size way up.

I know it is. Each GPU could go up to a maximum batch size of 7 when I tested. It still wouldn't make a difference, since the communication overhead is just crazy. Before this new multi-GPU training system it was way faster: I was doing dual T4 GPU training on Kaggle and there was almost no such communication delay. Moreover, with the new system I could never make it work on Kaggle either.

aria1th commented 1 month ago

A slight performance degradation is expected due to communication overhead; it's normal. It's bottlenecked more by the system hardware itself, which is why everyone is trying to build systems with less of a communication bottleneck, including B100 / B200 etc., as NVIDIA says. Batch size makes a drastic difference, yes, so you must set it as high as your card can handle. But if your system is flawed, like an H100 on NFS storage (wtf?) or a system with poor bandwidth, then you can't get any advantage from it.

GCP always knew that hardware is the most important factor, so you would never hit a bottleneck from that, but if you're using another service provider, you should check these factors...

But if it's 'version dependent' then uhh..... the kohya script does not handle communication; accelerate does...

FurkanGozukara commented 1 month ago

@aria1th this was on the same machine, rented on Massed Compute.

What hardware do I have to check? This speed loss is just huge. Maybe I am doing something wrong?

aria1th commented 1 month ago

Mainboard, storage, RAM, CPU.... a bottleneck can come from various causes.... and you have to check them all first.

FurkanGozukara commented 1 month ago

> Mainboard, storage, RAM, CPU.... a bottleneck can come from various causes.... and you have to check them all first.

I doubt that any of them is the cause. You get a very powerful VM. Also, the single-GPU speed looks about right.

So let's say one of them is the cause: how do I debug it?

bmaltais commented 1 month ago

Have you recently tried the version that used to work fine on the same system? Is it possible the hosting provider has changed the type of machine and that is causing this issue?

If the speed is back up, then you could tell kohya which sd-scripts code base used to work best, and he might be able to pinpoint where the speed issue is coming from.

FurkanGozukara commented 1 month ago

> Have you recently tried the version that used to work fine on the same system? Is it possible the hosting provider has changed the type of machine and that is causing this issue?
>
> If the speed is back up, then you could tell kohya which sd-scripts code base used to work best, and he might be able to pinpoint where the speed issue is coming from.

It was a very long time ago that I used dual-GPU training successfully. I have a video from 7 months ago :D Maybe I can try.

BootsofLagrangian commented 1 month ago

Are your H100s connected via NVLink, or just PCIe? If it is PCIe, the slowdown comes from the PCIe communication bottleneck.

FurkanGozukara commented 1 month ago

> Are your H100s connected via NVLink, or just PCIe? If it is PCIe, the slowdown comes from the PCIe communication bottleneck.

Just asked them, let's see what they say. Can we check it somehow on the machine, with a command etc.?
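One way to check this from the machine itself is `nvidia-smi topo -m`, which prints a GPU-to-GPU connection matrix (`nvidia-smi nvlink --status` lists the individual links). In that matrix `NV#` means the pair is connected through # NVLinks, while `PIX`, `PXB`, `PHB`, `NODE` and `SYS` are PCIe or system paths. The sketch below wraps the command and summarizes the matrix; the helper names and the parsing are illustrative assumptions about the text layout, not an official API, so reading the raw output by eye is equally valid.

```python
import re
import shutil
import subprocess


def gpu_link_types(topo_text: str) -> set:
    """Return the set of GPU-to-GPU link labels found in `nvidia-smi topo -m` text."""
    rows = [line.split() for line in topo_text.splitlines()]
    # Data rows start with a GPU name and contain an "X" (self) cell;
    # the header row and the legend lines do not match both conditions.
    gpu_rows = [r for r in rows if r and re.fullmatch(r"GPU\d+", r[0]) and "X" in r[1:]]
    n = len(gpu_rows)
    links = set()
    for row in gpu_rows:
        # The first n cells after the row name are the GPU columns;
        # later cells (NICs, CPU/NUMA affinity) are ignored.
        for cell in row[1:1 + n]:
            if cell != "X":
                links.add(cell)
    return links


def uses_nvlink(topo_text: str) -> bool:
    """True only if every GPU pair is reported as NVLink-connected (NV#)."""
    links = gpu_link_types(topo_text)
    return bool(links) and all(label.startswith("NV") for label in links)


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
        if result.returncode == 0:
            print("link types:", gpu_link_types(result.stdout))
            print("all NVLink:", uses_nvlink(result.stdout))
    else:
        print("nvidia-smi not found on PATH")
```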

FurkanGozukara commented 1 month ago

> Are your H100s connected via NVLink, or just PCIe? If it is PCIe, the slowdown comes from the PCIe communication bottleneck.

OK, it turns out all are PCIe. So I assume we can't get any better, right?

BootsofLagrangian commented 1 month ago

> > Are your H100s connected via NVLink, or just PCIe? If it is PCIe, the slowdown comes from the PCIe communication bottleneck.
>
> OK, it turns out all are PCIe. So I assume we can't get any better, right?

Okay, so there is a hardware bottleneck. I think you can still get a faster total training time using two H100s, just not a faster time per step.

I.e., one H100: 1.27 s/it vs. two H100s: 2.05 s/it. To process the same 8 images, one H100 needs two steps (2 x 1.27 = 2.54 s), while two DDP H100s need one step (2.05 s), so DDP is still faster overall.

If you have the budget to buy NVLink, it is the faster way to speed up your H100s. If you don't want to buy it, XD

Additionally, the slowdown due to communication is not your fault. The H100 simply has far higher memory bandwidth than PCIe, e.g. H100 (2 TB/s) vs. PCIe 4.0 x16 (about 32 GB/s per direction; 16 GT/s is the per-lane rate).
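For a rough sense of scale, the sketch below estimates an idealized lower bound on the time DDP needs to synchronize gradients each step with a ring all-reduce, where every GPU sends 2(N-1)/N times the gradient size. All the inputs are assumptions for illustration: roughly 2.6 billion trainable parameters, 2 bytes per gradient (bf16, matching `full_bf16`), and the full 32 GB/s PCIe 4.0 x16 rate. Real transfers are usually slower than this, for example when traffic is staged through host memory, and part of the communication can overlap with the backward pass.

```python
def ring_allreduce_seconds(num_params: float, bytes_per_param: int,
                           num_gpus: int, link_bytes_per_s: float) -> float:
    """Idealized time for one ring all-reduce of all gradients.

    Each GPU sends 2 * (N - 1) / N times the gradient size over its link.
    Latency, protocol overhead and compute overlap are ignored.
    """
    grad_bytes = num_params * bytes_per_param
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic_per_gpu / link_bytes_per_s


# Assumed values, not measured on the machine in this thread.
ASSUMED_PARAMS = 2.6e9        # approximate trainable parameter count
BF16_BYTES = 2                # bytes per gradient element in bf16
PCIE4_X16_BYTES_PER_S = 32e9  # about 32 GB/s per direction

t_pcie = ring_allreduce_seconds(ASSUMED_PARAMS, BF16_BYTES, 2, PCIE4_X16_BYTES_PER_S)
print(f"idealized gradient sync over PCIe 4.0 x16: {t_pcie:.2f} s per step")
```

Swapping in a different `link_bytes_per_s` shows how strongly this floor depends on the interconnect.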

FurkanGozukara commented 1 month ago

@BootsofLagrangian it's not like I purchased them; I am using them on Massed Compute :)

They said they have SXM4 A100s. I will test the script there. The speed is not supposed to degrade like this on them. We will see :)

BootsofLagrangian commented 1 month ago

> @BootsofLagrangian it's not like I purchased them; I am using them on Massed Compute :)
>
> They said they have SXM4 A100s. I will test the script there. The speed is not supposed to degrade like this on them. We will see :)

Most SXM4 systems run on an interconnect (NVLink, NVSwitch), so seeing no degradation there is natural, but most PCIe systems do not have one. PCIe GPUs need an external interlink device.

FurkanGozukara commented 1 month ago

Started a machine, will try to test now.

FurkanGozukara commented 1 month ago

@kohya-ss the training fails on an SXM4 machine :(

When 1 GPU is used, it works.

Here is the speed at batch size 7:

When I try 2 GPUs as below, it fails.

I tested all of the dynamo backends; all failed.

00:43:55-975677 INFO     Start training Dreambooth...                           
00:43:55-976776 INFO     Validating lr scheduler arguments...                   
00:43:55-977355 INFO     Validating optimizer arguments...                      
00:43:55-977896 INFO     Validating /home/Ubuntu/Desktop/results existence and  
                         writability... SUCCESS                                 
00:43:55-978494 INFO     Validating                                             
                         /home/Ubuntu/Downloads/RealVisXL_V4.0.safetensors      
                         existence... SUCCESS                                   
00:43:55-979055 INFO     Validating /home/Ubuntu/Desktop/train_imgs existence...
                         SUCCESS                                                
00:43:55-979627 INFO     Validating stabilityai/sdxl-vae existence... SKIPPING: 
                         huggingface.co model                                   
00:43:55-980219 INFO     Folder 1_ohwx man: 1 repeats found                     
00:43:55-981209 INFO     Folder 1_ohwx man: 480 images found                    
00:43:55-981777 INFO     Folder 1_ohwx man: 480 * 1 = 480 steps                 
00:43:55-982305 INFO     Regulatization factor: 1                               
00:43:55-982809 INFO     Total steps: 480                                       
00:43:55-983280 INFO     Train batch size: 7                                    
00:43:55-983730 INFO     Gradient accumulation steps: 1                         
00:43:55-984195 INFO     Epoch: 400                                             
00:43:55-984662 INFO     max_train_steps (480 / 7 / 1 * 400 * 1) = 27429        
00:43:55-985243 INFO     lr_warmup_steps = 0                                    
00:43:55-986084 INFO     Saving training config to                              
                         /home/Ubuntu/Desktop/results/2_gpu_20240723-004355.json
                         ...                                                    
00:43:55-986976 INFO     Executing command:                                     
                         /home/Ubuntu/Desktop/kohya_ss/venv/bin/accelerate      
                         launch --dynamo_backend no --dynamo_mode default       
                         --gpu_ids 0,1 --mixed_precision bf16 --multi_gpu       
                         --num_processes 2 --num_machines 1                     
                         --num_cpu_threads_per_process 4                        
                         /home/Ubuntu/Desktop/kohya_ss/sd-scripts/sdxl_train.py 
                         --config_file                                          
                         /home/Ubuntu/Desktop/results/config_dreambooth-20240723
                         -004355.toml --max_grad_norm=0.0 --no_half_vae         
                         --train_text_encoder --learning_rate_te2=0             
00:43:55-988822 INFO     Command executed.                                      
2024-07-23 00:44:03.295128: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-23 00:44:03.295172: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-23 00:44:03.296055: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-23 00:44:03.301057: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-23 00:44:03.364694: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-23 00:44:03.364751: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-23 00:44:03.367112: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-23 00:44:03.374010: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-23 00:44:03.932149: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-23 00:44:04.062733: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-23 00:44:04 INFO     Loading settings from            train_util.py:3744
                             /home/Ubuntu/Desktop/results/con                   
                             fig_dreambooth-20240723-004355.t                   
                             oml...                                             
                    INFO     /home/Ubuntu/Desktop/results/con train_util.py:3763
                             fig_dreambooth-20240723-004355                     
                    WARNING  clip_skip will be unexpected sdxl_train_util.py:343
                             /                                                  
                             SDXL学習ではclip_skipは動作                        
                             しません                                           
2024-07-23 00:44:04 INFO     prepare tokenizers           sdxl_train_util.py:134
2024-07-23 00:44:04 INFO     Loading settings from            train_util.py:3744
                             /home/Ubuntu/Desktop/results/con                   
                             fig_dreambooth-20240723-004355.t                   
                             oml...                                             
                    INFO     /home/Ubuntu/Desktop/results/con train_util.py:3763
                             fig_dreambooth-20240723-004355                     
                    WARNING  clip_skip will be unexpected sdxl_train_util.py:343
                             /                                                  
                             SDXL学習ではclip_skipは動作                        
                             しません                                           
2024-07-23 00:44:04 INFO     prepare tokenizers           sdxl_train_util.py:134
                    INFO     update token length: 75      sdxl_train_util.py:159
                    INFO     Using DreamBooth method.          sdxl_train.py:144
2024-07-23 00:44:05 INFO     prepare images.                  train_util.py:1572
                    INFO     found directory                  train_util.py:1519
                             /home/Ubuntu/Desktop/train_imgs/                   
                             1_ohwx man contains 480 image                      
                             files                                              
2024-07-23 00:44:05 INFO     update token length: 75      sdxl_train_util.py:159
                    WARNING  No caption file found for 480    train_util.py:1550
                             images. Training will continue                     
                             without captions for these                         
                             images. If class token exists,                     
                             it will be used. /                                 
                             480枚の画像にキャプションファイ                    
                             ルが見つかりませんでした。これら                   
                             の画像についてはキャプションなし                   
                             で学習を続行します。class                          
                             tokenが存在する場合はそれを使い                    
                             ます。                                             
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (10th copy).jpg                                    
                    INFO     Using DreamBooth method.          sdxl_train.py:144
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (11th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (12th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (13th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (14th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1555
                             1_ohwx man/IMG_20230430_134600                     
                             (15th copy).jpg... and 475 more                    
                    INFO     480 train images with repeating. train_util.py:1613
                    INFO     0 reg images.                    train_util.py:1616
                    WARNING  no regularization images /       train_util.py:1621
                             正則化画像が見つかりませんでした                   
                    INFO     [Dataset 0]                      config_util.py:565
                               batch_size: 7                                    
                               resolution: (1024, 1024)                         
                               enable_bucket: False                             
                               network_multiplier: 1.0                          

                               [Subset 0 of Dataset 0]                          
                                 image_dir:                                     
                             "/home/Ubuntu/Desktop/train_imgs                   
                             /1_ohwx man"                                       
                                 image_count: 480                               
                                 num_repeats: 1                                 
                                 shuffle_caption: False                         
                                 keep_tokens: 0                                 
                                 keep_tokens_separator:                         
                                 secondary_separator: None                      
                                 enable_wildcard: False                         
                                 caption_dropout_rate: 0.0                      
                                 caption_dropout_every_n_epoc                   
                             hes: 0                                             
                                 caption_tag_dropout_rate:                      
                             0.0                                                
                                 caption_prefix: None                           
                                 caption_suffix: None                           
                                 color_aug: False                               
                                 flip_aug: False                                
                                 face_crop_aug_range: None                      
                                 random_crop: False                             
                                 token_warmup_min: 1,                           
                                 token_warmup_step: 0,                          
                                 is_reg: False                                  
                                 class_tokens: ohwx man                         
                                 caption_extension: .caption                    

                    INFO     [Dataset 0]                      config_util.py:571
                    INFO     loading image sizes.              train_util.py:853
100%|█████████████████████████████████████| 480/480 [00:00<00:00, 107174.12it/s]
                    INFO     prepare dataset                   train_util.py:861
                    INFO     prepare accelerator               sdxl_train.py:201
accelerator device: cuda:0
                    INFO     loading model for process 0/2 sdxl_train_util.py:30
                    INFO     load StableDiffusion          sdxl_train_util.py:70
                             checkpoint:                                        
                             /home/Ubuntu/Downloads/RealVi                      
                             sXL_V4.0.safetensors                               
                    INFO     building U-Net               sdxl_model_util.py:192
                    INFO     loading U-Net from           sdxl_model_util.py:196
                             checkpoint                                         
                    INFO     prepare images.                  train_util.py:1572
                    INFO     found directory                  train_util.py:1519
                             /home/Ubuntu/Desktop/train_imgs/                   
                             1_ohwx man contains 480 image                      
                             files                                              
                    WARNING  No caption file found for 480    train_util.py:1550
                             images. Training will continue                     
                             without captions for these                         
                             images. If class token exists,                     
                             it will be used. /                                 
                             480枚の画像にキャプションファイ                    
                             ルが見つかりませんでした。これら                   
                             の画像についてはキャプションなし                   
                             で学習を続行します。class                          
                             tokenが存在する場合はそれを使い                    
                             ます。                                             
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (10th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (11th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (12th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (13th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
                             1_ohwx man/IMG_20230430_134600                     
                             (14th copy).jpg                                    
                    WARNING  /home/Ubuntu/Desktop/train_imgs/ train_util.py:1555
                             1_ohwx man/IMG_20230430_134600                     
                             (15th copy).jpg... and 475 more                    
                    INFO     480 train images with repeating. train_util.py:1613
                    INFO     0 reg images.                    train_util.py:1616
                    WARNING  no regularization images /       train_util.py:1621
                             正則化画像が見つかりませんでした                   
                    INFO     [Dataset 0]                      config_util.py:565
                               batch_size: 7                                    
                               resolution: (1024, 1024)                         
                               enable_bucket: False                             
                               network_multiplier: 1.0                          

                               [Subset 0 of Dataset 0]                          
                                 image_dir:                                     
                             "/home/Ubuntu/Desktop/train_imgs                   
                             /1_ohwx man"                                       
                                 image_count: 480                               
                                 num_repeats: 1                                 
                                 shuffle_caption: False                         
                                 keep_tokens: 0                                 
                                 keep_tokens_separator:                         
                                 secondary_separator: None                      
                                 enable_wildcard: False                         
                                 caption_dropout_rate: 0.0                      
                                 caption_dropout_every_n_epoc                   
                             hes: 0                                             
                                 caption_tag_dropout_rate:                      
                             0.0                                                
                                 caption_prefix: None                           
                                 caption_suffix: None                           
                                 color_aug: False                               
                                 flip_aug: False                                
                                 face_crop_aug_range: None                      
                                 random_crop: False                             
                                 token_warmup_min: 1,                           
                                 token_warmup_step: 0,                          
                                 is_reg: False                                  
                                 class_tokens: ohwx man                         
                                 caption_extension: .caption                    

                    INFO     [Dataset 0]                      config_util.py:571
                    INFO     loading image sizes.              train_util.py:853
100%|█████████████████████████████████████| 480/480 [00:00<00:00, 103345.10it/s]
                    INFO     prepare dataset                   train_util.py:861
                    INFO     prepare accelerator               sdxl_train.py:201
accelerator device: cuda:1
2024-07-23 00:44:06 INFO     U-Net: <All keys matched     sdxl_model_util.py:202
                             successfully>                                      
                    INFO     building text encoders       sdxl_model_util.py:205
                    INFO     loading text encoders from   sdxl_model_util.py:258
                             checkpoint                                         
                    INFO     text encoder 1: <All keys    sdxl_model_util.py:272
                             matched successfully>                              
                    INFO     text encoder 2: <All keys    sdxl_model_util.py:276
                             matched successfully>                              
                    INFO     building VAE                 sdxl_model_util.py:279
                    INFO     loading VAE from checkpoint  sdxl_model_util.py:284
                    INFO     VAE: <All keys matched       sdxl_model_util.py:287
                             successfully>                                      
                    INFO     load VAE: stabilityai/sdxl-vae   model_util.py:1268
                    INFO     additional VAE loaded        sdxl_train_util.py:128
2024-07-23 00:44:07 INFO     loading model for process 1/2 sdxl_train_util.py:30
                    INFO     load StableDiffusion          sdxl_train_util.py:70
                             checkpoint:                                        
                             /home/Ubuntu/Downloads/RealVi                      
                             sXL_V4.0.safetensors                               
                    INFO     building U-Net               sdxl_model_util.py:192
                    INFO     loading U-Net from           sdxl_model_util.py:196
                             checkpoint                                         
2024-07-23 00:44:08 INFO     U-Net: <All keys matched     sdxl_model_util.py:202
                             successfully>                                      
                    INFO     building text encoders       sdxl_model_util.py:205
                    INFO     loading text encoders from   sdxl_model_util.py:258
                             checkpoint                                         
2024-07-23 00:44:09 INFO     text encoder 1: <All keys    sdxl_model_util.py:272
                             matched successfully>                              
                    INFO     text encoder 2: <All keys    sdxl_model_util.py:276
                             matched successfully>                              
                    INFO     building VAE                 sdxl_model_util.py:279
                    INFO     loading VAE from checkpoint  sdxl_model_util.py:284
                    INFO     VAE: <All keys matched       sdxl_model_util.py:287
                             successfully>                                      
                    INFO     load VAE: stabilityai/sdxl-vae   model_util.py:1268
                    INFO     additional VAE loaded        sdxl_train_util.py:128
Disable Diffusers' xformers
                    INFO     Enable xformers for U-Net        train_util.py:2660
2024-07-23 00:44:09 INFO     Enable xformers for U-Net        train_util.py:2660
                    INFO     [Dataset 0]                      train_util.py:2079
                    INFO     caching latents.                  train_util.py:974
                    INFO     checking cache validity...        train_util.py:984
100%|█████████████████████████████████████| 480/480 [00:00<00:00, 946083.61it/s]
                    INFO     [Dataset 0]                      train_util.py:2079
                    INFO     caching latents.                  train_util.py:974
                    INFO     checking cache validity...        train_util.py:984
100%|███████████████████████████████████████| 480/480 [00:00<00:00, 2265.38it/s]
2024-07-23 00:44:10 INFO     caching latents...               train_util.py:1021
0it [00:00, ?it/s]
enable text encoder training
2024-07-23 00:44:10 INFO     use Adafactor optimizer |        train_util.py:4047
                             {'scale_parameter': False,                         
                             'relative_step': False,                            
                             'warmup_init': False,                              
                             'weight_decay': 0.01}                              
                    WARNING  constant_with_warmup will be     train_util.py:4079
                             good /                                             
                             スケジューラはconstant_with_warm                   
                             upが良いかもしれません                             
train unet: True, text_encoder1: True, text_encoder2: False
number of models: 2
number of trainable parameters: 2690524164
prepare optimizer, data loader etc.
                    INFO     use Adafactor optimizer |        train_util.py:4047
                             {'scale_parameter': False,                         
                             'relative_step': False,                            
                             'warmup_init': False,                              
                             'weight_decay': 0.01}                              
                    WARNING  constant_with_warmup will be     train_util.py:4079
                             good /                                             
                             スケジューラはconstant_with_warm                   
                             upが良いかもしれません                             
enable full bf16 training.
running training / 学習開始
  num examples / サンプル数: 480
  num batches per epoch / 1epochのバッチ数: 35
  num epochs / epoch数: 784
  batch size per device / バッチサイズ: 7
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 27429
steps:   0%|                                          | 0/27429 [00:00<?, ?it/s]
epoch 1/784
Traceback (most recent call last):
  File "/home/Ubuntu/Desktop/kohya_ss/sd-scripts/sdxl_train.py", line 818, in <module>
    train(args)
  File "/home/Ubuntu/Desktop/kohya_ss/sd-scripts/sdxl_train.py", line 591, in train
    noise_pred = unet(noisy_latents, timesteps, text_embedding, vector_embedding)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/Ubuntu/Desktop/kohya_ss/sd-scripts/library/sdxl_original_unet.py", line 1079, in forward
    t_emb = get_timestep_embedding(timesteps, self.model_channels, downscale_freq_shift=0)  # , repeat_only=False)
  File "/home/Ubuntu/Desktop/kohya_ss/sd-scripts/library/sdxl_original_unet.py", line 257, in get_timestep_embedding
    exponent = exponent / (half_dim - downscale_freq_shift)
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7a71e4e84617 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7a71e4e3f98d in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7a71e4f35c38 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7a717413c8b0 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7a71741406d8 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7a7174156f70 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7a7174157278 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7a71e44dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7a7210094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7a7210126850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7a71e4e84617 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7a71e4e3f98d in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7a71e4f35c38 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7a717413c8b0 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7a71741406d8 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7a7174156f70 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7a7174157278 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7a71e44dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7a7210094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7a7210126850 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-07-23 00:44:15,053] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 14854 closing signal SIGTERM
[2024-07-23 00:44:15,618] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 14855) of binary: /home/Ubuntu/Desktop/kohya_ss/venv/bin/python
Traceback (most recent call last):
  File "/home/Ubuntu/Desktop/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/Ubuntu/Desktop/kohya_ss/sd-scripts/sdxl_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-23_00:44:15
  host      : 0229-dsm-prxmx30035
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 14855)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 14855
============================================================
00:44:17-252060 INFO     Training has ended.
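
For anyone comparing s/it numbers between runs: the step counts in the log above are consistent with the batch size being applied per device, so one iteration on two GPUs consumes twice as many images as one iteration on a single GPU. Below is a minimal sketch of that arithmetic. The function names are made up for illustration and are not part of sd-scripts. The throughput comparison assumes the "batch size 4" timings at the top of the thread were also per device, and that gradient accumulation is 1 as in the log.

```python
import math


def batches_per_epoch(num_examples, per_device_batch, num_processes):
    """Steps per epoch when every process consumes its own batch each step."""
    return math.ceil(num_examples / (per_device_batch * num_processes))


def epochs_for_steps(total_steps, steps_per_epoch):
    """Number of epochs needed to reach a fixed total step count."""
    return math.ceil(total_steps / steps_per_epoch)


def samples_per_second(per_device_batch, num_processes, seconds_per_it):
    """Images processed per second across all GPUs."""
    return per_device_batch * num_processes / seconds_per_it


# Values taken from the log: 480 images, batch size 7 per device, 2 processes.
steps = batches_per_epoch(480, 7, 2)      # matches "num batches per epoch: 35"
epochs = epochs_for_steps(27429, steps)   # matches "num epochs: 784"

# Timings from the top of the thread (assumed to be per-device batch size 4).
single_gpu = samples_per_second(4, 1, 1.27)
dual_gpu = samples_per_second(4, 2, 2.05)
print(steps, epochs, round(single_gpu, 2), round(dual_gpu, 2))
```

Under those assumptions the two-GPU run processes about 3.90 images/s versus about 3.15 images/s on one GPU, roughly a 1.24x gain. That is still far from 2x, but it is not "the same speed" as the raw s/it figures suggest.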

WarAnakin commented 1 month ago

I've been training multi-GPU for months using both the GUI and the CLI. I think this issue might be related to the CUDA version itself more than kohya. I've had this happen to me once in the past, where it couldn't register some specific CUDA services. I mostly use RunPod and I don't have any issues whether it's H100 NVL, PCIe, or SXM. Most of the time I train with 6x L40S, since it is faster, cheaper, and has more memory than 3x H100. What I'd like to know is how to enable sparsity, since sparsity doubles the performance of FP operations.

bluvoll commented 1 month ago

The lack of FlashAttention 3 is rearing its ugly head; we don't even have TMA for the H100 in kohya, among other things.

FurkanGozukara commented 1 month ago

> I've been training multi-GPU for months using both the GUI and the CLI. I think this issue might be related to the CUDA version itself more than kohya. I've had this happen to me once in the past, where it couldn't register some specific CUDA services. I mostly use RunPod and I don't have any issues whether it's H100 NVL, PCIe, or SXM. Most of the time I train with 6x L40S, since it is faster, cheaper, and has more memory than 3x H100. What I'd like to know is how to enable sparsity, since sparsity doubles the performance of FP operations.

Multi-GPU training worked on a PCIe machine on Massed Compute, but with SXM I got the above error. Do you know how to fix it? How do you set up your accelerator?

What CUDA version do you have on your SXM machine?

WarAnakin commented 1 month ago

@bmaltais there is nothing wrong with your interface or kohya's script. You've done a great job, although some descriptions you have in there are not totally accurate, but that's not your fault.

Disty0 commented 1 month ago

> CUDA error: uncorrectable ECC error encountered

This is a hardware error. You should contact the compute provider because you've got a faulty node.
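
One way to gather evidence for the provider is to read the ECC counters that `nvidia-smi` reports per GPU. The sketch below is only an illustration, not something from this thread: the helper names and the sample report are hypothetical, and it assumes a driver whose `nvidia-smi --query-gpu` supports the `ecc.errors.uncorrected.volatile.total` and `ecc.errors.uncorrected.aggregate.total` fields.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order the parser expects them.
QUERY_FIELDS = (
    "index,name,"
    "ecc.errors.uncorrected.volatile.total,"
    "ecc.errors.uncorrected.aggregate.total"
)

# Hypothetical sample output, used only to show the parser without a GPU.
SAMPLE_REPORT = (
    "0, NVIDIA H100 80GB HBM3, 0, 0\n"
    "1, NVIDIA H100 80GB HBM3, 1, 3\n"
    "2, NVIDIA H100 80GB HBM3, [N/A], [N/A]\n"
)


def query_ecc_report():
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def to_count(value):
    """Convert a counter to int; non-numeric values (ECC off or unsupported) become None."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_ecc_report(csv_text):
    """Turn the CSV text into one dict per GPU."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        index, name, volatile, aggregate = (f.strip() for f in fields)
        rows.append({
            "index": int(index),
            "name": name,
            "volatile": to_count(volatile),
            "aggregate": to_count(aggregate),
        })
    return rows


def faulty_gpus(rows):
    """Return the indices of GPUs that report any uncorrectable ECC errors."""
    return [
        r["index"] for r in rows
        if (r["volatile"] or 0) > 0 or (r["aggregate"] or 0) > 0
    ]
```

On a real node, `faulty_gpus(parse_ecc_report(query_ecc_report()))` would list the GPUs to report. Note that when a subset of GPUs is selected for training, the `cuda:N` index seen inside the process may not match the `nvidia-smi` index.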

FurkanGozukara commented 1 month ago

> CUDA error: uncorrectable ECC error encountered
> This is a hardware error. You should contact the compute provider because you've got a faulty node.

Thanks, I did. That could be the reason.

kohya-ss / sd-scripts

Training on 2x H100 on Ubuntu and speed is same as 1x H100 what we are doing wrong? #1434