Multi GPU Training Fails - RuntimeError: Invalid scalar type in dist._broadcast_coalesced

FurkanGozukara commented 10 months ago

I have a subscriber with dual RTX 4060 Ti 16 GB GPUs.

He is on Windows 10 with Python 3.10.9, a fresh install.

We set the Hugging Face accelerate default_config.yaml as follows:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The Accelerator in train_util.py is created as follows:

    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=log_with,
        project_dir=logging_dir,
        kwargs_handlers=[InitProcessGroupKwargs(backend="gloo")],
    )
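
The snippet above hard-codes backend="gloo". As a minimal sketch, the backend could instead be chosen per platform, so the same script still uses NCCL on Linux. The helper name pick_ddp_backend is hypothetical and not part of sd-scripts or accelerate. It assumes that Linux machines have a CUDA build of PyTorch with NCCL, and it falls back to gloo everywhere else because PyTorch's Windows builds do not include NCCL.

```python
import sys


def pick_ddp_backend(platform: str = sys.platform) -> str:
    """Return a torch.distributed backend name for a sys.platform string.

    Hypothetical helper: "nccl" on Linux (assumed to have a CUDA build),
    "gloo" on every other platform, including Windows ("win32").
    """
    return "nccl" if platform.startswith("linux") else "gloo"


if __name__ == "__main__":
    # The result could be passed on as
    # InitProcessGroupKwargs(backend=pick_ddp_backend()).
    print(pick_ddp_backend())
```

This only selects the backend name. It does not change which tensor dtypes the chosen backend can broadcast.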

We are getting the below error. How can we fix it?

                         --ddp_gradient_as_bucket_view
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [pc]:29500 (system error: 10049 - unknown error).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [pc]:29500 (system error: 10049 - unknown error).
prepare tokenizers
prepare tokenizers
Using DreamBooth method.
Using DreamBooth method.
prepare images.
found directory C:\Users\user\Desktop\model\img\40_ohwx man contains 50 image files
No caption file found for 50 images. Training will continue without captions for these images. If class token exists, it will be used. / 50枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を 続行します。class tokenが存在する場合はそれを使います。
C:\Users\user\Desktop\model\img\40_ohwx man\man_10001.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10002.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10003.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10004.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10005.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10006.jpg... and 45 more
2000 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (1024, 1024)
  enable_bucket: False

  [Subset 0 of Dataset 0]
    image_dir: "C:\Users\user\Desktop\model\img\40_ohwx man"
    image_count: 50
    num_repeats: 40
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator:
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: ohwx man
    caption_extension: .caption

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 1612.82it/s]
prepare dataset
prepare accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [pc]:29500 (system error: 10049 - unknown error).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [pc]:29500 (system error: 10049 - unknown error).
prepare images.
found directory C:\Users\user\Desktop\model\img\40_ohwx man contains 50 image files
No caption file found for 50 images. Training will continue without captions for these images. If class token exists, it will be used. / 50枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を 続行します。class tokenが存在する場合はそれを使います。
C:\Users\user\Desktop\model\img\40_ohwx man\man_10001.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10002.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10003.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10004.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10005.jpg
C:\Users\user\Desktop\model\img\40_ohwx man\man_10006.jpg... and 45 more
2000 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (1024, 1024)
  enable_bucket: False

  [Subset 0 of Dataset 0]
    image_dir: "C:\Users\user\Desktop\model\img\40_ohwx man"
    image_count: 50
    num_repeats: 40
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator:
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: ohwx man
    caption_extension: .caption

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 6247.66it/s]
prepare dataset
prepare accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [pc]:29500 (system error: 10049 - unknown error).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [pc]:29500 (system error: 10049 - unknown error).
loading model for process 0/2
load StableDiffusion checkpoint: C:/Users/user/Downloads/sd_xl_base_1.0.safetensors
building U-Net
loading U-Net from checkpoint
U-Net:  <All keys matched successfully>
building text encoders
loading text encoders from checkpoint
text encoder 1: <All keys matched successfully>
text encoder 2: <All keys matched successfully>
building VAE
loading VAE from checkpoint
VAE: <All keys matched successfully>
load VAE: stabilityai/sdxl-vae
additional VAE loaded
loading model for process 1/2
load StableDiffusion checkpoint: C:/Users/user/Downloads/sd_xl_base_1.0.safetensors
building U-Net
loading U-Net from checkpoint
U-Net:  <All keys matched successfully>
building text encoders
loading text encoders from checkpoint
text encoder 1: <All keys matched successfully>
text encoder 2: <All keys matched successfully>
building VAE
loading VAE from checkpoint
VAE: <All keys matched successfully>
load VAE: stabilityai/sdxl-vae
additional VAE loaded
Disable Diffusers' xformers
Enable xformers for U-Net
Enable xformers for U-Net
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
[Dataset 0]
caching latents.
checking cache validity...
  0%|                                                                                           | 0/50 [00:00<?, ?it/s][Dataset 0]
caching latents.
checking cache validity...
100%|██████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<?, ?it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 333.33it/s]
caching latents...
0it [00:00, ?it/s]
enable text encoder training
train unet: True, text_encoder1: True, text_encoder2: False
number of models: 2
number of trainable parameters: 2690524164
prepare optimizer, data loader etc.
use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False, 'weight_decay': 0.01}
constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません
use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False, 'weight_decay': 0.01}
constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません
enable full bf16 training.
Traceback (most recent call last):
  File "C:\Users\user\kohya_ss\sdxl_train.py", line 782, in <module>
    train(args)
  File "C:\Users\user\kohya_ss\sdxl_train.py", line 399, in train
    unet = accelerator.prepare(unet)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1284, in prepare
    result = tuple(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1285, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1090, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1429, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\parallel\distributed.py", line 676, in __init__
    _sync_module_states(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\distributed\utils.py", line 142, in _sync_module_states
    _sync_params_and_buffers(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\distributed\utils.py", line 160, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type
Traceback (most recent call last):
  File "C:\Users\user\kohya_ss\sdxl_train.py", line 782, in <module>
    train(args)
  File "C:\Users\user\kohya_ss\sdxl_train.py", line 399, in train
    unet = accelerator.prepare(unet)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1284, in prepare
    result = tuple(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1285, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1090, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1429, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\parallel\distributed.py", line 676, in __init__
    _sync_module_states(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\distributed\utils.py", line 142, in _sync_module_states
    _sync_params_and_buffers(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\distributed\utils.py", line 160, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8908) of binary: C:\Users\user\kohya_ss\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\user\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./sdxl_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-13_00:35:21
  host      : pc
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2948)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-13_00:35:21
  host      : pc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8908)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

kohya-ss commented 10 months ago

This error seems to be the same as this issue: https://github.com/NVIDIA/NeMo/issues/5485

Please verify that the GPU version of PyTorch is installed.
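
One way to run that check from inside the venv is sketched below. The function name describe_torch_build and the dictionary keys are made up for illustration. The torch calls it relies on (torch.version.cuda, torch.cuda.is_available, torch.cuda.device_count, torch.distributed.is_gloo_available, torch.distributed.is_nccl_available) are standard PyTorch attributes. A CPU-only build reports None for the CUDA version.

```python
import importlib.util


def describe_torch_build() -> dict:
    """Collect basic facts about the installed PyTorch build (illustrative helper)."""
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch
    import torch.distributed as dist

    info["torch_version"] = torch.__version__
    # None here means a CPU-only wheel was installed.
    info["built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    # The backend checks only exist when distributed support is compiled in.
    has_dist = dist.is_available()
    info["gloo_available"] = has_dist and dist.is_gloo_available()
    info["nccl_available"] = has_dist and dist.is_nccl_available()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

For a dual-GPU machine, gpu_count should be 2 for multi-GPU training to have a chance of working.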

FurkanGozukara commented 10 months ago

Thank you. Single-GPU training works, so I think it is installed correctly.

Here is the venv:

Microsoft Windows [Version 10.0.22631.2861]
(c) Microsoft Corporation. All rights reserved.

C:\Users\user\kohya_ss\venv\Scripts>activate

(venv) C:\Users\user\kohya_ss\venv\Scripts>pip freeze
absl-py==2.0.0
accelerate==0.23.0
aiofiles==23.2.1
aiohttp==3.9.1
aiosignal==1.3.1
altair==4.2.2
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
anyio==4.2.0
appdirs==1.4.4
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
bitsandbytes==0.41.1
cachetools==5.3.2
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.7
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.2.0
cycler==0.12.1
dadaptation==3.1
diffusers==0.24.0
docker-pycreds==0.4.0
easygui==0.98.3
einops==0.6.0
entrypoints==0.4
exceptiongroup==1.2.0
fairscale==0.4.13
fastapi==0.109.0
ffmpy==0.3.1
filelock==3.9.0
flatbuffers==23.5.26
fonttools==4.47.2
frozenlist==1.4.1
fsspec==2023.12.2
ftfy==6.1.1
gast==0.5.4
gitdb==4.0.11
GitPython==3.1.41
google-auth==2.26.2
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
gradio==3.36.1
gradio_client==0.8.0
grpcio==1.60.0
h11==0.14.0
h5py==3.10.0
httpcore==1.0.2
httpx==0.26.0
huggingface-hub==0.19.4
humanfriendly==10.0
idna==3.4
importlib-metadata==7.0.1
invisible-watermark==0.2.0
Jinja2==3.1.2
jsonschema==4.20.0
jsonschema-specifications==2023.12.1
keras==2.14.0
kiwisolver==1.4.5
libclang==16.0.6
-e git+https://github.com/bmaltais/kohya_ss.git@842d9c7018288d5c3d6e01adc0d7f886b70252b6#egg=library
lightning-utilities==0.10.0
linkify-it-py==2.0.2
lion-pytorch==0.0.6
lycoris-lora==2.0.2
Markdown==3.5.2
markdown-it-py==2.2.0
MarkupSafe==2.1.3
matplotlib==3.8.2
mdit-py-plugins==0.3.3
mdurl==0.1.2
ml-dtypes==0.2.0
mpmath==1.3.0
multidict==6.0.4
networkx==3.0
numpy==1.24.1
oauthlib==3.2.2
omegaconf==2.3.0
onnx==1.14.1
onnxruntime-gpu==1.16.0
open-clip-torch==2.20.0
opencv-python==4.7.0.68
opt-einsum==3.3.0
orjson==3.9.10
packaging==23.2
pandas==2.1.4
pathtools==0.1.2
Pillow==9.3.0
prodigyopt==1.0
protobuf==3.20.3
psutil==5.9.7
pyasn1==0.5.1
pyasn1-modules==0.3.0
pydantic==2.5.3
pydantic_core==2.14.6
pydub==0.25.1
Pygments==2.17.2
pyparsing==3.1.1
pyreadline3==3.4.1
python-dateutil==2.8.2
python-multipart==0.0.6
pytorch-lightning==1.9.0
pytz==2023.3.post1
PyWavelets==1.5.0
PyYAML==6.0.1
referencing==0.32.1
regex==2023.12.25
requests==2.28.1
requests-oauthlib==1.3.1
rich==13.4.1
rpds-py==0.17.1
rsa==4.9
safetensors==0.3.1
scipy==1.11.4
semantic-version==2.10.0
sentencepiece==0.1.99
sentry-sdk==1.39.2
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
starlette==0.35.1
sympy==1.12
tensorboard==2.14.1
tensorboard-data-server==0.7.2
tensorflow==2.14.0
tensorflow-estimator==2.14.0
tensorflow-intel==2.14.0
tensorflow-io-gcs-filesystem==0.31.0
termcolor==2.4.0
timm==0.6.12
tk==0.1.0
tokenizers==0.13.3
toml==0.10.2
toolz==0.12.0
torch==2.0.1+cu118
torchmetrics==1.3.0
torchvision==0.15.2+cu118
tqdm==4.66.1
transformers==4.30.2
typing_extensions==4.9.0
tzdata==2023.4
uc-micro-py==1.0.2
urllib3==1.26.13
uvicorn==0.25.0
voluptuous==0.13.1
wandb==0.15.11
wcwidth==0.2.13
websockets==11.0.3
Werkzeug==3.0.1
wrapt==1.14.1
xformers==0.0.21
yarl==1.9.4
zipp==3.17.0

(venv) C:\Users\user\kohya_ss\venv\Scripts>

kohya-ss commented 10 months ago

The user seems to have modified the script to use gloo on Windows. According to this issue, https://github.com/huggingface/accelerate/issues/141, torch.distributed apparently needs to be initialized. Unfortunately, I don't know how to initialize it or how to use gloo on Windows...
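
For reference, here is a minimal single-process sketch of initializing torch.distributed with gloo and then attempting a bfloat16 broadcast. Nothing in this thread confirms the cause of the error. The sketch only tests one assumption: that the gloo backend in a given PyTorch build may reject bfloat16 tensors, which would fit the log, where "enable full bf16 training." is printed shortly before DDP's parameter broadcast fails. The function name probe_gloo_bf16_broadcast is hypothetical, and a file-based rendezvous is used so that no hostname or port lookup is involved.

```python
import importlib.util
import tempfile
from pathlib import Path


def probe_gloo_bf16_broadcast() -> str:
    """Try a bfloat16 broadcast over gloo in a one-process group and report the outcome."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"

    import torch
    import torch.distributed as dist

    if not (dist.is_available() and dist.is_gloo_available()):
        return "gloo backend is not available in this build"
    if dist.is_initialized():
        return "a process group already exists; run this in a fresh interpreter"

    # File-based rendezvous: the file must not exist before init.
    store_dir = tempfile.mkdtemp()
    init_method = (Path(store_dir) / "rendezvous").as_uri()
    try:
        dist.init_process_group(
            "gloo", init_method=init_method, rank=0, world_size=1
        )
    except Exception as exc:  # report any init problem instead of crashing
        return f"init failed: {exc}"

    try:
        dist.broadcast(torch.zeros(4, dtype=torch.bfloat16), src=0)
        return "bf16 broadcast over gloo succeeded"
    except RuntimeError as exc:
        return f"bf16 broadcast over gloo failed: {exc}"
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(probe_gloo_bf16_broadcast())
```

If the probe reports a failure for bfloat16 but succeeds after switching the dtype to torch.float32, that would point at the dtype rather than at the process-group setup. A real two-GPU run would still need one process per rank.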

FurkanGozukara commented 10 months ago

Thanks, it looks like Linux is mandatory at the moment.

kohya-ss / sd-scripts

Multi GPU Training Fails - RuntimeError: Invalid scalar type in dist._broadcast_coalesced #1047