kopyl opened this issue 1 month ago
cc: @sayakpaul
Cc: @linoytsaban
@linoytsaban I can call to debug it together on my hardware if needed :)
The default NCCL timeout is 600 seconds: here. Sometimes validation on multiple prompts, or saving an FSDP model, can take longer than this. I would suggest increasing the timeout to 1800 seconds, which usually fixes any timeout problems for me.
You can do this by:
+ from accelerate.utils import InitProcessGroupKwargs
+ from datetime import timedelta
...
accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
+ init_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1800))
accelerator = Accelerator(
gradient_accumulation_steps=args.gradient_accumulation_steps,
mixed_precision=args.mixed_precision,
log_with=args.report_to,
project_config=accelerator_project_config,
- kwargs_handlers=[ddp_kwargs],
+ kwargs_handlers=[ddp_kwargs, init_kwargs],
)
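For reference, the full setup with the change applied might look roughly like this (a sketch that assumes the same `args` and `logging_dir` variables the script already defines):

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import (
    DistributedDataParallelKwargs,
    InitProcessGroupKwargs,
    ProjectConfiguration,
)

accelerator_project_config = ProjectConfiguration(
    project_dir=args.output_dir, logging_dir=logging_dir
)
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
# Give slow collectives (validation, FSDP checkpoint saving) up to 30 minutes
# instead of NCCL's 600-second default before the watchdog times out.
init_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1800))

accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.report_to,
    project_config=accelerator_project_config,
    kwargs_handlers=[ddp_kwargs, init_kwargs],
)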
You might sometimes also run into communication timeouts when using multi-GPU training. I've found that setting NCCL_P2P_DISABLE=1 fixes them, but I don't fully understand the details/consequences, so please look into the appropriate docs if you get any errors regarding this.
So I should have NCCL_P2P_DISABLE=1 as an env variable, correct? Like
NCCL_P2P_DISABLE=1 accelerate ...
That should be used when/if you experience any communication timeouts (but should be safe to use anyway). Currently, you're experiencing stale timeouts (because allgather did not happen for 600 seconds) which should be fixable, hopefully, by passing InitProcessGroupKwargs with a timeout of 1800 seconds.
@a-r-r-o-w setting the timeout to 3600 seconds did not help :(
I launched the training with this command:
MODEL_NAME="black-forest-labs/FLUX.1-dev"
INSTANCE_DIR="/dreambooth-datasets/yaremovaa"
OUTPUT_DIR="/flux-dreambooth-outputs/dreamboot-yaremovaa"
!accelerate launch examples/dreambooth/train_dreambooth_flux.py \
--pretrained_model_name_or_path={MODEL_NAME} \
--instance_data_dir={INSTANCE_DIR} \
--output_dir={OUTPUT_DIR} \
--mixed_precision="bf16" \
--instance_prompt="a photo of sks girl" \
--resolution=512 \
--train_batch_size=1 \
--guidance_scale=1 \
--gradient_accumulation_steps=4 \
--optimizer="prodigy" \
--learning_rate=1. \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=2 \
--seed="0" \
--checkpointing_steps=1
It created an empty directory at /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1/pytorch_model_fsdp_0, which stayed empty for 60 minutes. Logs: https://pastebin.com/9SG3tehh
Then it timed out. Logs after the timeout: https://pastebin.com/QU1mJP8U.
Then I tried running the command with NCCL_P2P_DISABLE=1, like this:
!NCCL_P2P_DISABLE=1 accelerate launch examples/dreambooth/train_dreambooth_flux.py \
--pretrained_model_name_or_path={MODEL_NAME} \
--instance_data_dir={INSTANCE_DIR} \
--output_dir={OUTPUT_DIR} \
--mixed_precision="bf16" \
--instance_prompt="a photo of sks girl" \
--resolution=512 \
--train_batch_size=1 \
--guidance_scale=1 \
--gradient_accumulation_steps=4 \
--optimizer="prodigy" \
--learning_rate=1. \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=2 \
--seed="0" \
--checkpointing_steps=1
The same error: it did not save anything for an hour. I refuse to believe that saving a checkpoint takes more than an hour. Can that be true? Logs.
Please share a config for training Flux on a single H100 NVL (95 GB VRAM) that makes the most of the GPU before offloading anything to the CPU.
Just tried training on a single H100 with this accelerate config:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
And I got this error:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
09/25/2024 13:41:54 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 15534.46it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00, 1.63it/s]
Fetching 3 files: 100%|█████████████████████████| 3/3 [00:00<00:00, 7117.03it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py:440: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
warnings.warn(
09/25/2024 13:42:09 - INFO - __main__ - ***** Running training *****
09/25/2024 13:42:09 - INFO - __main__ - Num examples = 10
09/25/2024 13:42:09 - INFO - __main__ - Num batches each epoch = 10
09/25/2024 13:42:09 - INFO - __main__ - Num Epochs = 1
09/25/2024 13:42:09 - INFO - __main__ - Instantaneous batch size per device = 1
09/25/2024 13:42:09 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
09/25/2024 13:42:09 - INFO - __main__ - Gradient Accumulation steps = 4
09/25/2024 13:42:09 - INFO - __main__ - Total optimization steps = 1
Steps: 0%| | 0/1 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Steps: 0%| | 0/1 [01:14<?, ?it/s, loss=0.51, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/1 [01:17<?, ?it/s, loss=0.417, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/1 [01:20<?, ?it/s, loss=0.338, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
E0925 13:44:45.988664 140466987263808 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 48214) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-25_13:44:45
host : x1-h100.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 48214)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 48214
======================================================
@a-r-r-o-w I just tried a good old SD 1.5 DreamBooth on a single H100 with this command:
MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
INSTANCE_DIR="/home/azureuser/dreambooth-datasets/yaremovaa"
OUTPUT_DIR="/home/azureuser/sd15-dreambooth-outputs/dreamboot-yaremovaa"
CLASS_DIR="/home/azureuser/dreambooth-datasets-class/girl"
!NCCL_P2P_DISABLE=1 accelerate launch examples/dreambooth/train_dreambooth.py \
--pretrained_model_name_or_path={MODEL_NAME} \
--instance_data_dir={INSTANCE_DIR} \
--output_dir={OUTPUT_DIR} \
--class_data_dir={CLASS_DIR} \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of sks girl" \
--class_prompt="a photo of girl" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 --gradient_checkpointing \
--enable_xformers_memory_efficient_attention \
--set_grads_to_none \
--learning_rate=2e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=10
The training went smoothly and saved the checkpoint. I'm wondering what the cause might be: my accelerate config, a faulty H100, or something else.
Do you have any ideas on how I can debug it?
@a-r-r-o-w I just tried running SD 1.5 DreamBooth training with a basic config generated by this command:
from accelerate.utils import write_basic_config
write_basic_config()
The config looks like this:
{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 2,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}
So yeah, it seems like something is going on with my config. Do you have any ideas on how I can change it to fit my training on 2x H100 NVL (95 GB VRAM each) and avoid the timeouts?
https://github.com/huggingface/accelerate/issues/2787 might be relevant.
@a-r-r-o-w with the default accelerate config the DreamBooth Flux training does not even start. I'm getting this error:
[W925 14:52:57.109944407 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W925 14:52:57.109972687 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W925 14:52:57.134315258 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W925 14:52:57.134335928 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
09/25/2024 14:52:57 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
09/25/2024 14:52:57 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: bf16
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 14193.92it/s]
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 13774.40it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00, 1.31s/it]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 10246.67it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00, 1.31s/it]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 10123.02it/s]
Using decoupled weight decay
Using decoupled weight decay
x2-h100:26855:26855 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:26855:26855 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
x2-h100:26855:26855 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x2-h100:26855:26855 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
x2-h100:26856:26856 [1] NCCL INFO cudaDriverVersion 12020
x2-h100:26856:26856 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:26856:26856 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
x2-h100:26856:26856 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x2-h100:26855:27239 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
x2-h100:26855:27239 [0] NCCL INFO Failed to open libibverbs.so[.1]
x2-h100:26855:27239 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:26855:27239 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
x2-h100:26855:27239 [0] NCCL INFO Using non-device net plugin version 0
x2-h100:26855:27239 [0] NCCL INFO Using network Socket
x2-h100:26856:27240 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
x2-h100:26856:27240 [1] NCCL INFO Failed to open libibverbs.so[.1]
x2-h100:26856:27240 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:26856:27240 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
x2-h100:26856:27240 [1] NCCL INFO Using non-device net plugin version 0
x2-h100:26856:27240 [1] NCCL INFO Using network Socket
x2-h100:26856:27240 [1] NCCL INFO comm 0xaac9220 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0xcd8f216bbce7889c - Init START
x2-h100:26855:27239 [0] NCCL INFO comm 0x9836790 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xcd8f216bbce7889c - Init START
x2-h100:26856:27240 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
x2-h100:26856:27240 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffff00,00000000
x2-h100:26855:27239 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
x2-h100:26855:27239 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
x2-h100:26856:27240 [1] NCCL INFO comm 0xaac9220 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
x2-h100:26855:27239 [0] NCCL INFO comm 0x9836790 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
x2-h100:26856:27240 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
x2-h100:26855:27239 [0] NCCL INFO Channel 00/04 : 0 1
x2-h100:26856:27240 [1] NCCL INFO P2P Chunksize set to 131072
x2-h100:26855:27239 [0] NCCL INFO Channel 01/04 : 0 1
x2-h100:26855:27239 [0] NCCL INFO Channel 02/04 : 0 1
x2-h100:26855:27239 [0] NCCL INFO Channel 03/04 : 0 1
x2-h100:26855:27239 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
x2-h100:26855:27239 [0] NCCL INFO P2P Chunksize set to 131072
x2-h100:26856:27240 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:26856:27240 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:26856:27240 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:26856:27240 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:26855:27239 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:26855:27239 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:26855:27239 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:26855:27239 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:26855:27239 [0] NCCL INFO Connected all rings
x2-h100:26855:27239 [0] NCCL INFO Connected all trees
x2-h100:26856:27240 [1] NCCL INFO Connected all rings
x2-h100:26856:27240 [1] NCCL INFO Connected all trees
x2-h100:26856:27240 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x2-h100:26856:27240 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x2-h100:26855:27239 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x2-h100:26855:27239 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x2-h100:26855:27239 [0] NCCL INFO comm 0x9836790 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xcd8f216bbce7889c - Init COMPLETE
x2-h100:26856:27240 [1] NCCL INFO comm 0xaac9220 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0xcd8f216bbce7889c - Init COMPLETE
[rank0]:[W925 14:53:20.324185688 Utils.hpp:110] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[rank1]:[W925 14:53:20.324948373 Utils.hpp:110] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
09/25/2024 14:53:20 - INFO - __main__ - ***** Running training *****
09/25/2024 14:53:20 - INFO - __main__ - Num examples = 10
09/25/2024 14:53:20 - INFO - __main__ - Num batches each epoch = 5
09/25/2024 14:53:20 - INFO - __main__ - Num Epochs = 1
09/25/2024 14:53:20 - INFO - __main__ - Instantaneous batch size per device = 1
09/25/2024 14:53:20 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
09/25/2024 14:53:20 - INFO - __main__ - Gradient Accumulation steps = 4
09/25/2024 14:53:20 - INFO - __main__ - Total optimization steps = 2
Steps: 0%| | 0/2 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank0]: main(args)
[rank0]: File "examples/dreambooth/train_dreambooth_flux.py", line 1585, in main
[rank0]: if transformer.config.guidance_embeds:
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank0]: raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'DistributedDataParallel' object has no attribute 'config'
[rank1]: Traceback (most recent call last):
[rank1]: File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank1]: main(args)
[rank1]: File "examples/dreambooth/train_dreambooth_flux.py", line 1585, in main
[rank1]: if transformer.config.guidance_embeds:
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank1]: raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank1]: AttributeError: 'DistributedDataParallel' object has no attribute 'config'
Steps: 0%| | 0/2 [00:00<?, ?it/s]
[rank0]:[W925 14:53:21.913382942 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
x2-h100:26855:27245 [0] NCCL INFO [Service thread] Connection closed by localRank 0
x2-h100:26856:27244 [1] NCCL INFO [Service thread] Connection closed by localRank 1
x2-h100:26855:27271 [0] NCCL INFO comm 0x9836790 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
x2-h100:26856:27272 [1] NCCL INFO comm 0xaac9220 rank 1 nranks 2 cudaDev 1 busId 200000 - Abort COMPLETE
W0925 14:53:22.304201 140655254574912 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 26856 closing signal SIGTERM
E0925 14:53:22.618849 140655254574912 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 26855) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1165, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-25_14:53:22
host : x2-h100.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 26855)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@sayakpaul with your config I still can't run the training. Getting this error: https://pastebin.com/xaqNSh9K
@a-r-r-o-w I commented out `if transformer.config.guidance_embeds:` and now I run out of memory.
I have 2x H100 NVL and it's still not enough. Please share an accelerate config I can use to run the training without running out of memory or having to rent more powerful GPU servers. I'd really appreciate it.
@sayakpaul same thing with your config. I'm running out of memory :(
@a-r-r-o-w setting a timeout to 3600 sec did not help :( The same error. Did not save anything for 1 hour. I refuse to believe that saving a checkpoint takes more than 1 hour. Can it be true? Logs.
I'm sorry for the inconvenience this causes. Our training scripts serve as minimal examples and are not end-to-end solutions for every training configuration. They are usually tested only on basic uncompiled/compiled single-GPU training scenarios. The expectation is that people who want to train seriously will adapt them to their use cases and make the best of them. So things like DeepSpeed/FSDP may not work out of the box and might require extra effort on your end to make them compatible. I think tailoring the script to your needs is the best way to go for FSDP or any other training configuration.
I commented out `if transformer.config.guidance_embeds:` and now I run out of memory.
This happens because DeepSpeed/FSDP wrap the underlying object in a new class. You can see here how it's done for DeepSpeed. You might have to do something similar to access the underlying config object when using FSDP.
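As a minimal sketch (assuming the `accelerator` and `transformer` objects the script already creates), the attribute access would go through the unwrapped module instead of the wrapper:

# Read the config from the underlying model, not from the FSDP/DDP wrapper.
unwrapped_transformer = accelerator.unwrap_model(transformer)
if unwrapped_transformer.config.guidance_embeds:
    ...  # same guidance-embeds branch as in the original script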
@a-r-r-o-w Thank you. Do you have any idea why, with FSDP, it doesn't save the model with the `save_pretrained` method? Maybe there is also a different way to access it?
@a-r-r-o-w It would also be nice to see the exact config the training was tested on, both hardware-wise and software-wise, so I can reproduce it.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
So, this is how we're doing it here: https://github.com/a-r-r-o-w/cogvideox-factory/blob/0affacb2296027fc40a6f3900ce9157b4f4ea46d/training/cogvideox_image_to_video_lora.py#L382
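For reference, a hedged sketch of that unwrap pattern (the helper in the linked file may differ in detail): it peels off the wrapper added by accelerator.prepare() and, if the module was compiled, the torch.compile wrapper, before you touch `config` or call `save_pretrained`:

def unwrap_model(accelerator, model):
    # Remove the DDP/FSDP/DeepSpeed wrapper added by accelerator.prepare().
    model = accelerator.unwrap_model(model)
    # torch.compile wraps modules in OptimizedModule; the original module is in _orig_mod.
    model = model._orig_mod if hasattr(model, "_orig_mod") else model
    return model

Note that with FSDP and a sharded state dict you may additionally need to gather a full state dict (or switch fsdp_state_dict_type to FULL_STATE_DICT) before calling `save_pretrained` on the main process; the sketch above only handles the wrapper classes.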
Can you try the approach from #9829? I have saved memory by implementing it :)
Describe the bug
I run the training but get this error
Reproduction
Run `accelerate config`
Logs
System Info
Ubuntu 20.04, 2x NVIDIA H100, CUDA 12.2, torch==2.4.1, torchvision==0.19.1, Diffusers commit: https://github.com/huggingface/diffusers/commit/ba5af5aebbac0cc18168076a18836f175753d1c7
Who can help?
No response