kohya-ss / sd-scripts

poolFD failed error when caching latents for flux finetuning #1601

Open huxian0402 opened 1 month ago

huxian0402 commented 1 month ago

@kohya-ss When I fine-tune FLUX.1 with 18,000 images, the following error occurs partway through caching the latents. What could be the problem? Is this a bug, or is it just that the dataset is so large that caching the latents is too slow?

```
2024-09-15 17:26:03 INFO caching latents... train_util.py:1039

  0%| | 0/13732 [00:00<?, ?it/s]
  0%| | 1/13732 [00:06<26:41:11, 7.00s/it]
  0%| | 2/13732 [00:07<1:11:27, 3.20it/s]
  0%| | 3/13732 [00:07<55:25, 4.13it/s]
...
 17%|█▋ | 2284/13732 [05:59<13:34, 14.05it/s]
 17%|█▋ | 2286/13732 [05:59<13:34, 14.05it/s]
 17%|█▋ | 2288/13732 [06:00<13:39, 13.96it/s]
 17%|█▋ | 2290/13732 [06:00<13:06, 14.54it/s]
 17%|█▋ | 2292/13732 [06:04<6:44:39, 2.12s/it]
 17%|█▋ | 2293/13732 [06:07<10:17:04, 3.24s/it]
 17%|█▋ | 2294/13732 [06:11<10:49:31, 3.41s/it]
 17%|█▋ | 2295/13732 [06:14<9:54:20, 3.12s/it]
[rank1]:[W915 17:32:18.928929035 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank4]:[W915 17:32:18.928928746 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank2]:[W915 17:32:18.928932595 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank3]:[W915 17:32:18.928931444 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank5]:[W915 17:32:18.929296522 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/project/sd-scripts_sd3/flux_train.py", line 905, in <module>
[rank3]:   File "/home/project/sd-scripts_sd3/flux_train.py", line 193, in train
[rank3]:   File "/home/miniforge3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 2564, in wait_for_everyone
[rank3]:   File "/home/miniforge3/envs/flux/lib/python3.10/site-packages/accelerate/utils/other.py", line 138, in wait_for_everyone
[rank3]:   File "/home/miniforge3/envs/flux/lib/python3.10/site-packages/accelerate/state.py", line 374, in wait_for_everyone
[rank3]:   File "/home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3936, in barrier
[rank3]:     work = default_pg.barrier(opts=opts)
[rank3]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
[rank3]: Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:570 (most recent call first):
[rank3]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2f0cca3f86 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libc10.so)
[rank3]: frame #1: <unknown function> + 0x16583cb (0x7f2f473d73cb in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #2: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2f4ba89b82 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #3: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2f4ba8ad71 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #4: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2f4ba3f7c1 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2f4ba3f7c1 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2f4ba3f7c1 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2f4ba3f7c1 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7f2f0df70dbf in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0x114c (0x7f2f0df7cb9c in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #10: <unknown function> + 0x11acfff (0x7f2f0df84fff in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #11: c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&) + 0x10 (0x7f2f0df86430 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #12: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x69c (0x7f2f0df9353c in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #13: <unknown function> + 0x5cb2ff2 (0x7f2f4ba31ff2 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #14: <unknown function> + 0x5cbd7f5 (0x7f2f4ba3c7f5 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #15: <unknown function> + 0x52dfa0b (0x7f2f4b05ea0b in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #16: <unknown function> + 0x52dd284 (0x7f2f4b05c284 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #17: <unknown function> + 0x1adf2b8 (0x7f2f4785e2b8 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #18: <unknown function> + 0x5cc7764 (0x7f2f4ba46764 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #19: <unknown function> + 0x5cc84f5 (0x7f2f4ba474f5 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #20: <unknown function> + 0xdb3778 (0x7f2f5ed9d778 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank3]: frame #21: <unknown function> + 0x4b1144 (0x7f2f5e49b144 in /home/miniforge3/envs/flux/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank3]: frame #22: <unknown function> + 0x172df4 (0x56400fe52df4 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #23: _PyObject_MakeTpCall + 0x1f8 (0x56400fe19db8 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #24: <unknown function> + 0xeb5a7 (0x56400fdcb5a7 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #25: <unknown function> + 0x105bbf (0x56400fde5bbf in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #26: <unknown function> + 0x1871eb (0x56400fe671eb in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #27: _PyObject_Call + 0x1f6 (0x56400fe203f6 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #28: _PyEval_EvalFrameDefault + 0x2216 (0x56400febac16 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #29: <unknown function> + 0x1871eb (0x56400fe671eb in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #30: <unknown function> + 0x10669e (0x56400fde669e in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #31: <unknown function> + 0x1b6ca5 (0x56400fe96ca5 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #32: <unknown function> + 0x10669e (0x56400fde669e in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #33: <unknown function> + 0x1871eb (0x56400fe671eb in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #34: <unknown function> + 0x105472 (0x56400fde5472 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #35: <unknown function> + 0x1871eb (0x56400fe671eb in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #36: <unknown function> + 0x106c30 (0x56400fde6c30 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #37: <unknown function> + 0x1871eb (0x56400fe671eb in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #38: <unknown function> + 0x105472 (0x56400fde5472 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #39: <unknown function> + 0x1871eb (0x56400fe671eb in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #40: PyEval_EvalCode + 0x88 (0x56400fe780e8 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #41: <unknown function> + 0x248f1b (0x56400ff28f1b in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #42: <unknown function> + 0x27e805 (0x56400ff5e805 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #43: <unknown function> + 0x280bb0 (0x56400ff60bb0 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #44: _PyRun_SimpleFileObject + 0x1b8 (0x56400ff60d98 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #45: _PyRun_AnyFileObject + 0x44 (0x56400ff60ea4 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #46: Py_RunMain + 0x3ff (0x56400ff6205f in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #47: Py_BytesMain + 0x39 (0x56400ff621e9 in /home/miniforge3/envs/flux/bin/python)
[rank3]: frame #48: __libc_start_main + 0xf5 (0x7f2f66cf7555 in /lib64/libc.so.6)
[rank3]: frame #49: <unknown function> + 0x206e86 (0x56400fee6e86 in /home/miniforge3/envs/flux/bin/python)
[rank3]: . This may indicate a possible application crash on rank 0 or a network set up issue.
```

kohya-ss commented 1 month ago

I don't think there has been a report like this before. If you restart the run, caching may resume from where it left off.

Caching is currently slow because it does not support multi-GPU operation (it runs on a single GPU). Specifying --highvram and increasing --vae_batch_size to match your VRAM (for example, starting from about 4) may improve the speed.
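
For example, something along these lines (only the caching-related options matter here; `my_dataset.toml` and the model/optimizer arguments you already use are placeholders and omitted):

```bash
# --cache_latents_to_disk writes .npz caches next to the images, so an interrupted
#   caching pass (or a later run) can pick up where it left off instead of redoing it.
# --highvram keeps the VAE work on the GPU throughout caching.
# --vae_batch_size sets how many images are encoded per VAE forward pass;
#   raise it as long as VRAM allows.
accelerate launch --num_processes 6 flux_train.py \
  --dataset_config my_dataset.toml \
  --cache_latents --cache_latents_to_disk \
  --highvram \
  --vae_batch_size 4
```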

huxian0402 commented 1 month ago

> I don't think there has been a report like this before. If you restart the run, caching may resume from where it left off.
>
> Caching is currently slow because it does not support multi-GPU operation (it runs on a single GPU). Specifying --highvram and increasing --vae_batch_size to match your VRAM (for example, starting from about 4) may improve the speed.

@kohya-ss Thank you for your reply. After specifying --highvram and increasing --vae_batch_size to match the available VRAM, latent caching is indeed faster, but it still takes a long time. With a particularly large training set the wait is unbearable, so multi-GPU caching seems like the best solution for now. When will caching support multi-GPU operation?
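
What I have in mind is roughly the sketch below (not a patch against sd-scripts, just the general idea; `load_and_preprocess` and the .npz layout are made-up placeholders): each rank encodes its own shard of the images and writes the latents to disk, so N GPUs cache roughly N times faster.

```python
# Rough sketch of sharded (multi-GPU) latent caching -- NOT the sd-scripts implementation.
# Each rank handles image_paths[rank::world_size], writes one .npz per image, and all
# ranks meet at a barrier before training begins.
from pathlib import Path

import numpy as np
import torch
import torch.distributed as dist


def cache_latents_sharded(image_paths, vae, device):
    if not dist.is_initialized():
        dist.init_process_group("nccl")
    rank, world = dist.get_rank(), dist.get_world_size()

    for path in image_paths[rank::world]:
        out = Path(path).with_suffix(".npz")
        if out.exists():  # lets an interrupted run resume
            continue
        image = load_and_preprocess(path).to(device)  # hypothetical helper -> (1, 3, H, W) tensor
        with torch.no_grad():
            latent = vae.encode(image).latent_dist.sample()  # diffusers-style VAE API
        np.savez(out, latents=latent.squeeze(0).cpu().numpy())

    dist.barrier()  # wait until every shard has finished before training starts
```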

huxian0402 commented 1 month ago

> I don't think there has been a report like this before. If you restart the run, caching may resume from where it left off.
>
> Caching is currently slow because it does not support multi-GPU operation (it runs on a single GPU). Specifying --highvram and increasing --vae_batch_size to match your VRAM (for example, starting from about 4) may improve the speed.

@kohya-ss I've tried this many times and confirmed that the NCCL timeout and the poolFD failed error are caused by the slow caching: when the training set is large and multi-GPU training is launched, the caching step nested inside the training process is not efficient enough, and the other ranks time out while they wait. Would it be better to split the caching of FLUX.1's text encoder outputs and VAE latents into a standalone data preprocessing script that is run once before training, so the cache is guaranteed to be complete?
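
In the meantime, the Socket Timeout itself could probably be pushed back by giving the process group a longer timeout, so the idle ranks keep waiting at the barrier while rank 0 caches. This is just the generic Accelerate pattern, shown as a sketch; it would have to be wired into wherever the script creates its Accelerator:

```python
# Generic Accelerate pattern (a sketch, not sd-scripts' current code): give the
# c10d/NCCL rendezvous a longer timeout so ranks 1..N can sit at the barrier while
# rank 0 finishes caching, instead of failing with "Socket Timeout".
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=6))
accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```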

kohya-ss commented 1 month ago

> Would it be better to split the caching of FLUX.1's text encoder outputs and VAE latents into a standalone data preprocessing script that is run once before training, so the cache is guaranteed to be complete?

I think you're right. There are scripts called cache_latents.py and cache_text_encoder_outputs.py in the repository, but they are not compatible with FLUX.1 yet.
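
Until those scripts are updated, one workaround along the same lines may be to run the slow pass once as a single process with the on-disk cache options, and then start the real multi-GPU run against the caches that are already on disk. Roughly like this (an untested sketch; model paths and the other required arguments are omitted, and the dataset config must be identical in both steps):

```bash
# Step 1 (single process): populate the on-disk caches, then stop.
accelerate launch --num_processes 1 flux_train.py \
  --dataset_config my_dataset.toml \
  --cache_latents --cache_latents_to_disk \
  --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
  --highvram --vae_batch_size 8 \
  --max_train_steps 1

# Step 2 (multi-GPU): with the same dataset and caching options, the run reuses the
# .npz caches written in step 1 instead of re-encoding everything.
accelerate launch --num_processes 6 flux_train.py \
  --dataset_config my_dataset.toml \
  --cache_latents --cache_latents_to_disk \
  --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk
```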