xiaohutongxue-sunny opened this issue 8 months ago
Hello, it looks like your server has a bug with distributed training... Sorry, I don't really understand how distributed training works, so I may not be able to help you.
hello. Can you provide a complete error report? A configuration file error can also prevent distributed training from running.
The other problem is a load issue: I can't load Stable Diffusion. All the errors are below. I changed the code to construct the object directly instead of using from_pretrained (i.e. ReferenceNet() in place of ReferenceNet.from_pretrained(pretrained_model_path, subfolder="unet")).
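Note that those two calls are not equivalent: a bare constructor builds a freshly (randomly) initialized network, while from_pretrained loads the checkpoint's saved weights. A toy sketch of the pattern (the class and file names here are illustrative stand-ins, not the project's real API):

```python
import json
import os
import random

class ReferenceNetSketch:
    """Toy stand-in for a diffusers-style model class; illustrative only."""

    def __init__(self):
        # Bare constructor: weights are freshly initialized, NOT pretrained.
        self.weights = [random.random() for _ in range(4)]

    @classmethod
    def from_pretrained(cls, path, subfolder="unet"):
        # Replace the fresh weights with the ones saved in the checkpoint.
        model = cls()
        with open(os.path.join(path, subfolder, "weights.json")) as f:
            model.weights = json.load(f)
        return model
```

So swapping from_pretrained for a bare constructor will make the load error disappear, but the model then trains from scratch rather than from the Stable Diffusion initialization.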
I've fixed the bug. It was caused by my Stable Diffusion checkpoint: the checkpoint contained errors, so the model couldn't load successfully. Amazing.
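A corrupted or partially downloaded checkpoint is a common cause of load failures like this, and verifying the file's hash before training catches it early. A minimal sketch, assuming the model source publishes a SHA-256 checksum for the file:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a checkpoint file in chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_checkpoint(path, expected_hex):
    """Compare the local file against the published checksum."""
    return sha256_of(path) == expected_hex
```

If the hashes don't match, re-downloading the checkpoint is usually the fix.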
Hello! I'm running into the same error as you, but only on a single GPU. Could you show me exactly how you solved this problem?
Thanks a lot!
Have you set all the training parameters correctly? I wrote instructions for starting the first stage of training; they're in my git repo.
You can approach this in two ways: 1) switch from distributed training to normal (single-process) training; 2) give me more details. It's the same error message, but different underlying problems need different solutions.
I found out that it was actually an OOM error. I solved it by adjusting the config file. Thanks a lot!
```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 44522) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_hack.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2024-01-24_20:20:12
  host       : 5038d4aa163f
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 44523)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-01-24_20:20:12
  host       : 5038d4aa163f
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 44522)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
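The `error_file : <N/A>` lines mean the worker processes crashed without recording what actually went wrong; torchrun can only report `exitcode : 1`. The PyTorch docs linked in the log recommend decorating the training entry point (here, `main()` in `train_hack.py`) with `@record` from `torch.distributed.elastic.multiprocessing.errors`, so each rank's real exception gets captured. A rough pure-Python sketch of the idea (not torch's actual implementation):

```python
import functools
import json
import traceback

def record(fn, error_file="elastic_error.json"):
    """Sketch of what torch's @record does: catch the worker's exception
    and persist the traceback to an error file, so the launcher can report
    more than just a bare exit code."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            with open(error_file, "w") as f:
                json.dump({"message": str(exc),
                           "traceback": traceback.format_exc()}, f)
            raise  # re-raise so the process still exits non-zero
    return wrapper
```

With the real decorator applied, the `traceback : <N/A>` fields in the summary above are replaced by the failed rank's actual Python traceback, which is usually enough to tell an OOM from a config or checkpoint problem.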