guoqincode / Open-AnimateAnyone

Unofficial Implementation of Animate Anyone
2.91k stars · 233 forks

When I use distributed training I hit an error; I've tried everything I can think of, but I still can't get past it. Would you mind helping me? #108

Open xiaohutongxue-sunny opened 8 months ago

xiaohutongxue-sunny commented 8 months ago

[Screenshot taken 2024-01-24 20:27:48]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 44522) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_hack.py FAILED

Failures:
[1]:
  time       : 2024-01-24_20:20:12
  host       : 5038d4aa163f
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 44523)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-01-24_20:20:12
  host       : 5038d4aa163f
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 44522)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
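A note on the trace above: torchrun only reports that both ranks exited with code 1; the actual exception stays hidden unless the entry point records it, as the linked elastic docs describe. A minimal sketch, assuming train_hack.py has a main() entry function (the function name and contents are placeholders):

```python
# Sketch only: wrap the training entry point with @record so the real exception
# from each rank is written to the error file instead of being swallowed.
# main() and its body are placeholders for whatever train_hack.py actually defines.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training code

if __name__ == "__main__":
    main()
```

Running the script directly with `python train_hack.py` (no torchrun) is another quick way to see the underlying exception.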

guoqincode commented 8 months ago

Hello, it looks like your server has a bug with distributed training... Sorry, I don't really understand how distributed training works, so I may not be able to help you.

fenghan0430 commented 8 months ago

Hello. Can you provide the complete error report? A configuration-file error can also prevent distributed training from running.

xiaohutongxue-sunny commented 8 months ago

> Hello. Can you provide the complete error report? A configuration-file error can also prevent distributed training from running.

[Screenshot taken 2024-01-30 14:55:51] [Screenshot taken 2024-01-30 14:56:12]

The other error is a loading issue: I can't load Stable Diffusion. All the errors are in the screenshots above. I changed the code to construct the object directly instead of loading it with from_pretrained (ReferenceNet() in place of ReferenceNet.from_pretrained(pretrained_model_path, subfolder="unet")).
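For clarity, the change described above amounts to something like the following (a sketch only; whether ReferenceNet() can be built with default arguments depends on the class definition in this repo):

```python
# Workaround described above (sketch, not the repo's official code path).
# Original call, which failed because the Stable Diffusion checkpoint was broken:
#   referencenet = ReferenceNet.from_pretrained(pretrained_model_path, subfolder="unet")
# Replacement: construct the module from scratch instead of loading pretrained weights.
referencenet = ReferenceNet()
```

Note that this only sidesteps the load error: the network then starts from random weights instead of the Stable Diffusion UNet initialization.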

xiaohutongxue-sunny commented 8 months ago

> Hello, it looks like your server has a bug with distributed training... Sorry, I don't really understand how distributed training works, so I may not be able to help you.

I fixed the bug: it was caused by Stable Diffusion. My Stable Diffusion checkpoint contained an error, so the model couldn't load successfully. Amazing.
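If anyone else hits the same thing, one way to check whether the checkpoint itself is broken is to load it directly with diffusers, outside the training script (a sketch; the path is a placeholder and the checkpoint is assumed to be in diffusers format):

```python
# Sanity check (sketch): try loading the Stable Diffusion components directly.
# If this fails, the checkpoint is the problem, not the training code.
from diffusers import AutoencoderKL, UNet2DConditionModel

pretrained_model_path = "/path/to/stable-diffusion-v1-5"  # placeholder path
unet = UNet2DConditionModel.from_pretrained(pretrained_model_path, subfolder="unet")
vae = AutoencoderKL.from_pretrained(pretrained_model_path, subfolder="vae")
print("unet parameters:", sum(p.numel() for p in unet.parameters()))
```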

Valentino-L commented 7 months ago

Hello! I am having the same error as you did. I'm only running on a single GPU. Could you show me exactly how you solved this problem?

Thanks a lot!

fenghan0430 commented 7 months ago

> Hello! I am having the same error as you did. I'm only running on a single GPU. Could you show me exactly how you solved this problem?
>
> Thanks a lot!

Have you set all the training parameters correctly? I wrote instructions for starting the first stage of training; they are in my git repo.

xiaohutongxue-sunny commented 7 months ago

> Hello! I am having the same error as you did. I'm only running on a single GPU. Could you show me exactly how you solved this problem?
>
> Thanks a lot!

You can solve it in two ways: 1) switch from distributed training to normal (single-process) training, or 2) give me more details. The same error can have different causes, so the fix may differ.

Valentino-L commented 7 months ago

I found out that it was actually an OOM error. I solved it by adjusting the config file. Thanks a lot!
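For readers landing here with the same OOM: the usual memory-reducing knobs are a smaller per-GPU batch size, a lower training resolution, gradient checkpointing, and memory-efficient attention. How these map onto this repo's config keys is an assumption on my part; at the diffusers level they look roughly like this:

```python
# Sketch of common memory savers; the corresponding YAML keys in this repo's
# training config are not shown because they are repo-specific (an assumption).
from diffusers import UNet2DConditionModel

pretrained_model_path = "/path/to/stable-diffusion-v1-5"  # placeholder
unet = UNet2DConditionModel.from_pretrained(pretrained_model_path, subfolder="unet")
unet.enable_gradient_checkpointing()               # trade compute for memory
unet.enable_xformers_memory_efficient_attention()  # requires xformers to be installed
# Also lower the per-GPU batch size and image resolution in the training config.
```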