kohya-ss / sd-scripts

Apache License 2.0

When I train with multiple GPUs, this error is reported. This has been bothering me for a long time. How can I solve it? #1627

Open jinwei1660 opened 2 days ago

jinwei1660 commented 2 days ago

env: PyTorch 2.6, 2 GPUs

W0921 17:28:08.135000 56224 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "e:\pinokio\bin\miniconda\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "e:\pinokio\bin\miniconda\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\pinokio\api\fluxgym.git\env\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\run.py", line 910, in run
    elastic_launch(
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\launcher\api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\launcher\api.py", line 260, in launch_agent
    result = agent.run()
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 696, in run
    result = self._invoke_run(role)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 849, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 668, in _initialize_workers
    self._rendezvous(worker_group)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 500, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "e:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 67, in next_rendezvous
    self._store = TCPStore( # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

kohya-ss commented 22 hours ago

Unfortunately, Accelerate doesn't seem to support multi-GPU training on Windows (gloo backend). Running under WSL might be the solution.
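
As a rough illustration of the suggestion above, a launcher script could check the platform before attempting a multi-GPU run and fall back to a single process on native Windows. This is only a sketch: `should_use_multi_gpu` is a hypothetical helper, not part of sd-scripts, Accelerate, or PyTorch, and the rule it encodes (native Windows means no multi-GPU) is an assumption taken from this thread rather than a documented guarantee. Under WSL, `platform.system()` reports `Linux`, so the check passes there.

```python
import platform
from typing import Optional


def should_use_multi_gpu(num_gpus: int, system_name: Optional[str] = None) -> bool:
    """Return True when a multi-GPU launch seems worth attempting.

    Hypothetical helper. Assumption from this thread: multi-GPU launches
    fail on native Windows, so only non-Windows systems qualify.
    """
    if system_name is None:
        # platform.system() returns "Windows" on native Windows and
        # "Linux" inside WSL.
        system_name = platform.system()
    if num_gpus < 2:
        # Nothing to distribute with fewer than two GPUs.
        return False
    return system_name != "Windows"


if __name__ == "__main__":
    if should_use_multi_gpu(num_gpus=2):
        print("Multi-GPU launch looks feasible on this platform.")
    else:
        print("Fall back to a single GPU, or run under WSL.")
```

The optional `system_name` argument exists only so the decision can be exercised without being on the platform in question.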