alexdaszek opened 7 hours ago
👋 Hello @alexdaszek, thank you for reaching out about your multi-GPU training issue with YOLOv10m 🚀! An Ultralytics engineer will assist you soon, but in the meantime, here's some information that might help.
If you believe this is a 🐛 Bug Report, please ensure you have provided a detailed minimum reproducible example, which you have done excellently. This is crucial in diagnosing and resolving issues efficiently.
If this is a setup or custom training ❓ Question, make sure all your system dependencies, including Python, PyTorch, and CUDA, are compatible with the ultralytics package. Multi-GPU setups can sometimes introduce complexities, especially if different GPU models are used, but you've mentioned using identical GPUs, which is great.
For additional support, I recommend participating in discussions on platforms like Discord or the Ultralytics community forums, where you can engage with others who might have faced similar challenges or share knowledge with other community members.
It’s also a good practice to ensure you are running the latest version of the ultralytics package and to verify that all dependencies are up to date in a proper Python environment. You might also check that the PyTorch installation includes libuv support as indicated by the error message; sometimes, building PyTorch from source with specific flags can resolve such issues.
Stay tuned for more information from the Ultralytics team! 😊
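Before diving into workarounds, a quick way to confirm whether the installed PyTorch build actually lacks libuv is to request a libuv-backed TCPStore directly. This is a minimal diagnostic sketch (the address and port are arbitrary; run it in the same environment used for training):

import torch
from datetime import timedelta
from torch.distributed import TCPStore

print(f"PyTorch version: {torch.__version__}")
try:
    # Explicitly request the libuv backend; a build compiled without libuv
    # raises the same RuntimeError that appears later in this thread.
    store = TCPStore("127.0.0.1", 29499, world_size=1, is_master=True,
                     timeout=timedelta(seconds=5), use_libuv=True)
    print("libuv-backed TCPStore created successfully")
except RuntimeError as e:
    print(f"No libuv support in this build: {e}")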
Try adding this before your code:
import os
os.environ["USE_LIBUV"] = "0"
Thanks for the suggestion, I updated my script to use this:
import argparse
from ultralytics import YOLO
import torch
import os

os.environ["USE_LIBUV"] = "0"

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    model = YOLO('yolov10m.pt')
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'
    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0, 1],
        workers=8,
        verbose=True
    )
I get the same error. (I saw this suggestion mentioned, I think from Claude, but I didn't understand it: it looks like we're telling it not to use libuv, yet the error message says we don't have it, so wouldn't we want to enable it?) I did notice this DDP debug command early in the logs, though:
DDP: debug command C:\Users\aldas\miniconda3\envs\yolo_env_py310\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 51974 C:\Users\aldas\AppData\Roaming\Ultralytics\DDP\_temp_cyxeo4yl1920210943952.py
W1021 18:39:49.331000 22344 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
So I ran that, and this is what the console shows:
W1021 18:42:20.966000 13072 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 905, in <module>
main()
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 901, in main
run(args)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 892, in run
elastic_launch(
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
result = agent.run()
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 680, in run
result = self._invoke_run(role)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 829, in _invoke_run
self._initialize_workers(self._worker_group)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 652, in _initialize_workers
self._rendezvous(worker_group)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 489, in _rendezvous
rdzv_info = spec.rdzv_handler.next_rendezvous()
File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 66, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
PS C:\Users\aldas\yolo-nepenthes-model>
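Worth noting: this traceback shows the failure happening inside the launcher itself (torch.distributed.run's static TCP rendezvous), before any Ultralytics or user code runs. When rerunning the command manually, one hedged thing to try is setting the variable in the shell so the launcher process itself sees it; whether this torch build consults USE_LIBUV in the elastic rendezvous is uncertain. In PowerShell:

$env:USE_LIBUV = "0"
C:\Users\aldas\miniconda3\envs\yolo_env_py310\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 51974 C:\Users\aldas\AppData\Roaming\Ultralytics\DDP\_temp_cyxeo4yl1920210943952.py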
But a quote from glenn here gives me the impression that warning is not really an issue, just something to tweak if you run into performance problems, and I haven't gotten that far yet:
The warning you're seeing is designed to alert users that setting the "OMP_NUM_THREADS" environment too high might cause the system to be overloaded. It suggests further fine-tuning of this variable for optimal performance as needed.
But I tried adding os.environ["OMP_NUM_THREADS"] = "4" anyway and no dice, same error about libuv.
Can you place the code I sent before importing torch?
Also try adding this after your imports:

import ultralytics.engine.trainer as trainer
from torch import distributed as dist

def _setup_ddp(self, world_size):
    """Initializes and sets the DistributedDataParallel parameters for training."""
    torch.cuda.set_device(RANK)
    self.device = torch.device("cuda", RANK)
    # LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
    os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        init_method="env://?use_libuv=False",
        timeout=timedelta(seconds=10800),  # 3 hours
        rank=RANK,
        world_size=world_size,
    )

trainer._setup_ddp = _setup_ddp
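Two hedged caveats about this patch, for anyone following along. First, assigning to the module attribute (trainer._setup_ddp) does not replace the method that trainer instances actually call; patching the class would look roughly like this (assuming the method is defined on BaseTrainer in ultralytics.engine.trainer):

from ultralytics.engine.trainer import BaseTrainer
BaseTrainer._setup_ddp = _setup_ddp  # now applies to every trainer instance

Second, the tracebacks in this thread show Ultralytics launching DDP via subprocess.run on a generated _temp_*.py script, which re-imports the trainer in a fresh process, so a patch applied in the parent process likely never reaches the spawned workers anyway; that is presumably why editing the installed trainer.py comes up later.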
Sure, I changed it to this:
import argparse
from ultralytics import YOLO
import os

os.environ["USE_LIBUV"] = "0"

import torch

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    # Load the pre-trained model
    model = YOLO('yolov10m.pt')

    # Minimal training configuration
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'
    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0, 1],
        workers=8,
        verbose=True
    )
I'm still getting the libuv error:
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
Traceback (most recent call last):
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

Traceback (most recent call last):
  File "c:\Users\aldas\yolo-nepenthes-model\minimalExample.py", line 26, in <module>
    results = model.train(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\model.py", line 802, in train
    self.trainer.train()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 202, in train
    raise e
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 200, in train
    subprocess.run(cmd, check=True)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\aldas\\miniconda3\\envs\\yolo_env_py310\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '54084', 'C:\\Users\\aldas\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_3deaeiab1798233636816.py']' returned non-zero exit status 1.
This was still with PyTorch version 2.4.1; should I go back to 2.5? Not sure if it would make a difference.
Just saw your update about the function to try. RANK wasn't defined, so I tried to define it like RANK = int(os.environ.get("RANK", -1)). Is that correct? I am still getting the libuv error, though, maybe due to that not being correct.
import argparse
from ultralytics import YOLO
import os

os.environ["USE_LIBUV"] = "0"

import torch
from datetime import timedelta
import ultralytics.engine.trainer as trainer
from torch import distributed as dist

def _setup_ddp(self, world_size):
    """Initializes and sets the DistributedDataParallel parameters for training."""
    RANK = int(os.environ.get("RANK", -1))
    torch.cuda.set_device(RANK)
    self.device = torch.device("cuda", RANK)
    # LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
    os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        init_method="env://?use_libuv=False",
        timeout=timedelta(seconds=10800),  # 3 hours
        rank=RANK,
        world_size=world_size,
    )

trainer._setup_ddp = _setup_ddp

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    model = YOLO('yolov10m.pt')

    # Minimal training configuration
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'
    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0, 1],
        workers=8,
        verbose=True
    )
The log from that run:
DDP: debug command C:\Users\aldas\miniconda3\envs\yolo_env_py310\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 54215 C:\Users\aldas\AppData\Roaming\Ultralytics\DDP\_temp_bd4jh5gl2032857398864.py
W1021 19:21:30.531000 10520 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 905, in <module>
    main()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 901, in main
    run(args)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 892, in run
    elastic_launch(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
    result = agent.run()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 680, in run
    result = self._invoke_run(role)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 652, in _initialize_workers
    self._rendezvous(worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

Traceback (most recent call last):
  File "c:\Users\aldas\yolo-nepenthes-model\minimalExample.py", line 47, in <module>
    results = model.train(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\model.py", line 802, in train
    self.trainer.train()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 202, in train
    raise e
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 200, in train
    subprocess.run(cmd, check=True)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\aldas\\miniconda3\\envs\\yolo_env_py310\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '54215', 'C:\\Users\\aldas\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_bd4jh5gl2032857398864.py']' returned non-zero exit status 1.
I guess you will have to edit the source code and replace this function with the version I sent.
This file: C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py
I'm a little confused by this; the functions are already the same? Unless you just mean the line that I had to add, RANK = int(os.environ.get("RANK", -1)).
I added that line to my local trainer.py file just to see what would happen, but it didn't change anything and I still have the libuv error. Looking through that code, I don't think declaring RANK there is correct, since it's defined elsewhere in the trainer.py file. I'll try to step through that code and see what the actual RANK value is. I appreciate all the suggestions.
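For reference, and worth verifying against the installed version: Ultralytics defines these rank constants centrally in ultralytics.utils, and trainer.py imports them from there, which is why RANK resolves inside the original _setup_ddp without being defined in the function itself. Roughly:

import os

# From ultralytics/utils/__init__.py (recent releases; check your installed copy):
RANK = int(os.getenv("RANK", -1))
LOCAL_RANK = int(os.getenv("LOCAL_RANK", -1))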
It isn't the same. It has an extra line:
init_method="env://?use_libuv=False",
You should just paste the _setup_ddp I sent and replace the one in the file.
You're right! I modified the local trainer.py file and updated my script.
This is the minimal example:
import argparse
from ultralytics import YOLO
import os

os.environ["USE_LIBUV"] = "0"

import torch

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    # Load the pre-trained model
    model = YOLO('yolov10m.pt')

    # Minimal training configuration
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'
    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0, 1],
        workers=8,
        verbose=True
    )
And to the local trainer.py, I had to add import ultralytics.engine.trainer as trainer at the top, add init_method="env://?use_libuv=False", to the function itself as you said, and then add trainer._setup_ddp = _setup_ddp outside of the function but inside of the class.
def _setup_ddp(self, world_size):
    """Initializes and sets the DistributedDataParallel parameters for training."""
    torch.cuda.set_device(RANK)
    self.device = torch.device("cuda", RANK)
    # LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
    os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        init_method="env://?use_libuv=False",
        timeout=timedelta(seconds=10800),  # 3 hours
        rank=RANK,
        world_size=world_size,
    )

trainer._setup_ddp = _setup_ddp
Here are the logs, still showing the libuv error unfortunately. Running the debug command also throws the libuv error.
DDP: debug command C:\Users\aldas\miniconda3\envs\yolo_env_py310\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 55668 C:\Users\aldas\AppData\Roaming\Ultralytics\DDP\_temp_in20ydga1929515286480.py
W1021 20:12:33.050000 10984 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 905, in <module>
    main()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 901, in main
    run(args)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 892, in run
    elastic_launch(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
    result = agent.run()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 680, in run
    result = self._invoke_run(role)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 652, in _initialize_workers
    self._rendezvous(worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

Traceback (most recent call last):
  File "c:\Users\aldas\yolo-nepenthes-model\minimalExample.py", line 47, in <module>
    results = model.train(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\model.py", line 802, in train
    self.trainer.train()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 205, in train
    raise e
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 203, in train
    subprocess.run(cmd, check=True)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\aldas\\miniconda3\\envs\\yolo_env_py310\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '55668', 'C:\\Users\\aldas\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_in20ydga1929515286480.py']' returned non-zero exit status 1.
You just need to replace the function with this:
def _setup_ddp(self, world_size):
    """Initializes and sets the DistributedDataParallel parameters for training."""
    torch.cuda.set_device(RANK)
    self.device = torch.device("cuda", RANK)
    # LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
    os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        init_method="env://?use_libuv=False",
        timeout=timedelta(seconds=10800),  # 3 hours
        rank=RANK,
        world_size=world_size,
    )
You don't need to perform the import or do trainer._setup_ddp = _setup_ddp.
You can also try downgrading to PyTorch 2.3.x
https://github.com/RVC-Boss/GPT-SoVITS/issues/1357#issuecomment-2255295246
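A sketch of what that downgrade could look like in this conda env (the exact version pins and the CUDA 12.1 wheel index are assumptions; match them to your CUDA setup):

pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu121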
Discussed in https://github.com/orgs/ultralytics/discussions/16259
I'm experiencing the same issue even though I have identical GPUs, but still receive the error RuntimeError: use_libuv was requested but PyTorch was build without libuv support. I'm using two 4060 Tis, PyTorch version 2.5.0, CUDA 12.4. Here is my minimal reproducible code example, which includes logs to verify versions and that it does see both GPUs:
Here is what it logs:
The linked pytorch versions page didn't mention libuv, but I tried it with version 2.4.1 just in case and got the same error message about libuv:
If I try torch.distributed.launch, I get the same libuv error, as well as a warning that torch.distributed.launch is deprecated, but just after that there was a note about the LOCAL_RANK usage, which was wrong in my script (something about reading it from the environment variable). So I changed the minimal example to this to fix that:
This gives a different error, ValueError: Invalid CUDA 'device=-1' requested. I am not sure why my LOCAL_RANK is resolving to -1, and if I follow the log message and just set 'device=0,1', then I just start getting the original error, RuntimeError: use_libuv was requested but PyTorch was build without libuv support.
What am I missing here to get multi-GPU training working? Thanks in advance, much appreciated.