
Unable to Train on Multiple, identical, GPUs with YOLOv10m #17080

Open alexdaszek opened 7 hours ago

alexdaszek commented 7 hours ago

Discussed in https://github.com/orgs/ultralytics/discussions/16259

Originally posted by **monggus525** on September 12, 2024:

I am attempting to train YOLOv10n detection using a 4070 Ti (GPU 0) and a 4070 (GPU 1) together. When I specify device=0,1 in the training command, I encounter the following error message:

```
RuntimeError: use_libuv was requested but PyTorch was built without libuv support
subprocess.CalledProcessError: Command '['C:\\Users\\MAI\\anaconda3\\envs\\yolov8\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '65222', 'C:\\Users\\MAI\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_opwapjid2112633383520.py']' returned non-zero exit status 1.
```

My training code was:

`yolo train model=yolov10n.pt device=0,1 epochs=100 data=dataset_fixed.yaml imgsz=1600 batch=4`

I have CUDA 12.4 installed, PyTorch 2.4.1, and the Ultralytics YOLOv8 environment set up. Despite having the latest versions, I still face this issue. Based on GPT's suggestion I also installed pyuv, but the problem persists. Additionally, if I don't specify device=0,1 and just run the training, it only uses device=0 (a single GPU) rather than both GPUs.

Is it expected to run into errors when using different GPU models (in this case 4070 Ti + 4070) for multi-GPU training? If different GPU models cannot be used, should I use identical GPUs (e.g., 4070 Ti + 4070 Ti or 4070 + 4070)? If different GPU models are supported, how can I resolve this issue and successfully train on both GPUs?

I'm experiencing the same issue, but with identical GPUs, and I still receive the error: RuntimeError: use_libuv was requested but PyTorch was build without libuv support.

I'm using two 4060 Tis, PyTorch 2.5.0, and CUDA 12.4. Here is my minimal reproducible code example, which includes logging to verify the versions and confirm that both GPUs are visible:

from ultralytics import YOLO
import torch

if __name__ == '__main__':
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")

    torch.cuda.empty_cache()  # Clear GPU memory

    # Load the pre-trained model
    model = YOLO('yolov10m.pt')

    # Minimal training configuration
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'

    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0, 1],
        workers=8,
        verbose=True
    )

Here is what it logs:

PyTorch version: 2.5.0+cu124
CUDA available: True
CUDA version: 12.4
Number of available GPUs: 2
Ultralytics 8.3.17 🚀 Python-3.10.15 torch-2.5.0+cu124 CUDA:0 (NVIDIA GeForce RTX 4060 Ti, 16380MiB)
                                                       CUDA:1 (NVIDIA GeForce RTX 4060 Ti, 16380MiB)

The linked PyTorch versions page didn't mention libuv, but I tried version 2.4.1 just in case and got the same error message about libuv:

PyTorch version: 2.4.1
CUDA available: True
CUDA version: 12.4
Number of available GPUs: 2
Ultralytics 8.3.17 🚀 Python-3.10.15 torch-2.4.1 CUDA:0 (NVIDIA GeForce RTX 4060 Ti, 16380MiB)
                                                 CUDA:1 (NVIDIA GeForce RTX 4060 Ti, 16380MiB)

If I try torch.distributed.launch I get the same libuv error, plus a warning that torch.distributed.launch is deprecated. Right after that there was a note that my LOCAL_RANK usage was wrong and that it should be read from the environment variable instead, so I changed the minimal example to this to fix that:

import argparse
from ultralytics import YOLO
import torch
import os

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    # Load the pre-trained model
    model = YOLO('yolov10m.pt')

    # Minimal training configuration
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'

    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[local_rank],
        workers=8,
        verbose=True
    )

This gives a different error, ValueError: Invalid CUDA 'device=-1' requested.

PyTorch version: 2.4.1
CUDA available: True
CUDA version: 12.4
Number of available GPUs: 2
Local rank: -1
Ultralytics 8.3.17 🚀 Python-3.10.15 torch-2.4.1 
Traceback (most recent call last):
  File "c:\Users\aldas\yolo-nepenthes-model\minimalExample.py", line 29, in <module>
    results = model.train(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\model.py", line 796, in train   
    self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 103, in __init__
    self.device = select_device(self.args.device, self.args.batch)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\utils\torch_utils.py", line 192, in select_device
    raise ValueError(
ValueError: Invalid CUDA 'device=-1' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.
ValueError: Invalid CUDA 'device=-1' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0'ValueError: Invalid CUDA 'device=-1' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.
ValueError: Invalid CUDA 'device=-1' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.
ValueError: Invalid CUDA 'device=-1' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.

ValueError: Invalid CUDA 'device=-1' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.
ValueError: Invalid CUDA 'device=-1' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.

torch.cuda.is_available(): True
torch.cuda.device_count(): 0
os.environ['CUDA_VISIBLE_DEVICES']: None

I am not sure why my LOCAL_RANK resolves to -1, and if I follow the error message and just set 'device=0,1' I am back to the original error, RuntimeError: use_libuv was requested but PyTorch was build without libuv support.

What am I missing here to get multi-GPU training working? Thanks in advance, much appreciated.
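
Edit: if I'm reading the torch.distributed.run docs right, LOCAL_RANK (along with RANK and WORLD_SIZE) is only injected into the worker processes the launcher spawns, so running the script directly never sets it and the os.environ.get('LOCAL_RANK', -1) fallback is what ends up in device. A quick standard-library check of that assumption:

import os

# LOCAL_RANK is set by torch.distributed.run / torchrun in each spawned worker;
# it is absent when the script is run directly, which is why the -1 fallback appears.
if "LOCAL_RANK" in os.environ:
    print(f"Launched by a distributed launcher, local rank = {os.environ['LOCAL_RANK']}")
else:
    print("Not launched by torch.distributed.run, so LOCAL_RANK is unset and -1 is used")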

UltralyticsAssistant commented 7 hours ago

👋 Hello @alexdaszek, thank you for reaching out about your multi-GPU training issue with YOLOv10m 🚀! An Ultralytics engineer will assist you soon, but in the meantime, here's some information that might help.

If you believe this is a 🐛 Bug Report, please ensure you have provided a detailed minimum reproducible example, which you have done excellently. This is crucial in diagnosing and resolving issues efficiently.

If this is a setup or custom training ❓ Question, make sure all your system dependencies, including Python, PyTorch, and CUDA, are compatible with the ultralytics package. Multi-GPU setups can sometimes introduce complexities, especially if different GPU models are used, but you've mentioned using identical GPUs, which is great.

For additional support, I recommend participating in discussions on platforms like Discord or the Ultralytics community forums, where you can engage with others who might have faced similar challenges or share knowledge with other community members.

It’s also a good practice to ensure you are running the latest version of the ultralytics package and to verify that all dependencies are up to date in a proper Python environment. You might also check that the PyTorch installation includes libuv support as indicated by the error message—sometimes, building PyTorch from source with specific flags can resolve such issues.

Stay tuned for more information from the Ultralytics team! 😊

Y-T-G commented 6 hours ago

Try adding this before your code:

import os
os.environ["USE_LIBUV"] = "0"
alexdaszek commented 6 hours ago

Try adding this before your code:

import os
os.environ["USE_LIBUV"] = "0"

Thanks for the suggestion, I updated my script to use this:

import argparse
from ultralytics import YOLO
import torch
import os
os.environ["USE_LIBUV"] = "0"

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    model = YOLO('yolov10m.pt')

    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'

    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0,1],
        workers=8,
        verbose=True
    )

I get the same error. (I think Claude mentioned this too; I didn't understand it, because it looks like we're telling PyTorch not to use libuv, yet the error says we don't have it, so wouldn't we want to enable it instead?) I did notice this DDP debug command early in the logs, though:

DDP: debug command C:\Users\aldas\miniconda3\envs\yolo_env_py310\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 51974 C:\Users\aldas\AppData\Roaming\Ultralytics\DDP\_temp_cyxeo4yl1920210943952.py
W1021 18:39:49.331000 22344 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.

So I ran that, and this is what the console shows:


W1021 18:42:20.966000 13072 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 905, in <module>   
    main()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 901, in main       
    run(args)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 892, in run        
    elastic_launch(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
    result = agent.run()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 680, in run
    result = self._invoke_run(role)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 652, in _initialize_workers
    self._rendezvous(worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
PS C:\Users\aldas\yolo-nepenthes-model> 

But a quote from glenn here gives me the impression this isn't really an issue, just something to tune if you run into performance problems, and I haven't gotten that far yet:

The warning you're seeing is designed to alert users that setting the "OMP_NUM_THREADS" environment too high might cause the system to be overloaded. It suggests further fine-tuning of this variable for optimal performance as needed.

But I tried adding os.environ["OMP_NUM_THREADS"] = "4" anyway and no dice, same error about libuv.

Y-T-G commented 5 hours ago

Can you place the code I sent before importing torch?
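
On the earlier question of why we disable it: the error means libuv was requested (it became the default TCPStore backend in PyTorch 2.4, as far as I know) but the Windows wheel was built without it, so the workaround is to stop requesting libuv rather than enable it. Here is a minimal single-process sketch of the same idea in explicit form; it assumes port 29500 is free and uses the gloo backend, and it only illustrates the use_libuv=False intent, not the Ultralytics code path:

import os
from datetime import timedelta

import torch.distributed as dist

# Minimal env:// rendezvous for a single process, with libuv explicitly not requested.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",                         # gloo works on Windows; NCCL does not
    init_method="env://?use_libuv=False",   # same thing USE_LIBUV=0 is meant to select
    rank=0,
    world_size=1,
    timeout=timedelta(seconds=30),
)
print("Process group initialized without libuv")
dist.destroy_process_group()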

Y-T-G commented 5 hours ago

Also try adding this after your imports:

import ultralytics.engine.trainer as trainer
from torch import distributed as dist

def _setup_ddp(self, world_size):
  """Initializes and sets the DistributedDataParallel parameters for training."""
  torch.cuda.set_device(RANK)
  self.device = torch.device("cuda", RANK)
  # LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
  os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
  dist.init_process_group(
      backend="nccl" if dist.is_nccl_available() else "gloo",
      init_method="env://?use_libuv=False",
      timeout=timedelta(seconds=10800),  # 3 hours
      rank=RANK,
      world_size=world_size,
  )

trainer._setup_ddp = _setup_ddp

alexdaszek commented 5 hours ago

Can you place the code I sent before importing torch?

Sure, I changed it to this:

import argparse
from ultralytics import YOLO
import os
os.environ["USE_LIBUV"] = "0"
import torch

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    # Load the pre-trained model
    model = YOLO('yolov10m.pt')

    # Minimal training configuration
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'

    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0,1],
        workers=8,
        verbose=True
    )

I'm still getting the libuv error:

Traceback (most recent call last):
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
Traceback (most recent call last):
  File "c:\Users\aldas\yolo-nepenthes-model\minimalExample.py", line 26, in <module>
    results = model.train(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\model.py", line 802, in train
    self.trainer.train()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 202, in train
    raise e
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 200, in train
    subprocess.run(cmd, check=True)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\aldas\\miniconda3\\envs\\yolo_env_py310\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '54084', 'C:\\Users\\aldas\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_3deaeiab1798233636816.py']' returned non-zero exit status 1.

This was still with PyTorch 2.4.1; should I go back to 2.5? I'm not sure it would make a difference. I just saw your update about the function to try. RANK wasn't defined, so I defined it as RANK = int(os.environ.get("RANK", -1)); is that correct? I am still getting the libuv error, though, maybe because that isn't right.

import argparse
from ultralytics import YOLO
import os
os.environ["USE_LIBUV"] = "0"
import torch

from datetime import timedelta
import ultralytics.engine.trainer as trainer
from torch import distributed as dist

def _setup_ddp(self, world_size):
  """Initializes and sets the DistributedDataParallel parameters for training."""
  RANK = int(os.environ.get("RANK", -1))
  torch.cuda.set_device(RANK)
  self.device = torch.device("cuda", RANK)
  # LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
  os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
  dist.init_process_group(
      backend="nccl" if dist.is_nccl_available() else "gloo",
      init_method="env://?use_libuv=False",
      timeout=timedelta(seconds=10800),  # 3 hours
      rank=RANK,
      world_size=world_size,
  )

trainer._setup_ddp = _setup_ddp

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    model = YOLO('yolov10m.pt')

    # Minimal training configuration
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'

    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0,1],
        workers=8,
        verbose=True
    )

The log from that run:

DDP: debug command C:\Users\aldas\miniconda3\envs\yolo_env_py310\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 54215 C:\Users\aldas\AppData\Roaming\Ultralytics\DDP\_temp_bd4jh5gl2032857398864.py
W1021 19:21:30.531000 10520 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 905, in <module>   
    main()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 901, in main       
    run(args)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 892, in run        
    elastic_launch(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
    result = agent.run()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 680, in run
    result = self._invoke_run(role)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 652, in _initialize_workers
    self._rendezvous(worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
Traceback (most recent call last):
  File "c:\Users\aldas\yolo-nepenthes-model\minimalExample.py", line 47, in <module>
    results = model.train(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\model.py", line 802, in train
    self.trainer.train()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 202, in train
    raise e
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 200, in train
    subprocess.run(cmd, check=True)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\aldas\\miniconda3\\envs\\yolo_env_py310\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '54215', 'C:\\Users\\aldas\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_bd4jh5gl2032857398864.py']' returned non-zero exit status 1.

Y-T-G commented 5 hours ago

I guess you will have to edit the source code and replace this function with the version I sent.

https://github.com/ultralytics/ultralytics/blob/767aa1caccc732d9aef8d67814e52491732f66c8/ultralytics/engine/trainer.py#L217
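
For context on why editing the installed file is needed: the DDP workers are separate processes that torch.distributed.run starts from a generated temp script, and they import ultralytics fresh, so a monkey-patch made inside the training script never reaches them. Also, _setup_ddp is a method on the BaseTrainer class, so a module-level trainer._setup_ddp = _setup_ddp assignment would not replace it even in the parent process, and RANK does not need defining by hand since trainer.py already imports it from ultralytics.utils. For completeness, a class-level in-process patch would look like the sketch below (assuming ultralytics 8.3.x, where BaseTrainer lives in ultralytics.engine.trainer), but it still only affects the parent process:

import os
from datetime import timedelta

import torch
from torch import distributed as dist
from ultralytics.engine.trainer import BaseTrainer
from ultralytics.utils import RANK  # already provided by ultralytics; no need to hand-roll it


def _setup_ddp(self, world_size):
    """Same as the stock method, with libuv disabled via the init_method query string."""
    torch.cuda.set_device(RANK)
    self.device = torch.device("cuda", RANK)
    os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        init_method="env://?use_libuv=False",
        timeout=timedelta(seconds=10800),  # 3 hours
        rank=RANK,
        world_size=world_size,
    )


# Patch the class, not the module; the spawned DDP workers re-import ultralytics and will
# not see this, which is why editing site-packages/.../trainer.py is what actually applies
# the change to the workers.
BaseTrainer._setup_ddp = _setup_ddp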

Y-T-G commented 5 hours ago

C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py

This file

alexdaszek commented 4 hours ago

I guess you will have to edit the source code and replace this function with the version I sent.

https://github.com/ultralytics/ultralytics/blob/767aa1caccc732d9aef8d67814e52491732f66c8/ultralytics/engine/trainer.py#L217

I'm a little confused by this: aren't the functions already the same? Unless you just mean the line that I had to add, RANK = int(os.environ.get("RANK", -1)).

I added that line to my local trainer.py file just to see what would happen, but it didn't change anything and I still get the libuv error. Looking through that code, I don't think declaring RANK there is correct, since it's defined elsewhere in trainer.py. I'll try stepping through the code to see what the actual RANK value is. I appreciate all the suggestions.

Y-T-G commented 4 hours ago

It isn't the same.

It has an extra line

init_method="env://?use_libuv=False",

Y-T-G commented 4 hours ago

You should just paste the _setup_ddp I sent and replace the one in the file.
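
If the install path differs on another machine, this prints the exact file to edit:

# Locate the installed trainer.py that needs the init_method edit.
import ultralytics.engine.trainer as trainer

print(trainer.__file__)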

alexdaszek commented 4 hours ago

It isn't the same.

It has an extra line

init_method="env://?use_libuv=False",

You're right! I modified the local trainer.py file and updated my script.

This is the minimal example:

import argparse
from ultralytics import YOLO
import os
os.environ["USE_LIBUV"] = "0"
import torch

if __name__ == '__main__':
    # Instead, get local_rank from environment variable
    local_rank = int(os.environ.get('LOCAL_RANK', -1))

    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"Local rank: {local_rank}")

    torch.cuda.empty_cache()  # Clear GPU memory

    # Load the pre-trained model
    model = YOLO('yolov10m.pt')

    # Minimal training configuration
    dataset_yaml_path = r'datasets\nepenthes-species\yolov8\data.yaml'

    results = model.train(
        data=dataset_yaml_path,
        epochs=100,
        imgsz=640,
        batch=16,
        device=[0,1],
        workers=8,
        verbose=True
    )

And in the local trainer.py, I had to add import ultralytics.engine.trainer as trainer at the top, add init_method="env://?use_libuv=False" to the function itself as you said, and then put trainer._setup_ddp = _setup_ddp outside of the function but inside the class.

    def _setup_ddp(self, world_size):
        """Initializes and sets the DistributedDataParallel parameters for training."""
        torch.cuda.set_device(RANK)
        self.device = torch.device("cuda", RANK)
        # LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
        os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
        dist.init_process_group(
            backend="nccl" if dist.is_nccl_available() else "gloo",
            init_method="env://?use_libuv=False",
            timeout=timedelta(seconds=10800),  # 3 hours
            rank=RANK,
            world_size=world_size,
        )

    trainer._setup_ddp = _setup_ddp

Here are the logs, unfortunately still showing the libuv error. Running the debug command also throws the libuv error.

DDP: debug command C:\Users\aldas\miniconda3\envs\yolo_env_py310\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 55668 C:\Users\aldas\AppData\Roaming\Ultralytics\DDP\_temp_in20ydga1929515286480.py
W1021 20:12:33.050000 10984 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
Traceback (most recent call last):
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 905, in <module>
    main()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 348, in wrapper 
    return f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 901, in main
    run(args)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\run.py", line 892, in run
    elastic_launch(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\launcher\api.py", line 255, in launch_agent
    result = agent.run()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 680, in run
    result = self._invoke_run(role)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 652, in _initialize_workers    
    self._rendezvous(worker_group)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
Traceback (most recent call last):
  File "c:\Users\aldas\yolo-nepenthes-model\minimalExample.py", line 47, in <module>
    results = model.train(
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\model.py", line 802, in train
    self.trainer.train()
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 205, in train
    raise e
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\site-packages\ultralytics\engine\trainer.py", line 203, in train
    subprocess.run(cmd, check=True)
  File "C:\Users\aldas\miniconda3\envs\yolo_env_py310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\aldas\\miniconda3\\envs\\yolo_env_py310\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '55668', 'C:\\Users\\aldas\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_in20ydga1929515286480.py']' returned non-zero exit status 1.

Y-T-G commented 4 hours ago

You just need to replace the function with this:

def _setup_ddp(self, world_size):
    """Initializes and sets the DistributedDataParallel parameters for training."""
    torch.cuda.set_device(RANK)
    self.device = torch.device("cuda", RANK)
    # LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
    os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # set to enforce timeout
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        init_method="env://?use_libuv=False",
        timeout=timedelta(seconds=10800),  # 3 hours
        rank=RANK,
        world_size=world_size,
    )

You don't need to perform the import or do trainer._setup_ddp = _setup_ddp.

Y-T-G commented 4 hours ago

You can also try downgrading to PyTorch 2.3.x

https://github.com/RVC-Boss/GPT-SoVITS/issues/1357#issuecomment-2255295246
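
If it helps to confirm whether a given wheel was actually built with libuv before retrying training, one option is a quick probe; this is a sketch rather than an official API check, and it assumes torch >= 2.4 and that the chosen local port is free. It simply tries to create a master TCPStore with use_libuv=True and reports whether the build rejects it:

from datetime import timedelta

from torch.distributed import TCPStore

try:
    # Explicitly request the libuv backend; builds without libuv raise the familiar RuntimeError.
    store = TCPStore(
        "127.0.0.1",
        29999,  # arbitrary port assumed to be free; change it if it's taken
        world_size=1,
        is_master=True,
        timeout=timedelta(seconds=5),
        use_libuv=True,
    )
    print("This PyTorch build supports libuv")
    del store
except RuntimeError as err:
    print(f"libuv is not supported by this build: {err}")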