CalledProcessError while training YOLOv8 on Kaggle

ammar3010 commented 1 week ago

Search before asking

[X] I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

No response

Bug

`CalledProcessError                        Traceback (most recent call last)
Cell In[11], line 4
      1 from ultralytics import YOLO
      3 model = YOLO('yolov8m.pt')  # load a pretrained model (recommended for training)
----> 4 results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device=[0,1])

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/model.py:673, in Model.train(self, trainer, **kwargs)
    670     pass
    672 self.trainer.hub_session = self.session  # attach optional HUB session
--> 673 self.trainer.train()
    674 # Update model and cfg after training
    675 if RANK in {-1, 0}:

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/trainer.py:194, in BaseTrainer.train(self)
    192     subprocess.run(cmd, check=True)
    193 except Exception as e:
--> 194     raise e
    195 finally:
    196     ddp_cleanup(self, str(file))

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/trainer.py:192, in BaseTrainer.train(self)
    190 try:
    191     LOGGER.info(f'{colorstr("DDP:")} debug command {" ".join(cmd)}')
--> 192     subprocess.run(cmd, check=True)
    193 except Exception as e:
    194     raise e

File /opt/conda/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    524     retcode = process.poll()
    525     if check and retcode:
--> 526         raise CalledProcessError(retcode, process.args,
    527                                  output=stdout, stderr=stderr)
    528 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['/opt/conda/bin/python3.10', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '51851', '/root/.config/Ultralytics/DDP/_temp_y5tupmac138531641881456.py']' returned non-zero exit status 1.`

Environment

Ultralytics YOLOv8.2.7 🚀 Python-3.10.13 torch-2.1.2 CUDA:0 (Tesla T4, 15102MiB) CUDA:1 (Tesla T4, 15102MiB)

Minimal Reproducible Example

`from ultralytics import YOLO

model = YOLO('yolov8m.pt')  # load a pretrained model (recommended for training)
results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device=[0,1])`

Additional

No response

Are you willing to submit a PR?

[ ] Yes I'd like to help by submitting a PR!

github-actions[bot] commented 1 week ago

👋 Hello @ammar3010, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 1 week ago

It seems like you're encountering a CalledProcessError during the distributed training setup. This error usually indicates a problem with the subprocess that the training script tries to run, which may involve issues with environment configurations, dependencies, or CUDA setup.

Here are a couple of things you might try to resolve this issue:

Ensure that the CUDA version you are using is compatible with the installed version of PyTorch. Sometimes mismatches here can cause issues.
Verify that the node configurations and environment in your Kaggle kernel support distributed training. You might be limited by the number of GPUs available or other system resources.
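The two checks above can be sketched in a notebook cell as follows. This is only a rough illustration, not an official Ultralytics diagnostic: the helper name `describe_gpu_environment` is made up for this example, and it relies solely on standard PyTorch attributes (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`).

```python
import importlib.util


def describe_gpu_environment():
    """Collect the facts relevant to multi-GPU training (hypothetical helper)."""
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch  # imported lazily so the helper also runs where PyTorch is absent

    info["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None on CPU-only builds)
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    info["gpu_names"] = [
        torch.cuda.get_device_name(i) for i in range(info["gpu_count"])
    ]
    return info


report = describe_gpu_environment()
for key, value in report.items():
    print(f"{key}: {value}")

# device=[0, 1] requires at least two visible GPUs.
if report.get("gpu_count", 0) < 2:
    print("Fewer than two GPUs are visible; device=[0, 1] cannot be satisfied.")
```

If `gpu_count` is below 2, or `cuda_available` is False, the multi-GPU request itself is the likely culprit and is worth fixing before looking any deeper.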

You could simplify your setup to a single device to isolate the problem and check whether the issue persists:

results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device='0')

This modification runs the training only on one GPU (device 0). If this succeeds, the issue might be specifically related to multi-GPU setup in your environment. Feel free to share any further error messages if the issue continues! 🛠️
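One more note on finding those further error messages: the CalledProcessError in the traceback only says that the child process exited with status 1. The real exception is raised inside the `torch.distributed.run` subprocess, and in a notebook its output may not appear in the cell. Below is a minimal sketch of capturing a child process's stderr so the underlying error becomes visible. `run_and_report` is a hypothetical helper, and `demo_cmd` is a stand-in that simply fails on purpose; it is not the real DDP command. The temporary script named in the logged DDP command also appears to be removed by `ddp_cleanup`, so that exact command may not be re-runnable as is.

```python
import subprocess
import sys


def run_and_report(cmd, tail_lines=20):
    """Run cmd and, on failure, print the last lines of its stderr."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        tail = result.stderr.strip().splitlines()[-tail_lines:]
        print(f"Command exited with status {result.returncode}. Last stderr lines:")
        print("\n".join(tail))
    return result


# Stand-in for a failing training subprocess (assumption for illustration only).
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('RuntimeError: example failure\\n'); sys.exit(1)",
]
demo = run_and_report(demo_cmd)
```

The same pattern applies to any command launched with `subprocess.run`: the exit status tells you that it failed, while the captured stderr usually tells you why.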
