ultralytics / ultralytics

NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

CalledProcessError while training yolov8 on kaggle #11682

Open ammar3010 opened 1 week ago

ammar3010 commented 1 week ago

Search before asking

YOLOv8 Component

No response

Bug

`CalledProcessError                        Traceback (most recent call last)
Cell In[11], line 4
      1 from ultralytics import YOLO
      3 model = YOLO('yolov8m.pt')  # load a pretrained model (recommended for training)
----> 4 results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device=[0,1])

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/model.py:673, in Model.train(self, trainer, **kwargs)
    670 pass
    672 self.trainer.hub_session = self.session  # attach optional HUB session
--> 673 self.trainer.train()
    674 # Update model and cfg after training
    675 if RANK in {-1, 0}:

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/trainer.py:194, in BaseTrainer.train(self)
    192 subprocess.run(cmd, check=True)
    193 except Exception as e:
--> 194 raise e
    195 finally:
    196 ddp_cleanup(self, str(file))

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/trainer.py:192, in BaseTrainer.train(self)
    190 try:
    191 LOGGER.info(f'{colorstr("DDP:")} debug command {" ".join(cmd)}')
--> 192 subprocess.run(cmd, check=True)
    193 except Exception as e:
    194 raise e

File /opt/conda/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    524 retcode = process.poll()
    525 if check and retcode:
--> 526 raise CalledProcessError(retcode, process.args,
    527 output=stdout, stderr=stderr)
    528 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['/opt/conda/bin/python3.10', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '51851', '/root/.config/Ultralytics/DDP/_temp_y5tupmac138531641881456.py']' returned non-zero exit status 1.`

Environment

Ultralytics YOLOv8.2.7 🚀 Python-3.10.13 torch-2.1.2 CUDA:0 (Tesla T4, 15102MiB) CUDA:1 (Tesla T4, 15102MiB)

Minimal Reproducible Example

`from ultralytics import YOLO

model = YOLO('yolov8m.pt')  # load a pretrained model (recommended for training)
results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device=[0,1])`

Additional

No response

Are you willing to submit a PR?

github-actions[bot] commented 1 week ago

👋 Hello @ammar3010, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 1 week ago

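It looks like the CalledProcessError comes from the distributed (DDP) launch: with device=[0,1], the trainer starts the torch.distributed.run command shown in your traceback as a subprocess, and that subprocess exited with status 1. The exception only reports the exit status, so the underlying failure will be in the subprocess's own output. Typical causes are environment configuration, dependencies, or CUDA setup.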
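To illustrate why the exit status alone says so little, here is a minimal standard-library sketch. It does not use Ultralytics at all: the child command is a hypothetical stand-in for the failing DDP worker, and `run_and_report` is a helper name invented for this example. It only shows that the real error message is in the child's stderr, not in the CalledProcessError text.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (returncode, stderr) so the child's own error is visible."""
    try:
        done = subprocess.run(cmd, check=True, capture_output=True, text=True)
        return done.returncode, done.stderr
    except subprocess.CalledProcessError as err:
        # err.returncode is all that the exception message reports;
        # err.stderr holds the child's traceback because output was captured.
        return err.returncode, err.stderr


# Hypothetical stand-in for the DDP worker: a child that raises and exits non-zero.
child = [sys.executable, "-c", "raise RuntimeError('dataset not found')"]
code, stderr = run_and_report(child)
print("exit status:", code)
print("last stderr line:", stderr.strip().splitlines()[-1])
```

In the traceback above, subprocess.run is called with check=True and without captured output, so the worker's error would be printed wherever the subprocess writes its stderr (possibly the notebook logs) and not in the cell's exception.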

Here are a couple of things you might try to resolve this issue:

  1. Ensure that the CUDA version you are using is compatible with the installed version of PyTorch. Sometimes mismatches here can cause issues.
  2. Verify that the node configurations and environment in your Kaggle kernel support distributed training. You might be limited by the number of GPUs available or other system resources.
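Both checks can be done from a notebook cell. The following is a rough sketch, assuming only that PyTorch may or may not be importable; `describe_cuda` is a hypothetical helper name, and the fields it prints are standard PyTorch attributes.

```python
def describe_cuda():
    """Collect the facts relevant to the two checks above."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),  # device=[0,1] needs at least 2
    }


info = describe_cuda()
for key, value in info.items():
    print(f"{key}: {value}")
```

If gpu_count is below 2, or cuda_available is False, device=[0,1] cannot work in that session regardless of the training code.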

You could simplify your setup to a single device to isolate the problem and confirm if the issue still persists:

results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device='0')

This modification runs the training only on one GPU (device 0). If this succeeds, the issue might be specifically related to multi-GPU setup in your environment. Feel free to share any further error messages if the issue continues! 🛠️