CalledProcessError while training yolov8 on kaggle #11682

Open ammar3010 opened 1 week ago

ammar3010 commented 1 week ago

`CalledProcessError Traceback (most recent call last)
Cell In[11], line 4
1 from ultralytics import YOLO
3 model = YOLO('yolov8m.pt') # load a pretrained model (recommended for training)
----> 4 results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device=[0,1])

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/model.py:673, in Model.train(self, trainer, **kwargs)
670 pass
672 self.trainer.hub_session = self.session # attach optional HUB session
--> 673 self.trainer.train()
674 # Update model and cfg after training
675 if RANK in {-1, 0}:

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/trainer.py:194, in BaseTrainer.train(self)
192 subprocess.run(cmd, check=True)
193 except Exception as e:
--> 194 raise e
195 finally:
196 ddp_cleanup(self, str(file))

File /opt/conda/lib/python3.10/site-packages/ultralytics/engine/trainer.py:192, in BaseTrainer.train(self)
190 try:
191 LOGGER.info(f'{colorstr("DDP:")} debug command {" ".join(cmd)}')
--> 192 subprocess.run(cmd, check=True)
193 except Exception as e:
194 raise e

File /opt/conda/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
524 retcode = process.poll()
525 if check and retcode:
--> 526 raise CalledProcessError(retcode, process.args,
527 output=stdout, stderr=stderr)
528 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['/opt/conda/bin/python3.10', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '51851', '/root/.config/Ultralytics/DDP/_temp_y5tupmac138531641881456.py']' returned non-zero exit status 1.`


Ultralytics YOLOv8.2.7 🚀 Python-3.10.13 torch-2.1.2 CUDA:0 (Tesla T4, 15102MiB) CUDA:1 (Tesla T4, 15102MiB)

`from ultralytics import YOLO

model = YOLO('yolov8m.pt') # load a pretrained model (recommended for training)
results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device=[0,1])`


glenn-jocher commented 1 week ago

It seems like you're encountering a CalledProcessError during the distributed training setup. This error usually indicates a problem with the subprocess that the training script tries to run, which may involve issues with environment configurations, dependencies, or CUDA setup.

Here are a couple of things you might try to resolve this issue:

  1. Ensure that the CUDA version you are using is compatible with the installed version of PyTorch. Sometimes mismatches here can cause issues.
  2. Verify that the node configurations and environment in your Kaggle kernel support distributed training. You might be limited by the number of GPUs available or other system resources.
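Both checks can be done from a notebook cell. The following is only a minimal sketch: the helper name `describe_gpu_environment` is made up for illustration, and it only reports what PyTorch itself sees, which may not cover every cause of a DDP failure.

```python
import importlib.util


def describe_gpu_environment():
    """Collect the PyTorch/CUDA facts relevant to the two checks above."""
    info = {"torch": None, "cuda_build": None, "cuda_available": False, "gpus": []}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment
    import torch

    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU-only builds)
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # One entry per GPU visible to this process
        info["gpus"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


env = describe_gpu_environment()
print(env)
```

For `device=[0,1]` to be usable, `gpus` should list two devices. On Kaggle that generally means the notebook accelerator has to be set to the dual-T4 option.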

You could simplify your setup to a single device to isolate the problem and confirm if the issue still persists:

results = model.train(data='/kaggle/input/headcount/data.yaml', epochs=30, imgsz=480, device='0')

This modification runs the training on only one GPU (device 0). If this succeeds, the issue is likely specific to the multi-GPU setup in your environment. Feel free to share any further error messages if the issue continues! 🛠️
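The `CalledProcessError` above only reports that the DDP subprocess exited with status 1. The underlying error is whatever that subprocess wrote to stderr. As a rough sketch of how to capture that output, the hypothetical helper below runs a command without `check=True` and returns the tail of its stderr. The failing child process here is a stand-in, not the actual Ultralytics DDP command.

```python
import subprocess
import sys


def run_and_show_stderr(cmd, tail=20):
    """Run cmd and return (exit_code, last `tail` lines of its stderr)."""
    # No check=True: a non-zero exit code is returned instead of raised
    proc = subprocess.run(cmd, capture_output=True, text=True)
    stderr_tail = proc.stderr.strip().splitlines()[-tail:]
    return proc.returncode, stderr_tail


# Stand-in for a failing worker: a child Python process that raises.
code, lines = run_and_show_stderr(
    [sys.executable, "-c", "raise RuntimeError('worker failed')"]
)
print("exit code:", code)
print("\n".join(lines))
```

In practice the command would be your own training script (an assumption here, for example one saved to a file and launched from the notebook), so that the worker's traceback can be pasted into the issue.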