ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
49.99k stars 16.16k forks

stuck training on NVIDIA H100 #13010

Closed SoraJung closed 3 months ago

SoraJung commented 4 months ago

Search before asking

Question

I am training my custom dataset on a single NVIDIA H100 (80GB HBM3, 81008MiB), but training gets stuck right after the model summary is printed. The same setup works well on the NVIDIA GeForce RTX 2080 Ti and RTX 3090.

I don't know why it does not work on H100. I need your help.

Training command:

    root@548fdf5867cc:/usr/src/app# python train.py
    train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
    github: up to date with https://github.com/ultralytics/yolov5 ✅
    YOLOv5 🚀 v7.0-312-g1bcd17ee Python-3.10.9 torch-2.0.0 CUDA:0 (NVIDIA H100 80GB HBM3, 81008MiB)

    hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
    Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
    TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

    Dataset not found ⚠️, missing paths ['/usr/src/datasets/coco128/images/train2017']
    Downloading https://ultralytics.com/assets/coco128.zip to coco128.zip...
    100%|██████████| 6.66M/6.66M [00:01<00:00, 6.83MB/s]
    Dataset download success ✅ (3.4s), saved to /usr/src/datasets

                     from  n    params  module                                  arguments
      0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
      1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
      2                -1  1     18816  models.common.C3                        [64, 64, 1]
      3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
      4                -1  2    115712  models.common.C3                        [128, 128, 2]
      5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
      6                -1  3    625152  models.common.C3                        [256, 256, 3]
      7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
      8                -1  1   1182720  models.common.C3                        [512, 512, 1]
      9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
     10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
     11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
     12           [-1, 6]  1         0  models.common.Concat                    [1]
     13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
     14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
     15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
     16           [-1, 4]  1         0  models.common.Concat                    [1]
     17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
     18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
     19          [-1, 14]  1         0  models.common.Concat                    [1]
     20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
     21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
     22          [-1, 10]  1         0  models.common.Concat                    [1]
     23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
     24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
    Model summary: 214 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs

Additional

No response

github-actions[bot] commented 4 months ago

👋 Hello @SoraJung, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

glenn-jocher commented 4 months ago

@SoraJung hey there! It seems like you're encountering an issue with training on the NVIDIA H100 GPU. Given that it works well on the RTX 2080 Ti and RTX 3090, there are a few possibilities to consider:

  1. Driver Compatibility: Ensure that your NVIDIA drivers and CUDA are compatible with the H100. The H100 is a newer and more advanced GPU, which can sometimes need different driver settings or updates compared to older GPUs like the 2080 Ti or 3090.

  2. PyTorch Version: Since you're using PyTorch 2.0.0, check for any known issues with PyTorch that specifically affect new GPU models like the H100. Sometimes updating or rolling back PyTorch can solve these compatibility issues.

  3. CUDA Version: Double-check that you're using a CUDA version that fully supports your hardware. The H100 might require a recent CUDA toolkit, which in turn must be compatible with your installed driver and PyTorch build.

If everything seems to be in order and the issue persists, could you provide any specific error message or output that stops your training? This might give more insight into what's going wrong. Thanks! 🚀
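
When a run hangs without printing an error, there may be no message to share. One possible way to get some output is Python's standard `faulthandler` module, which can dump every thread's stack after a timeout. The sketch below is only an illustration: `run_with_hang_dump` and `fake_stuck_step` are hypothetical names, and the sleep stands in for a training step that never returns.

```python
import faulthandler
import tempfile
import time


def run_with_hang_dump(fn, timeout_s, dump_file):
    """Run fn(); if it is still running after timeout_s seconds,
    write the stack of every thread to dump_file."""
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=dump_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog so nothing is written after fn() returns.
        faulthandler.cancel_dump_traceback_later()


def fake_stuck_step():
    """Hypothetical stand-in for a step that takes too long."""
    time.sleep(1.0)
    return "done"


# faulthandler writes to the file descriptor, so a real file is required.
with tempfile.TemporaryFile(mode="w+") as dump:
    result = run_with_hang_dump(fake_stuck_step, 0.2, dump)
    dump.seek(0)
    hang_report = dump.read()

print(hang_report)
```

For a real training run, one option (an assumption, not something YOLOv5 does by default) is to call `faulthandler.dump_traceback_later(300, repeat=True)` near the top of the script, so the stack is written to stderr every five minutes while the process appears stuck.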

github-actions[bot] commented 3 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

lethee commented 1 month ago

I had the same situation with YOLOv7 on an H100. It's an NVIDIA architecture issue. Check your CUDA driver version: CUDA 11.8 or higher is required for the H100. I used the nvcr.io/nvidia/pytorch:22.09-py3 container image.
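
A driver check along these lines could be scripted roughly as follows. This is a sketch under stated assumptions: `MIN_DRIVER_FOR_CUDA_11_8` is a value recalled from NVIDIA's CUDA 11.8 release notes for Linux and should be verified there before relying on it, and the helper names are hypothetical. The script calls `nvidia-smi` only if it is on the PATH.

```python
import shutil
import subprocess

# ASSUMPTION: minimum Linux driver for CUDA 11.8, recalled from NVIDIA's
# release notes. Verify against the official table before relying on it.
MIN_DRIVER_FOR_CUDA_11_8 = "520.61.05"


def parse_version(text):
    """Turn '535.104.05' into (535, 104, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_at_least(installed, minimum):
    """Return True if the installed driver version is >= the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def query_driver_versions():
    """Return one driver version string per GPU, or [] if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        return []
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]


for version in query_driver_versions():
    status = "ok" if driver_at_least(version, MIN_DRIVER_FOR_CUDA_11_8) else "too old"
    print(f"driver {version}: {status}")
```

Inside a container the driver comes from the host, so the same check applies to the host installation even when the image ships a newer CUDA toolkit.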

glenn-jocher commented 1 month ago

Hi @lethee,

Thank you for sharing your experience with the NVIDIA H100. It's great to see the community helping each other out! 😊

Indeed, the H100 GPU requires CUDA 11.8 or higher. Ensuring that your CUDA driver version is up-to-date is crucial for compatibility with newer architectures like the H100. Using the nvcr.io/nvidia/pytorch:22.09-py3 container image is a good approach as it comes pre-configured with the necessary dependencies.

For anyone facing similar issues, here are a few steps to ensure compatibility:

  1. Update CUDA Drivers: Make sure your NVIDIA driver supports CUDA 11.8 or higher. Note that nvcc reports the version of the installed CUDA toolkit, while nvidia-smi reports the driver version and the highest CUDA version that driver supports. You can check your toolkit version with:

    nvcc --version

  2. Use Compatible Docker Image: Using a pre-configured Docker image like nvcr.io/nvidia/pytorch:22.09-py3 can simplify the setup process. You can pull and run this image with:

    docker pull nvcr.io/nvidia/pytorch:22.09-py3
    docker run --gpus all -it nvcr.io/nvidia/pytorch:22.09-py3

  3. Verify PyTorch and CUDA Compatibility: Ensure that your PyTorch installation is compatible with your CUDA version. You can install the appropriate version of PyTorch with:

    pip install torch==2.0.0+cu118 torchvision==0.15.0+cu118 -f https://download.pytorch.org/whl/torch_stable.html
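
To see whether an installed PyTorch build actually targets the GPU, one can compare the device's compute capability with the architectures the build was compiled for. The following is a minimal sketch, not an official diagnostic: `build_supports_device` is a hypothetical helper that applies a simplified version of CUDA's compatibility rules (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for equal or newer devices). The H100 reports compute capability 9.0.

```python
import re


def build_supports_device(capability, arch_list):
    """Simplified check: can a build with these architectures run on the device?

    capability: (major, minor) tuple, e.g. (9, 0) for an H100.
    arch_list:  strings such as 'sm_80' or 'compute_90', as returned by
                torch.cuda.get_arch_list().
    """
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", arch)
        if match is None:
            continue
        kind = match.group(1)
        arch_major, arch_minor = int(match.group(2)), int(match.group(3))
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True  # native binary for this GPU family
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True  # PTX that the driver can JIT-compile
    return False


def report():
    """Print what the local PyTorch build and GPU look like, if available."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("device:", torch.cuda.get_device_name(0), "| capability:", capability)
    print("build architectures:", arch_list)
    print("build supports device:", build_supports_device(capability, arch_list))


if __name__ == "__main__":
    report()
```

If the last line prints `False` on an H100, the installed wheel was likely built without `sm_90` support, and a build for a newer CUDA version may be needed.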

If the issue persists after these steps, please ensure that you are using the latest version of YOLOv5 by pulling the latest changes from the repository. If the problem continues, feel free to open an issue with detailed logs and system information so we can assist you further.

Thank you for your patience and for being a part of the YOLO community! 🚀