Closed SoraJung closed 3 months ago
👋 Hello @SoraJung, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.
Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!
Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.
Check out our YOLOv8 Docs for details and get started with:
pip install ultralytics
@SoraJung hey there! It seems like you're encountering an issue with training on the NVIDIA H100 GPU. Given that it works well on the RTX 2080 Ti and RTX 3090, there are a few possibilities to consider:
Driver Compatibility: Ensure that your NVIDIA drivers and CUDA are compatible with the H100. The H100 is a newer and more advanced GPU, which can sometimes need different driver settings or updates compared to older GPUs like the 2080 Ti or 3090.
PyTorch Version: Since you're using PyTorch 2.0.0, check for any known issues with PyTorch that specifically affect new GPU models like the H100. Sometimes updating or rolling back PyTorch can solve these compatibility issues.
CUDA Version: Double-check you're deploying the correct CUDA version that fully supports your hardware. The hardware might require the latest CUDA toolkit, which should be compatible with your current software and drivers as well.
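The three checks above can be collected from inside Python. The following is a minimal sketch, not part of YOLOv5: the helper names `arch_tag`, `build_targets_device`, and `collect` are made up for illustration, and it assumes the H100 reports compute capability 9.0 (`sm_90`). It compares the GPU's compute capability with the architectures the installed PyTorch build lists, which is only a rough indicator, since a build can sometimes still run on a newer GPU through PTX.

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_targets_device(capability, arch_list):
    """True if the build's architecture list names the device's exact arch."""
    return arch_tag(capability) in arch_list


def collect():
    """Gather version info from PyTorch; returns None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    info = {
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,  # None on CPU-only builds
        "gpu_visible": torch.cuda.is_available(),
    }
    if info["gpu_visible"]:
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = capability
        info["arch_list"] = arch_list
        info["arch_in_build"] = build_targets_device(capability, arch_list)
    return info


if __name__ == "__main__":
    report = collect()
    if report is None:
        print("PyTorch is not installed in this environment")
    else:
        for key, value in report.items():
            print(f"{key}: {value}")
```

If `arch_in_build` comes back `False` on the H100, the PyTorch build is a likely suspect and is worth checking before anything in the training code.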
If everything seems to be in order and the issue persists, could you provide any specific error message or output that stops your training? This might give more insight into what's going wrong. Thanks!
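A run that hangs often prints no error at all. One way to get usable output in that case is Python's standard `faulthandler` module, which can dump the stack of every thread after a timeout. The sketch below is an assumption about how one might wire it in; `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical helper names, not YOLOv5 functions, and the 120-second default is an arbitrary choice.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=120, stream=None):
    """Dump all thread stacks to `stream` every `timeout_s` seconds until disarmed.

    `stream` must be a real file object (it needs a file descriptor);
    it defaults to sys.stderr.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disarm_hang_watchdog():
    """Cancel the pending dump, for example once training has started."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` near the start of the training script would make a stuck process periodically print the Python line each thread is waiting on, which can then be pasted into the issue.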
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
I have the same situation with YOLOv7 on an H100.
It's an NVIDIA architecture issue. Check your CUDA driver version; 11.8 or higher is required for the H100.
I use the nvcr.io/nvidia/pytorch:22.09-py3 container image.
Hi @lethee,
Thank you for sharing your experience with the NVIDIA H100. It's great to see the community helping each other out!
Indeed, the H100 GPU requires CUDA 11.8 or higher. Ensuring that your CUDA driver version is up-to-date is crucial for compatibility with newer architectures like the H100. Using the nvcr.io/nvidia/pytorch:22.09-py3 container image is a good approach as it comes pre-configured with the necessary dependencies.
For anyone facing similar issues, here are a few steps to ensure compatibility:
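Update CUDA Drivers: Make sure you have CUDA 11.8 or higher installed. The command below reports the installed CUDA toolkit version (nvidia-smi reports the driver version and the highest CUDA version that driver supports):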
nvcc --version
Use Compatible Docker Image: Using a pre-configured Docker image like nvcr.io/nvidia/pytorch:22.09-py3 can simplify the setup process. You can pull and run this image with:
docker pull nvcr.io/nvidia/pytorch:22.09-py3
docker run --gpus all -it nvcr.io/nvidia/pytorch:22.09-py3
Verify PyTorch and CUDA Compatibility: Ensure that your PyTorch installation is compatible with your CUDA version. You can install the appropriate version of PyTorch with:
pip install torch==2.0.0+cu118 torchvision==0.15.0+cu118 -f https://download.pytorch.org/whl/torch_stable.html
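The toolkit check from the first step can be scripted. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the parser assumes `nvcc --version` prints a line of the form `Cuda compilation tools, release 11.8, V11.8.89`. Note that `nvcc` describes the system toolkit only; a pip-installed PyTorch wheel ships its own CUDA runtime, which is reported by `torch.version.cuda`.

```python
import re
import shutil
import subprocess

MIN_CUDA = (11, 8)  # minimum stated in this thread for the H100


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=MIN_CUDA):
    """Compare as integer tuples so a two-digit minor version sorts correctly."""
    return version is not None and tuple(version) >= tuple(minimum)


def local_nvcc_release():
    """Run nvcc if it is on PATH; returns None when missing or unparsable."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = local_nvcc_release()
    print("nvcc release:", release, "| meets 11.8:", meets_minimum(release))
```

Comparing tuples rather than strings or floats avoids the usual pitfall where a version such as "11.10" would be treated as smaller than "11.8".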
If the issue persists after these steps, please ensure that you are using the latest version of YOLOv5 by pulling the latest changes from the repository. If the problem continues, feel free to open an issue with detailed logs and system information so we can assist you further.
Thank you for your patience and for being a part of the YOLO community!
Search before asking
Question
I am training on my custom dataset on a single NVIDIA H100 (80GB HBM3, 81008MiB), but training gets stuck after the model summary is printed. It works well on the NVIDIA GeForce RTX 2080 Ti and RTX 3090.
I don't know why it does not work on the H100. I need your help.
Training command: ` root@548fdf5867cc:/usr/src/app# python train.py
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v7.0-312-g1bcd17ee Python-3.10.9 torch-2.0.0 CUDA:0 (NVIDIA H100 80GB HBM3, 81008MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

Dataset not found ⚠️, missing paths ['/usr/src/datasets/coco128/images/train2017']
Downloading https://ultralytics.com/assets/coco128.zip to coco128.zip...
100%|██████████| 6.66M/6.66M [00:01<00:00, 6.83MB/s]
Dataset download success ✅ (3.4s), saved to /usr/src/datasets

 0            -1  1      3520  models.common.Conv                    [3, 32, 6, 2, 2]
 1            -1  1     18560  models.common.Conv                    [32, 64, 3, 2]
 2            -1  1     18816  models.common.C3                      [64, 64, 1]
 3            -1  1     73984  models.common.Conv                    [64, 128, 3, 2]
 4            -1  2    115712  models.common.C3                      [128, 128, 2]
 5            -1  1    295424  models.common.Conv                    [128, 256, 3, 2]
 6            -1  3    625152  models.common.C3                      [256, 256, 3]
 7            -1  1   1180672  models.common.Conv                    [256, 512, 3, 2]
 8            -1  1   1182720  models.common.C3                      [512, 512, 1]
 9            -1  1    656896  models.common.SPPF                    [512, 512, 5]
10            -1  1    131584  models.common.Conv                    [512, 256, 1, 1]
11            -1  1         0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
12       [-1, 6]  1         0  models.common.Concat                  [1]
13            -1  1    361984  models.common.C3                      [512, 256, 1, False]
14            -1  1     33024  models.common.Conv                    [256, 128, 1, 1]
15            -1  1         0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
16       [-1, 4]  1         0  models.common.Concat                  [1]
17            -1  1     90880  models.common.C3                      [256, 128, 1, False]
18            -1  1    147712  models.common.Conv                    [128, 128, 3, 2]
19      [-1, 14]  1         0  models.common.Concat                  [1]
20            -1  1    296448  models.common.C3                      [256, 256, 1, False]
21            -1  1    590336  models.common.Conv                    [256, 256, 3, 2]
22      [-1, 10]  1         0  models.common.Concat                  [1]
23            -1  1   1182720  models.common.C3                      [512, 512, 1, False]
24  [17, 20, 23]  1    229245  models.yolo.Detect                    [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 214 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs `
Additional
No response