open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0
7.97k stars 2.57k forks

The training did not report any errors, but it seemed like it had stopped, but the GPU was still in use #3403

Open nicekwow opened 10 months ago

nicekwow commented 10 months ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

[image] Training did not report any errors, but it appeared to have stalled while the GPU was still in use. [image]

Reproduction

  1. What command or script did you run?

    python tools/train.py configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512_mydata.py
  2. Did you make any modifications on the code or config? Did you understand what you have modified? I used my own dataset for training and loaded the pretrained weights obtained from mmpretrain.

  3. What dataset did you use? My own dataset, like this: [image] Both the train and val sets contain content like this: [image]

Environment

sys.platform: win32
Python: 3.8.0 (default, Nov 6 2019, 16:00:02) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3080
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\bin\nvcc.exe C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin\nvcc.exe C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30145 for x64
GCC: n/a
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.1
OpenCV: 4.8.1
MMEngine: 0.9.0
MMSegmentation: 1.2.1+cbf9af1

lxy-mini commented 1 month ago

Check your label data format. I used RGB labels and had the same problem as you.
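
In case it helps, here is a rough sketch of how the label format could be checked and converted. It assumes the labels are supposed to be single-channel index masks (one class ID per pixel) rather than 3-channel RGB images. The `PALETTE` mapping and the function names `is_index_mask` and `rgb_to_index` are hypothetical and not part of MMSegmentation; replace the colors with the ones your own labels actually use.

```python
import numpy as np

# Hypothetical palette: maps each RGB color found in the label images
# to a class index. These colors are placeholders, not from the issue.
PALETTE = {
    (0, 0, 0): 0,      # background
    (128, 0, 0): 1,    # class 1
    (0, 128, 0): 2,    # class 2
}


def is_index_mask(label: np.ndarray) -> bool:
    """Return True if the label is a single-channel (H, W) integer array."""
    return label.ndim == 2 and np.issubdtype(label.dtype, np.integer)


def rgb_to_index(label_rgb: np.ndarray, palette=PALETTE,
                 ignore_index: int = 255) -> np.ndarray:
    """Convert an (H, W, 3) RGB label into an (H, W) uint8 index mask.

    Pixels whose color is not in the palette are set to ignore_index.
    """
    if label_rgb.ndim != 3 or label_rgb.shape[2] != 3:
        raise ValueError(f"expected an (H, W, 3) array, got shape {label_rgb.shape}")
    mask = np.full(label_rgb.shape[:2], ignore_index, dtype=np.uint8)
    for color, class_id in palette.items():
        # Select pixels where all three channels match this palette color.
        matches = np.all(label_rgb == np.array(color, dtype=label_rgb.dtype), axis=-1)
        mask[matches] = class_id
    return mask
```

A label file can be loaded into an array with something like `np.array(Image.open(path))` from Pillow. If `is_index_mask` returns False for it, converting with `rgb_to_index` and saving the result as a single-channel PNG may be worth trying before training again.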