open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

[Bug] CUDA out of memory in RTMDet-Ins on custom dataset with > 100 ground truths per img #9616

Open mira-murali opened 1 year ago

mira-murali commented 1 year ago

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

3.x branch https://github.com/open-mmlab/mmdetection/tree/3.x

Environment

mira@Dell-Precision:/mmdetection$ python3 mmdet/utils/collect_env.py
sys.platform: linux
Python: 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 2060
GPU 1: NVIDIA T400
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.13.1+cu116
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.14.1+cu116
OpenCV: 4.6.0
MMEngine: 0.3.0
MMDetection: 3.0.0rc4+7185b5a

Additional installation/environment information

Installed inside a Docker container based on the example Dockerfile, but pulling dev-3.x because I started working on this before it was merged into 3.x. I verified that there haven't been any changes to the specific code snippets that would help with the OOM error.

RUN apt-get update \
    && apt-get install --no-install-recommends -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install MMEngine and MMCV
RUN pip install openmim && \
    mim install "mmengine==0.3.0" "mmcv>=2.0.0rc1"

# Install MMDetection
RUN git clone https://github.com/open-mmlab/mmdetection.git -b dev-3.x /mmdetection \
    && cd /mmdetection \
    && pip install --no-cache-dir -e .

Reproduces the problem - code sample

Config file run for training. Classes and meta info hidden for privacy:

# rtmdet-ins_tiny_1xb2-200e.py
_base_ = "/mmdetection/configs/rtmdet/rtmdet-ins_tiny_8xb32-300e_coco.py"

checkpoint = (
    "https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth"  # noqa
)

data_root = "/home/mira/RTMDet/rtmdet_ins_data/"

model = dict(bbox_head=dict(num_classes=5, in_channels=96, feat_channels=96))

train_pipeline_stage2 = [
    dict(
        type='LoadImageFromFile',
        file_client_args={{_base_.file_client_args}}),
    dict(
        type='LoadAnnotations',
        with_bbox=True,
        with_mask=True,
        poly2mask=False),
    dict(
        type='RandomResize',
        scale=(1280, 720),
        ratio_range=(0.5, 2.0),
        keep_ratio=True),
    dict(
        type='RandomCrop',
        crop_size=(640, 480),
        recompute_bbox=True,
        allow_negative_crop=True),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1)),
    dict(type='YOLOXHSVRandomAug'),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Pad', size=(640, 480), pad_val=dict(img=(114, 114, 114))),
    dict(type='PackDetInputs')
]

log_interval = 20
val_epoch_interval = 10
max_epochs = 200
stage2_num_epochs = 10
base_lr = 0.004

train_cfg = dict(
    max_epochs=max_epochs, val_interval=val_epoch_interval, dynamic_intervals=[(max_epochs - stage2_num_epochs, 1)]
)

train_dataloader = dict(
    batch_size=2,
    dataset=dict(
        metainfo=metainfo,
        data_root=data_root,
        ann_file="coco_labels/train_annotations2023.json",
        data_prefix=dict(img="train_images/"),
    ),
)

val_dataloader = dict(
    dataset=dict(
        ann_file="coco_labels/val_annotations2023.json",
        metainfo=metainfo,
        data_root=data_root,
        data_prefix=dict(img="val_images/"),
    )
)

test_dataloader = dict(
    dataset=dict(
        ann_file="coco_labels/test_annotations2023.json",
        metainfo=metainfo,
        data_root=data_root,
        data_prefix=dict(img="test_images/"),
    )
)

val_evaluator = dict(ann_file=data_root + "coco_labels/val_annotations2023.json")
test_evaluator = dict(ann_file=data_root + "coco_labels/test_annotations2023.json")

param_scheduler = [
    dict(
        # use cosine lr from epoch 100 to 200 (max_epochs // 2 to max_epochs)
        type="CosineAnnealingLR",
        eta_min=base_lr * 0.05,
        begin=max_epochs // 2,
        end=max_epochs,
        T_max=max_epochs // 2,
        by_epoch=True,
        convert_to_iter_based=True,
    )
]
default_hooks = dict(
    logger=dict(type="LoggerHook", interval=log_interval),
    checkpoint=dict(interval=val_epoch_interval, max_keep_ckpts=3),
)  # only keep latest 3 checkpoints
custom_hooks = [
    dict(type="EMAHook", ema_type="ExpMomentumEMA", momentum=0.0002, update_buffers=True, priority=49),
    dict(type="PipelineSwitchHook", switch_epoch=max_epochs - stage2_num_epochs, switch_pipeline=train_pipeline_stage2),
]

Reproduces the problem - command or script

mira@Dell-Precision:/mmdetection$ python3 tools/train.py ~/RTMDet/configs/rtmdet-ins_tiny_1xb2-200e.py --work-dir ~/RTMDet/Exp2_logs --resume ~/RTMDet/Exp1_logs/epoch_30.pth

Reproduces the problem - error message

01/11 17:31:41 - mmengine - INFO - Epoch(train) [31][20/41]  lr: 4.0000e-03  eta: 2:26:33  time: 1.3264  data_time: 0.0113  memory: 4257  loss: 1.4234  loss_cls: 0.3436  loss_bbox: 0.5839  loss_mask: 0.4959
01/11 17:32:28 - mmengine - INFO - Epoch(train) [31][40/41]  lr: 4.0000e-03  eta: 3:29:35  time: 1.7126  data_time: 0.0127  memory: 7018  loss: 1.4461  loss_cls: 0.3816  loss_bbox: 0.5821  loss_mask: 0.4823
01/11 17:32:30 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-200e_20230111_173110
01/11 17:33:31 - mmengine - INFO - Epoch(train) [32][20/41]  lr: 4.0000e-03  eta: 4:15:49  time: 2.4690  data_time: 0.0107  memory: 7136  loss: 1.4545  loss_cls: 0.3468  loss_bbox: 0.6041  loss_mask: 0.5036
Traceback (most recent call last):
  File "tools/train.py", line 130, in <module>
    main()
  File "tools/train.py", line 126, in main
    runner.train()
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/runner.py", line 1661, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 90, in run
    self.run_epoch()
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 106, in run_epoch
    self.run_iter(idx, data_batch)
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 122, in run_iter
    outputs = self.runner.model.train_step(
  File "/usr/local/lib/python3.8/dist-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
    losses = self._run_forward(data, mode='loss')  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/mmengine/model/base_model/base_model.py", line 320, in _run_forward
    results = self(**data, mode=mode)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mmdetection/mmdet/models/detectors/base.py", line 92, in forward
    return self.loss(inputs, data_samples)
  File "/mmdetection/mmdet/models/detectors/single_stage.py", line 78, in loss
    losses = self.bbox_head.loss(x, batch_data_samples)
  File "/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/mmdetection/mmdet/models/dense_heads/rtmdet_ins_head.py", line 748, in loss_by_feat
    loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
  File "/mmdetection/mmdet/models/dense_heads/rtmdet_ins_head.py", line 653, in loss_mask_by_feat
    loss_mask = self.loss_mask(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mmdetection/mmdet/models/losses/dice_loss.py", line 137, in forward
    loss = self.loss_weight * dice_loss(
  File "/mmdetection/mmdet/models/losses/dice_loss.py", line 47, in dice_loss
    a = torch.sum(input * target, 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 478.00 MiB (GPU 0; 11.75 GiB total capacity; 8.98 GiB already allocated; 77.44 MiB free; 10.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Additional information

Expected Result

Training without an OOM error.

Dataset

Custom dataset of 100 images for instance segmentation with >200 polygons per image. Resolution of images: 1280 x 720

Hardware

NVIDIA RTX 2060. Also tried training on NVIDIA RTX 3080. Both have a GPU memory of 12 GB.

Additional description/information

Based on reading the FAQ and looking through issues #188 and [#1581](https://github.com/open-mmlab/mmdetection/issues/1581), and given the high number of ground truths per image, I assumed that the problem was that gpu_assign_thr needed to be set so that the assignment computation takes place on the CPU instead of the GPU.

However, RTMDet uses DynamicSoftLabelAssigner rather than MaxIoUAssigner, and DynamicSoftLabelAssigner does not expose a configurable gpu_assign_thr parameter. Switching the assigner to MaxIoUAssigner in the config as shown below

model = dict(
    bbox_head=dict(num_classes=5, in_channels=96, feat_channels=96),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            match_low_quality=False,
            ignore_iof_thr=-1,
            gpu_assign_thr=5),
        allowed_border=-1,
        pos_weight=-1,
        debug=False))

resulted in the following output:

01/11 20:00:57 - mmengine - INFO - load backbone. in model from: https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth
http loads checkpoint from path: https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth
01/11 20:00:57 - mmengine - INFO - Checkpoints will be saved to /home/mira/RTMDet/Exp19_maxiou_logs.
/usr/local/lib/python3.8/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
01/11 20:01:06 - mmengine - INFO - Epoch(train) [1][20/41]  lr: 4.0000e-03  eta: 0:00:27  time: 0.4417  data_time: 0.0270  memory: 1072  loss: 0.7659  loss_cls: 0.2922  loss_bbox: 0.1749  loss_mask: 0.2988
01/11 20:01:19 - mmengine - INFO - Epoch(train) [1][40/41]  lr: 4.0000e-03  eta: 0:00:23  time: 0.5636  data_time: 0.0140  memory: 1735  loss: 0.9751  loss_cls: 0.3478  loss_bbox: 0.2383  loss_mask: 0.3890
01/11 20:01:20 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-2e_20230111_200052
01/11 20:01:20 - mmengine - INFO - Saving checkpoint at 1 epochs
01/11 20:01:22 - mmengine - INFO - Evaluating bbox...
Loading and preparing results...
01/11 20:01:22 - mmengine - ERROR - /mmdetection/mmdet/evaluation/metrics/coco_metric.py - compute_metrics - 437 - The testing results of the whole dataset is empty.
01/11 20:01:22 - mmengine - INFO - Epoch(val) [1][1/1]  
01/11 20:01:22 - mmengine - INFO - Switch pipeline now!
01/11 20:01:27 - mmengine - INFO - Epoch(train) [2][20/41]  lr: 2.3179e-03  eta: 0:00:09  time: 0.4578  data_time: 0.0163  memory: 821  loss: 0.6599  loss_cls: 0.2235  loss_bbox: 0.1650  loss_mask: 0.2714
01/11 20:01:32 - mmengine - INFO - Epoch(train) [2][40/41]  lr: 2.2227e-04  eta: 0:00:00  time: 0.3217  data_time: 0.0164  memory: 800  loss: 0.3189  loss_cls: 0.0869  loss_bbox: 0.0804  loss_mask: 0.1516
01/11 20:01:32 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-1e_tomato_20230111_200052
01/11 20:01:32 - mmengine - INFO - Saving checkpoint at 2 epochs
01/11 20:01:34 - mmengine - INFO - Evaluating bbox...
Loading and preparing results...
01/11 20:01:34 - mmengine - ERROR - /mmdetection/mmdet/evaluation/metrics/coco_metric.py - compute_metrics - 437 - The testing results of the whole dataset is empty.
01/11 20:01:34 - mmengine - INFO - Epoch(val) [2][1/1]

However, switching to MaxIoUAssigner did not lead to an OOM error over multiple epochs, which leads me to believe the problem is the high number of polygons. But inference with the trained model outputs no predictions and, as shown in the log above, throws an error saying that the testing results of the whole dataset are empty. Reading through the issues (#9381), this is sometimes attributed to an incorrectly formatted ground truth, but since the data has not changed, that doesn't seem plausible.

To summarize:

  1. Apart from increasing GPU memory, are there any other solutions to this problem?
  2. Is there a way to pass gpu_assign_thr to DynamicSoftLabelAssigner?
  3. Why does using MaxIoUAssigner with RTMDet result in no inference results? This seems like a bug.
  4. The with_cp argument (suggested in the FAQ for OOM issues) does not exist in CSPNeXt, the backbone for RTMDet. Are there plans to add it?
  5. The documentation for RTMDet seems to lack information on how to train with FP16, if that's an option as suggested in the FAQ. Please advise.

I'm not sure if this is entirely a bug or a feature request but it seems to be a bit of both.

RangiLyu commented 1 year ago

Thanks for your bug report! We are working on optimizing the memory footprint of RTMDet.

mira-murali commented 1 year ago

Okay, for now, does that mean none of the suggestions in the FAQ apply for RTMDet?

RangiLyu commented 1 year ago

Try adding '--amp' to enable FP16 training.
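
For reference, a minimal sketch of what that flag changes, assuming the AdamW settings from the RTMDet base config: tools/train.py's --amp switches the optimizer wrapper to mmengine's AmpOptimWrapper with dynamic loss scaling, which can also be set directly in the config:

optim_wrapper = dict(
    type='AmpOptimWrapper',
    loss_scale='dynamic',
    optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05))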

mchaniotakis commented 1 year ago

I have also noticed this pattern of increasing memory usage in the first and second epochs.

lyf6 commented 1 year ago

Has this problem been solved?

twmht commented 1 year ago

+1, same problem with FP16 training; it always OOMs in the middle of training.

mira-murali commented 1 year ago

I don't think it's been solved but I haven't checked the latest updates. fp16 training didn't work for me either. I eventually just ended up using an AWS instance with a higher GPU memory and a lower batch size to be able to train.

qwert31639 commented 1 year ago

I don't think it's been solved but I haven't checked the latest updates. fp16 training didn't work for me either. I eventually just ended up using an AWS instance with a higher GPU memory and a lower batch size to be able to train.

Maybe adding "@torch.no_grad()" can solve your problem? https://github.com/open-mmlab/mmdetection/blob/61dd8d518b13c7ee4bdf609595b7e803f3ac0224/mmdet/models/task_modules/assigners/dynamic_soft_label_assigner.py#L66
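
For anyone who wants to try this without editing the installed package, a minimal sketch of applying the suggestion as a registered subclass (the class name NoGradDynamicSoftLabelAssigner is made up for illustration):

import torch

from mmdet.models.task_modules.assigners import DynamicSoftLabelAssigner
from mmdet.registry import TASK_UTILS


@TASK_UTILS.register_module()
class NoGradDynamicSoftLabelAssigner(DynamicSoftLabelAssigner):
    """DynamicSoftLabelAssigner whose assignment runs without autograd."""

    @torch.no_grad()
    def assign(self, *args, **kwargs):
        # Assignment only produces training targets (indices/labels), so the
        # pairwise cost tensors built here should not need a gradient graph.
        return super().assign(*args, **kwargs)

It can then be selected from the config (model.train_cfg.assigner) via custom_imports, the same mechanism used for the custom sampler later in this thread.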

SimonGuoNjust commented 1 year ago

In my case, this problem was solved by adding "@AvoidCUDAOOM.retry_if_cuda_oom" to the loss_mask_by_feat function and by adding a max_mask_to_train limit to constrain the number of masks fed to the loss module (similar to the YOLACT implementation). I am not sure which modification did the trick.
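
For context, AvoidCUDAOOM lives in mmdet.utils, and retry_if_cuda_oom can wrap any function: on a CUDA OOM it retries with FP16 inputs and then with CPU inputs (as the warnings later in this thread show). A minimal, self-contained sketch of the decorator usage follows; in the comment above it was applied to RTMDetInsHead.loss_mask_by_feat by editing mmdet/models/dense_heads/rtmdet_ins_head.py in the same way:

import torch

from mmdet.utils import AvoidCUDAOOM


@AvoidCUDAOOM.retry_if_cuda_oom
def elementwise_product_sum(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Retried with FP16 inputs, and then with CPU inputs, if this allocation OOMs.
    return (a * b).sum(dim=1)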

Lopside1 commented 1 year ago

I get this error, but only during the validation steps. Even with the batch size set to 4 and only ~30% of GPU RAM in use (RTX 4090, 24 GB), memory is stable during training, but during validation the GPU memory varies wildly.

I only need one mask per image, so if this is the cause, does anyone know how to set this to a much lower value for the validation steps?

mira-murali commented 1 year ago

@SimonGuoNjust could you elaborate on how you included @AvoidCUDAOOM.retry_if_cuda_oom as well as the max_mask_to_train constraint? The former doesn't seem to make a difference for me.

@qwert31639 I added @torch.no_grad() but it also doesn't seem to make much of a difference.

I am trying to run on multiple (24 GB) GPUs using /mmdetection/tools/dist_train.sh and I notice that one GPU remains around the 14GB memory mark and the other one maxes out at 23GB and causes the error. Does this have something to do with how pytorch handles distributed training or how mmdetection is handling it?

melaanya commented 1 year ago

I have the same CUDA out-of-memory error during validation (training finishes fine, with only ~30% of memory occupied) in a single-GPU setting.

caj-github commented 1 year ago

same error

leo-q8 commented 1 year ago

I have also noticed this pattern of increasing memory usage in the first and second epochs.

Me too. And when I add the --amp param, NaN loss appears in the log.

SimonGuoNjust commented 1 year ago

@SimonGuoNjust could you elaborate on how you included @AvoidCUDAOOM.retry_if_cuda_oom as well as the max_mask_to_train constraint? The former doesn't seem to make a difference for me.

@qwert31639 I added @torch.no_grad() but it also doesn't seem to make much of a difference.

I am trying to run on multiple (24 GB) GPUs using /mmdetection/tools/dist_train.sh and I notice that one GPU remains around the 14GB memory mark and the other one maxes out at 23GB and causes the error. Does this have something to do with how pytorch handles distributed training or how mmdetection is handling it?

You can refer to the implementation of CondInst. Simply put, only a subset of mask predictions, randomly selected from the positive samples, is used to compute the loss (see the sketch below). I also referred to MaxIoUAssigner and added a GPU assign threshold to DynamicSoftLabelAssigner to prevent CUDA OOM during label assignment. I think the OOM problem is most likely to occur during label assignment and loss backpropagation. @AvoidCUDAOOM.retry_if_cuda_oom cannot convert the InstanceData-format inputs to FP16 when OOM occurs, so it won't work.
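
A self-contained sketch of that subset selection, assuming it is spliced into loss_mask_by_feat where the positive indices are gathered; cap_positive_masks and max_masks_to_train are illustrative names, not mmdet API:

import torch
from torch import Tensor


def cap_positive_masks(pos_inds: Tensor, max_masks_to_train: int = 100) -> Tensor:
    """Randomly keep at most ``max_masks_to_train`` positive indices."""
    if pos_inds.numel() <= max_masks_to_train:
        return pos_inds
    keep = torch.randperm(pos_inds.numel(), device=pos_inds.device)[:max_masks_to_train]
    return pos_inds[keep]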

I have also noticed this pattern of increasing memory usage in the first and second epochs.

Me too. And when I add the --amp param, NaN loss appears in the log.

I also encountered this problem. It seems that the loss of RTMDet-Ins may exceed the range of FP16 during training and then becomes NaN, so I just turned off AMP mode.
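
A tiny illustration of that overflow, assuming plain float16 casting with no loss scaling:

import torch

print(torch.finfo(torch.float16).max)        # 65504.0, the largest finite fp16 value
x = torch.tensor(70000.0).to(torch.float16)  # overflows to inf
print(x - x)                                 # inf - inf gives nan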

JKelle commented 1 year ago

I have two RTMDet-Ins projects that both experience a CUDA OOM error. The smaller project uses ~95% of memory for several hundred iterations, then eventually runs out of memory in the dice loss computation. The larger project runs out after just 10 or fewer iterations a little earlier in loss_mask_by_feat.

I tried putting the @AvoidCUDAOOM.retry_if_cuda_oom decorator on loss_mask_by_feat. This seems to have resolved the issue for the smaller project, but not the larger one. All of the following stack traces are for the larger project:

Here is the original error message:

Traceback (most recent call last):
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
      runner.train()
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
      model = self.train_loop.run()  # type: ignore
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
      self.run_iter(data_batch)
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
      outputs = self.runner.model.train_step(
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
      losses = self._run_forward(data, mode='loss')
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
      results = self(**data, mode=mode)
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
      return forward_call(*input, **kwargs)
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
      output = self.module(*inputs[0], **kwargs[0])
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
      return forward_call(*input, **kwargs)
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
      return self.loss(inputs, data_samples)
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
      losses = self.bbox_head.loss(x, batch_data_samples)
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
      losses = self.loss_by_feat(*loss_inputs)
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 751, in loss_by_feat
      loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
    File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 632, in loss_mask_by_feat
      pos_gt_masks = torch.cat(pos_gt_masks, 0)
  RuntimeError: CUDA out of memory. Tried to allocate 2.56 GiB (GPU 0; 14.61 GiB total capacity; 8.73 GiB already allocated; 2.10 GiB free; 11.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

After putting @AvoidCUDAOOM.retry_if_cuda_oom on loss_mask_by_feat:

06/15 16:48:43 - mmengine - WARNING - Attempting to copy inputs of <function RTMDetInsHead.loss_mask_by_feat at 0x7efd15947ca0> to FP16 due to CUDA OOM
...
Traceback (most recent call last):
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 382, in _wrap
    ret = record(fn)(*args_)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
    runner.train()
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
    self.run_iter(data_batch)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
    outputs = self.runner.model.train_step(
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
    results = self(**data, mode=mode)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
    return self.loss(inputs, data_samples)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
    losses = self.bbox_head.loss(x, batch_data_samples)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 751, in loss_by_feat
    loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/utils/memory.py", line 172, in wrapped
    output = func(*fp16_args, **fp16_kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 622, in loss_mask_by_feat
    pos_mask_logits = self._mask_predict_by_feat_single(
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 586, in _mask_predict_by_feat_single
    x = F.conv2d(
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

When I move the decorator up to loss_by_feat, I get the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 382, in _wrap
    ret = record(fn)(*args_)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
    runner.train()
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
    self.run_iter(data_batch)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
    outputs = self.runner.model.train_step(
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
    results = self(**data, mode=mode)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
    return self.loss(inputs, data_samples)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
    losses = self.bbox_head.loss(x, batch_data_samples)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/utils/memory.py", line 148, in wrapped
    return func(*args, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 720, in loss_by_feat
    gt_instances.masks = gt_instances.masks.to_tensor(
AttributeError: 'Tensor' object has no attribute 'to_tensor'

I moved the decorator back to loss_mask_by_feat and commented out the part of retry_if_cuda_oom that does fp16 conversion so it skips straight to using the CPU.

06/15 16:52:23 - mmengine - WARNING - Attempting to copy inputs of <function RTMDetInsHead.loss_mask_by_feat at 0x7f4ebd75cc10> to CPU due to CUDA OOM
06/15 16:52:23 - mmengine - WARNING - Convert outputs to GPU (device=cuda:0)
...
Traceback (most recent call last):
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 382, in _wrap
    ret = record(fn)(*args_)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
    runner.train()
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
    self.run_iter(data_batch)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
    outputs = self.runner.model.train_step(
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
    results = self(**data, mode=mode)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
    return self.loss(inputs, data_samples)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
    losses = self.bbox_head.loss(x, batch_data_samples)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 751, in loss_by_feat
    loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/utils/memory.py", line 192, in wrapped
    output = func(*cpu_args, **cpu_kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 622, in loss_mask_by_feat
    pos_mask_logits = self._mask_predict_by_feat_single(
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 574, in _mask_predict_by_feat_single
    relative_coord = (points - coord).permute(0, 2, 1) / (
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I returned the retry_if_cuda_oom code back to normal and changed my optimizer type to AmpOptimWrapper. I can see the OOM-triggered conversion to FP16, which didn't fail this time but still hit OOM, leading to moving the inputs to the CPU, which then failed. Training lasted a few hundred iterations this time, which is a lot longer than the ~10 iterations from before. I'm not sure whether this is a coincidence.

06/15 17:18:40 - mmengine - WARNING - Attempting to copy inputs of <function RTMDetInsHead.loss_mask_by_feat at 0x7fcbf4c61c10> to FP16 due to CUDA OOM
06/15 17:18:41 - mmengine - WARNING - Using FP16 still meet CUDA OOM
06/15 17:18:41 - mmengine - WARNING - Attempting to copy inputs of <function RTMDetInsHead.loss_mask_by_feat at 0x7fcbf4c61c10> to CPU due to CUDA OOM
06/15 17:18:41 - mmengine - WARNING - Convert outputs to GPU (device=cuda:0)
...
Traceback (most recent call last):
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 382, in _wrap
    ret = record(fn)(*args_)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
    runner.train()
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
    self.run_iter(data_batch)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
    outputs = self.runner.model.train_step(
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
    results = self(**data, mode=mode)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
    return self.loss(inputs, data_samples)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
    losses = self.bbox_head.loss(x, batch_data_samples)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 751, in loss_by_feat
    loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/utils/memory.py", line 191, in wrapped
    output = func(*cpu_args, **cpu_kwargs)
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 622, in loss_mask_by_feat
    pos_mask_logits = self._mask_predict_by_feat_single(
  File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 574, in _mask_predict_by_feat_single
    relative_coord = (points - coord).permute(0, 2, 1) / (
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

It seems to me the implementation of loss_mask_by_feat is not compatible with AvoidCUDAOOM.retry_if_cuda_oom.

JKelle commented 1 year ago

I worked around this by creating a new type of sampler that limits the number of positive detections. I basically copied code from the PseudoSampler and RandomSampler classes.

# Imports assumed for mmdet 3.x; adjust the paths if your layout differs.
import torch
from torch import Tensor

from mmengine.structures import InstanceData

from mmdet.registry import TASK_UTILS
from mmdet.models.task_modules.assigners import AssignResult
from mmdet.models.task_modules.samplers import BaseSampler, SamplingResult


@TASK_UTILS.register_module()
class CapNumPosSampler(BaseSampler):
    """Sampler that randomly caps the number of positive samples used for loss."""

    def __init__(self, max_num_pos: int, **kwargs):
        self.max_num_pos = max_num_pos

    def _sample_neg(self, **kwargs):
        raise NotImplementedError

    def random_choice(self, gallery: Tensor, num: int) -> Tensor:
        """Random select some elements from the gallery.

        Args:
            gallery (Tensor): indices pool.
            num (int): expected sample num.

        Returns:
            Tensor: sampled indices.
        """
        assert len(gallery) >= num

        is_tensor = isinstance(gallery, torch.Tensor)
        assert is_tensor, 'Only support Tensor now, got {}'.format(type(gallery))
        if not is_tensor:
            if torch.cuda.is_available():
                device = torch.cuda.current_device()
            else:
                device = 'cpu'
            gallery = torch.tensor(gallery, dtype=torch.long, device=device)
        # This is a temporary fix. We can revert the following code
        # when PyTorch fixes the abnormal return of torch.randperm.
        # See: https://github.com/open-mmlab/mmdetection/pull/5014
        perm = torch.randperm(gallery.numel())[:num].to(device=gallery.device)
        rand_inds = gallery[perm]
        return rand_inds

    def _sample_pos(self, assign_result: AssignResult, num_expected: int) -> Tensor:
        """Randomly sample some positive samples.

        Args:
            assign_result (:obj:`AssignResult`): Bbox assigning results.
            num_expected (int): The number of expected positive samples

        Returns:
            Tensor or ndarray: sampled indices.
        """
        pos_inds = torch.nonzero(assign_result.gt_inds > 0, as_tuple=False)
        if pos_inds.numel() != 0:
            pos_inds = pos_inds.squeeze(1)
        if pos_inds.numel() <= num_expected:
            return pos_inds
        else:
            return self.random_choice(pos_inds, num_expected)

    def sample(self, assign_result: AssignResult, pred_instances: InstanceData,
               gt_instances: InstanceData, *args, **kwargs):
        """Directly returns the positive and negative indices  of samples.

        Args:
            assign_result (:obj:`AssignResult`): Bbox assigning results.
            pred_instances (:obj:`InstanceData`): Instances of model
                predictions. It includes ``priors``, and the priors can
                be anchors, points, or bboxes predicted by the model,
                shape(n, 4).
            gt_instances (:obj:`InstanceData`): Ground truth of instance
                annotations. It usually includes ``bboxes`` and ``labels``
                attributes.

        Returns:
            :obj:`SamplingResult`: sampler results
        """
        gt_bboxes = gt_instances.bboxes
        priors = pred_instances.priors

        pos_inds = self._sample_pos(assign_result, self.max_num_pos).unique()
        neg_inds = torch.nonzero(
            assign_result.gt_inds == 0, as_tuple=False).squeeze(-1).unique()

        gt_flags = priors.new_zeros(priors.shape[0], dtype=torch.uint8)
        sampling_result = SamplingResult(
            pos_inds=pos_inds,
            neg_inds=neg_inds,
            priors=priors,
            gt_bboxes=gt_bboxes,
            assign_result=assign_result,
            gt_flags=gt_flags,
            avg_factor_with_neg=False)
        return sampling_result

Then in my config:

...
    train_cfg=dict(
        sampler=dict(type="CapNumPosSampler", max_num_pos=2000),
        ...
    ),
...
custom_imports = dict(
    imports=[
        "cap_num_pos_sampler.py",
    ],
    allow_failed_imports=False,
)

Then when you run training, make sure to load the custom module:

from mmengine.utils import import_modules_from_strings
import_modules_from_strings(**cfg["custom_imports"])

likyoo commented 1 year ago

This bug also exists in mmyolo.

menggui1993 commented 1 year ago

For those who got the OOM error during validation, I've found one problem. During validation, the inference input goes through the val_pipeline, which contains 'Resize', so the forward pass itself is fine. But in post-processing, the output masks are interpolated back to the original image size, and then sigmoid and thresholding are applied to get the final masks. Refer to the code snippet below: https://github.com/open-mmlab/mmdetection/blob/f78af7785ada87f1ced75a2313746e4ba3149760/mmdet/models/dense_heads/rtmdet_ins_head.py#L498-L510

This can be extremely memory-costly if your original images have a large resolution. For example, a 4000x3000 image with 100 instances produces a mask output tensor of 100x4000x3000, which costs over 4 GB of memory, and that is just a single tensor; there can be several temporary tensors of the same size. I haven't found an effective solution yet. If you set the 'rescale' parameter to False, the output masks won't be scaled to the original image size, but that leads to wrong metric calculation. I tried putting the sigmoid before the interpolation, which saves some memory, but not much. I think one solution would be to set 'rescale' to False and, when calculating validation metrics, resize the original image to match the output mask size.
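
A quick back-of-the-envelope check of that number (float32 masks, 100 instances at the 4000x3000 example resolution above):

num_instances, height, width = 100, 3000, 4000
bytes_per_element = 4  # float32
print(num_instances * height * width * bytes_per_element / 1024**3)  # ~4.47 GiB for a single tensor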

echonax07 commented 11 months ago

@SimonGuoNjust could you elaborate on how you included @AvoidCUDAOOM.retry_if_cuda_oom as well as the max_mask_to_train constraint? The former doesn't seem to make a difference for me. @qwert31639 I added @torch.no_grad() but it also doesn't seem to make much of a difference. I am trying to run on multiple (24 GB) GPUs using /mmdetection/tools/dist_train.sh and I notice that one GPU remains around the 14GB memory mark and the other one maxes out at 23GB and causes the error. Does this have something to do with how pytorch handles distributed training or how mmdetection is handling it?

You can refer to the implementation of CondInst. Simply put, only a subset of mask predictions, randomly selected from the positive samples, is used to compute the loss. I also referred to MaxIoUAssigner and added a GPU assign threshold to DynamicSoftLabelAssigner to prevent CUDA OOM during label assignment. I think the OOM problem is most likely to occur during label assignment and loss backpropagation. @AvoidCUDAOOM.retry_if_cuda_oom cannot convert the InstanceData-format inputs to FP16 when OOM occurs, so it won't work.

I have also noticed this pattern of increasing memory usage in the first and second epochs.

Me too. And when I add the --amp param, NaN loss appears in the log.

I also encountered this problem. It seems that the loss of RTMDet-Ins may exceed the range of FP16 during training and then becomes NaN, so I just turned off AMP mode.

@SimonGuoNjust Can you please share your implementation of gpu_assign_thr in DynamicSoftLabelAssigner?

Thanks,

tsrobcvai commented 11 months ago

The bug still exists, I got the same error during validation.

echonax07 commented 11 months ago

I wrote my own version of gpu_assign_thr in DynamicSoftLabelAssigner. It solves the out-of-memory error during training, as the computation now happens on the CPU and the results are passed back to the GPU at the end.


# Modified from mmdet/models/task_modules/assigners/dynamic_soft_label_assigner.py;
# the imports are unchanged from that file.
@TASK_UTILS.register_module()
class DynamicSoftLabelAssigner(BaseAssigner):
    """Computes matching between predictions and ground truth with dynamic soft
    label assignment.

    Args:
        soft_center_radius (float): Radius of the soft center prior.
            Defaults to 3.0.
        topk (int): Select top-k predictions to calculate dynamic k
            best matches for each gt. Defaults to 13.
        iou_weight (float): The scale factor of iou cost. Defaults to 3.0.
        gpu_assign_thr (float): Upper bound on the number of GTs for GPU
            assignment. When the number of GTs exceeds this threshold, the
            assignment is computed on the CPU. A negative value disables
            CPU assignment. Defaults to -1.
        iou_calculator (ConfigType): Config of overlaps Calculator.
            Defaults to dict(type='BboxOverlaps2D').
    """

    def __init__(
            self,
            soft_center_radius: float = 3.0,
            topk: int = 13,
            iou_weight: float = 3.0,
            gpu_assign_thr: float = -1,
            iou_calculator: ConfigType = dict(type='BboxOverlaps2D')):

        self.soft_center_radius = soft_center_radius
        self.topk = topk
        self.iou_weight = iou_weight
        # ic(gpu_assign_thr)
        self.gpu_assign_thr = gpu_assign_thr
        self.iou_calculator = TASK_UTILS.build(iou_calculator)

    def assign(self,
               pred_instances: InstanceData,
               gt_instances: InstanceData,
               gt_instances_ignore: Optional[InstanceData] = None,
               **kwargs) -> AssignResult:
        """Assign gt to priors.

        Args:
            pred_instances (:obj:`InstanceData`): Instances of model
                predictions. It includes ``priors``, and the priors can
                be anchors or points, or the bboxes predicted by the
                previous stage, has shape (n, 4). The bboxes predicted by
                the current model or stage will be named ``bboxes``,
                ``labels``, and ``scores``, the same as the ``InstanceData``
                in other places.
            gt_instances (:obj:`InstanceData`): Ground truth of instance
                annotations. It usually includes ``bboxes``, with shape (k, 4),
                and ``labels``, with shape (k, ).
            gt_instances_ignore (:obj:`InstanceData`, optional): Instances
                to be ignored during training. It includes ``bboxes``
                attribute data that is ignored during training and testing.
                Defaults to None.
        Returns:
            obj:`AssignResult`: The assigned result.
        """
        gt_bboxes = gt_instances.bboxes
        gt_labels = gt_instances.labels
        num_gt = gt_bboxes.size(0)

        decoded_bboxes = pred_instances.bboxes
        pred_scores = pred_instances.scores
        priors = pred_instances.priors
        num_bboxes = decoded_bboxes.size(0)

        # ic(gt_bboxes.shape[0])
        # ic(self.gpu_assign_thr)

        assign_on_cpu = True if (self.gpu_assign_thr > 0) and (
            gt_bboxes.shape[0] > self.gpu_assign_thr) else False

        # ic(assign_on_cpu)

        # compute overlap and assign gt on CPU when number of GT is large
        if assign_on_cpu:
            # ic('assigning on cpu')
            device = priors.device
            priors = priors.cpu()
            gt_bboxes = gt_bboxes.cpu()
            gt_labels = gt_labels.cpu()
            decoded_bboxes = decoded_bboxes.cpu()
            pred_scores = pred_scores.cpu()

            # if gt_bboxes_ignore is not None:
            #     gt_bboxes_ignore = gt_bboxes_ignore.cpu()

        # assign 0 by default
        assigned_gt_inds = decoded_bboxes.new_full((num_bboxes, ),
                                                   0,
                                                   dtype=torch.long)
        if num_gt == 0 or num_bboxes == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
            if num_gt == 0:
                # No truth, assign everything to background
                assigned_gt_inds[:] = 0
            assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
                                                      -1,
                                                      dtype=torch.long)
            if assign_on_cpu:
                # num_gt = num_gt.to(device)
                assigned_gt_inds = assigned_gt_inds.to(device)
                max_overlaps = max_overlaps.to(device)
                assigned_labels = assigned_labels.to(device)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

        prior_center = priors[:, :2]
        if isinstance(gt_bboxes, BaseBoxes):
            is_in_gts = gt_bboxes.find_inside_points(prior_center)
        else:
            # Tensor boxes will be treated as horizontal boxes by defaults
            lt_ = prior_center[:, None] - gt_bboxes[:, :2]
            rb_ = gt_bboxes[:, 2:] - prior_center[:, None]

            deltas = torch.cat([lt_, rb_], dim=-1)
            is_in_gts = deltas.min(dim=-1).values > 0

        valid_mask = is_in_gts.sum(dim=1) > 0

        valid_decoded_bbox = decoded_bboxes[valid_mask]
        valid_pred_scores = pred_scores[valid_mask]
        num_valid = valid_decoded_bbox.size(0)

        if num_valid == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
            assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
                                                      -1,
                                                      dtype=torch.long)
            if assign_on_cpu:
                # num_gt = num_gt.to(device)
                assigned_gt_inds = assigned_gt_inds.to(device)
                max_overlaps = max_overlaps.to(device)
                assigned_labels = assigned_labels.to(device)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)
        if hasattr(gt_instances, 'masks'):
            gt_center = center_of_mass(gt_instances.masks, eps=EPS)
        elif isinstance(gt_bboxes, BaseBoxes):
            gt_center = gt_bboxes.centers
        else:
            # Tensor boxes will be treated as horizontal boxes by defaults
            gt_center = (gt_bboxes[:, :2] + gt_bboxes[:, 2:]) / 2.0
        valid_prior = priors[valid_mask]
        strides = valid_prior[:, 2]
        distance = (valid_prior[:, None, :2] - gt_center[None, :, :]
                    ).pow(2).sum(-1).sqrt() / strides[:, None]
        soft_center_prior = torch.pow(10, distance - self.soft_center_radius)

        pairwise_ious = self.iou_calculator(valid_decoded_bbox, gt_bboxes)
        iou_cost = -torch.log(pairwise_ious + EPS) * self.iou_weight

        gt_onehot_label = (
            F.one_hot(gt_labels.to(torch.int64),
                      pred_scores.shape[-1]).float().unsqueeze(0).repeat(
                          num_valid, 1, 1))
        valid_pred_scores = valid_pred_scores.unsqueeze(1).repeat(1, num_gt, 1)

        soft_label = gt_onehot_label * pairwise_ious[..., None]
        scale_factor = soft_label - valid_pred_scores.sigmoid()
        soft_cls_cost = F.binary_cross_entropy_with_logits(
            valid_pred_scores, soft_label,
            reduction='none') * scale_factor.abs().pow(2.0)
        soft_cls_cost = soft_cls_cost.sum(dim=-1)

        cost_matrix = soft_cls_cost + iou_cost + soft_center_prior

        matched_pred_ious, matched_gt_inds = self.dynamic_k_matching(
            cost_matrix, pairwise_ious, num_gt, valid_mask)

        # convert to AssignResult format
        assigned_gt_inds[valid_mask] = matched_gt_inds + 1
        assigned_labels = assigned_gt_inds.new_full((num_bboxes, ), -1)
        assigned_labels[valid_mask] = gt_labels[matched_gt_inds].long()
        max_overlaps = assigned_gt_inds.new_full((num_bboxes, ),
                                                 -INF,
                                                 dtype=torch.float32)
        max_overlaps[valid_mask] = matched_pred_ious

        if assign_on_cpu:
            # num_gt = num_gt.to(device)
            assigned_gt_inds = assigned_gt_inds.to(device)
            max_overlaps = max_overlaps.to(device)
            assigned_labels = assigned_labels.to(device)
        return AssignResult(
            num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

    def dynamic_k_matching(self, cost: Tensor, pairwise_ious: Tensor,
                           num_gt: int,
                           valid_mask: Tensor) -> Tuple[Tensor, Tensor]:
        """Use IoU and matching cost to calculate the dynamic top-k positive
        targets. Same as SimOTA.

        Args:
            cost (Tensor): Cost matrix.
            pairwise_ious (Tensor): Pairwise iou matrix.
            num_gt (int): Number of gt.
            valid_mask (Tensor): Mask for valid bboxes.

        Returns:
            tuple: matched ious and gt indexes.
        """
        matching_matrix = torch.zeros_like(cost, dtype=torch.uint8)
        # select candidate topk ious for dynamic-k calculation
        candidate_topk = min(self.topk, pairwise_ious.size(0))
        topk_ious, _ = torch.topk(pairwise_ious, candidate_topk, dim=0)
        # calculate dynamic k for each gt
        dynamic_ks = torch.clamp(topk_ious.sum(0).int(), min=1)
        for gt_idx in range(num_gt):
            _, pos_idx = torch.topk(
                cost[:, gt_idx], k=dynamic_ks[gt_idx], largest=False)
            matching_matrix[:, gt_idx][pos_idx] = 1

        del topk_ious, dynamic_ks, pos_idx

        prior_match_gt_mask = matching_matrix.sum(1) > 1
        if prior_match_gt_mask.sum() > 0:
            cost_min, cost_argmin = torch.min(
                cost[prior_match_gt_mask, :], dim=1)
            matching_matrix[prior_match_gt_mask, :] *= 0
            matching_matrix[prior_match_gt_mask, cost_argmin] = 1
        # get foreground mask inside box and center prior
        fg_mask_inboxes = matching_matrix.sum(1) > 0
        valid_mask[valid_mask.clone()] = fg_mask_inboxes

        matched_gt_inds = matching_matrix[fg_mask_inboxes, :].argmax(1)
        matched_pred_ious = (matching_matrix *
                             pairwise_ious).sum(1)[fg_mask_inboxes]
        return matched_pred_ious, matched_gt_inds

kingman1980 commented 11 months ago

The bug still exists, I got the same error during validation.

me too

20171758 commented 11 months ago

The bug still exists, I got the same error during validation.

mohamedrekik commented 10 months ago

I was facing the same issue and I was able to solve it in two ways:

  1. If the fast_test parameter is set to True, set it to False instead.
  2. Decrease the max_per_img parameter (in my case, from 300 to 100); see the sketch below.

Only one of these changes was enough on its own for me.
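
For reference, a hedged sketch of where max_per_img sits: it is part of model.test_cfg for RTMDet-Ins, and the other keys shown here follow the base config's defaults, so they may differ in your setup:

model = dict(
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.6),
        max_per_img=100,  # lowered from 300
        mask_thr_binary=0.5))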

20171758 commented 10 months ago

The bug still exists, I got the same error during validation.

During the validation step of training, memory usage increases significantly and eventually leads to an OOM error, which in my case was caused by the excessively large image resolution of my dataset (6240x4160). I solved the problem by reducing the image resolution of the dataset (to 1660x1080).

eppane commented 6 months ago

I wrote my own version of gpu_assign_thr in DynamicSoftLabelAssigner. It solves the out-of-memory error during training, as the computation now happens on the CPU and the results are passed back to the GPU at the end.


@TASK_UTILS.register_module()
class DynamicSoftLabelAssigner(BaseAssigner):
    """Computes matching between predictions and ground truth with dynamic soft
    label assignment.

    Args:
        soft_center_radius (float): Radius of the soft center prior.
            Defaults to 3.0.
        topk (int): Select top-k predictions to calculate dynamic k
            best matches for each gt. Defaults to 13.
        iou_weight (float): The scale factor of iou cost. Defaults to 3.0.
        gpu_assign_thr (float): The upper bound of the number of GT instances
            to assign on GPU. When an image has more GTs than this value, the
            assignment is computed on CPU instead. -1 means always assign on
            GPU. Defaults to -1.
        iou_calculator (ConfigType): Config of overlaps Calculator.
            Defaults to dict(type='BboxOverlaps2D').
    """

    def __init__(
            self,
            soft_center_radius: float = 3.0,
            topk: int = 13,
            iou_weight: float = 3.0,
            gpu_assign_thr: float = -1,
            iou_calculator: ConfigType = dict(type='BboxOverlaps2D')):

        self.soft_center_radius = soft_center_radius
        self.topk = topk
        self.iou_weight = iou_weight
        # ic(gpu_assign_thr)
        self.gpu_assign_thr = gpu_assign_thr
        self.iou_calculator = TASK_UTILS.build(iou_calculator)

    def assign(self,
               pred_instances: InstanceData,
               gt_instances: InstanceData,
               gt_instances_ignore: Optional[InstanceData] = None,
               **kwargs) -> AssignResult:
        """Assign gt to priors.

        Args:
            pred_instances (:obj:`InstanceData`): Instances of model
                predictions. It includes ``priors``, and the priors can
                be anchors or points, or the bboxes predicted by the
                previous stage, has shape (n, 4). The bboxes predicted by
                the current model or stage will be named ``bboxes``,
                ``labels``, and ``scores``, the same as the ``InstanceData``
                in other places.
            gt_instances (:obj:`InstanceData`): Ground truth of instance
                annotations. It usually includes ``bboxes``, with shape (k, 4),
                and ``labels``, with shape (k, ).
            gt_instances_ignore (:obj:`InstanceData`, optional): Instances
                to be ignored during training. It includes ``bboxes``
                attribute data that is ignored during training and testing.
                Defaults to None.
        Returns:
            obj:`AssignResult`: The assigned result.
        """
        gt_bboxes = gt_instances.bboxes
        gt_labels = gt_instances.labels
        num_gt = gt_bboxes.size(0)

        decoded_bboxes = pred_instances.bboxes
        pred_scores = pred_instances.scores
        priors = pred_instances.priors
        num_bboxes = decoded_bboxes.size(0)

        # ic(gt_bboxes.shape[0])
        # ic(self.gpu_assign_thr)

        assign_on_cpu = (self.gpu_assign_thr > 0
                         and gt_bboxes.shape[0] > self.gpu_assign_thr)

        # ic(assign_on_cpu)

        # compute overlap and assign gt on CPU when number of GT is large
        if assign_on_cpu:
            # ic('assigning on cpu')
            device = priors.device
            priors = priors.cpu()
            gt_bboxes = gt_bboxes.cpu()
            gt_labels = gt_labels.cpu()
            decoded_bboxes = decoded_bboxes.cpu()
            pred_scores = pred_scores.cpu()

            # if gt_bboxes_ignore is not None:
            #     gt_bboxes_ignore = gt_bboxes_ignore.cpu()

        # assign 0 by default
        assigned_gt_inds = decoded_bboxes.new_full((num_bboxes, ),
                                                   0,
                                                   dtype=torch.long)
        if num_gt == 0 or num_bboxes == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
            if num_gt == 0:
                # No truth, assign everything to background
                assigned_gt_inds[:] = 0
            assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
                                                      -1,
                                                      dtype=torch.long)
            if assign_on_cpu:
                # num_gt = num_gt.to(device)
                assigned_gt_inds = assigned_gt_inds.to(device)
                max_overlaps = max_overlaps.to(device)
                assigned_labels = assigned_labels.to(device)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

        prior_center = priors[:, :2]
        if isinstance(gt_bboxes, BaseBoxes):
            is_in_gts = gt_bboxes.find_inside_points(prior_center)
        else:
            # Tensor boxes will be treated as horizontal boxes by defaults
            lt_ = prior_center[:, None] - gt_bboxes[:, :2]
            rb_ = gt_bboxes[:, 2:] - prior_center[:, None]

            deltas = torch.cat([lt_, rb_], dim=-1)
            is_in_gts = deltas.min(dim=-1).values > 0

        valid_mask = is_in_gts.sum(dim=1) > 0

        valid_decoded_bbox = decoded_bboxes[valid_mask]
        valid_pred_scores = pred_scores[valid_mask]
        num_valid = valid_decoded_bbox.size(0)

        if num_valid == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
            assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
                                                      -1,
                                                      dtype=torch.long)
            if assign_on_cpu:
                # num_gt = num_gt.to(device)
                assigned_gt_inds = assigned_gt_inds.to(device)
                max_overlaps = max_overlaps.to(device)
                assigned_labels = assigned_labels.to(device)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)
        if hasattr(gt_instances, 'masks'):
            gt_center = center_of_mass(gt_instances.masks, eps=EPS)
        elif isinstance(gt_bboxes, BaseBoxes):
            gt_center = gt_bboxes.centers
        else:
            # Tensor boxes will be treated as horizontal boxes by defaults
            gt_center = (gt_bboxes[:, :2] + gt_bboxes[:, 2:]) / 2.0
        valid_prior = priors[valid_mask]
        strides = valid_prior[:, 2]
        distance = (valid_prior[:, None, :2] - gt_center[None, :, :]
                    ).pow(2).sum(-1).sqrt() / strides[:, None]
        soft_center_prior = torch.pow(10, distance - self.soft_center_radius)

        pairwise_ious = self.iou_calculator(valid_decoded_bbox, gt_bboxes)
        iou_cost = -torch.log(pairwise_ious + EPS) * self.iou_weight

        gt_onehot_label = (
            F.one_hot(gt_labels.to(torch.int64),
                      pred_scores.shape[-1]).float().unsqueeze(0).repeat(
                          num_valid, 1, 1))
        valid_pred_scores = valid_pred_scores.unsqueeze(1).repeat(1, num_gt, 1)

        soft_label = gt_onehot_label * pairwise_ious[..., None]
        scale_factor = soft_label - valid_pred_scores.sigmoid()
        soft_cls_cost = F.binary_cross_entropy_with_logits(
            valid_pred_scores, soft_label,
            reduction='none') * scale_factor.abs().pow(2.0)
        soft_cls_cost = soft_cls_cost.sum(dim=-1)

        cost_matrix = soft_cls_cost + iou_cost + soft_center_prior

        matched_pred_ious, matched_gt_inds = self.dynamic_k_matching(
            cost_matrix, pairwise_ious, num_gt, valid_mask)

        # convert to AssignResult format
        assigned_gt_inds[valid_mask] = matched_gt_inds + 1
        assigned_labels = assigned_gt_inds.new_full((num_bboxes, ), -1)
        assigned_labels[valid_mask] = gt_labels[matched_gt_inds].long()
        max_overlaps = assigned_gt_inds.new_full((num_bboxes, ),
                                                 -INF,
                                                 dtype=torch.float32)
        max_overlaps[valid_mask] = matched_pred_ious

        if assign_on_cpu:
            # num_gt = num_gt.to(device)
            assigned_gt_inds = assigned_gt_inds.to(device)
            max_overlaps = max_overlaps.to(device)
            assigned_labels = assigned_labels.to(device)
        return AssignResult(
            num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

    def dynamic_k_matching(self, cost: Tensor, pairwise_ious: Tensor,
                           num_gt: int,
                           valid_mask: Tensor) -> Tuple[Tensor, Tensor]:
        """Use IoU and matching cost to calculate the dynamic top-k positive
        targets. Same as SimOTA.

        Args:
            cost (Tensor): Cost matrix.
            pairwise_ious (Tensor): Pairwise iou matrix.
            num_gt (int): Number of gt.
            valid_mask (Tensor): Mask for valid bboxes.

        Returns:
            tuple: matched ious and gt indexes.
        """
        matching_matrix = torch.zeros_like(cost, dtype=torch.uint8)
        # select candidate topk ious for dynamic-k calculation
        candidate_topk = min(self.topk, pairwise_ious.size(0))
        topk_ious, _ = torch.topk(pairwise_ious, candidate_topk, dim=0)
        # calculate dynamic k for each gt
        dynamic_ks = torch.clamp(topk_ious.sum(0).int(), min=1)
        for gt_idx in range(num_gt):
            _, pos_idx = torch.topk(
                cost[:, gt_idx], k=dynamic_ks[gt_idx], largest=False)
            matching_matrix[:, gt_idx][pos_idx] = 1

        del topk_ious, dynamic_ks, pos_idx

        prior_match_gt_mask = matching_matrix.sum(1) > 1
        if prior_match_gt_mask.sum() > 0:
            cost_min, cost_argmin = torch.min(
                cost[prior_match_gt_mask, :], dim=1)
            matching_matrix[prior_match_gt_mask, :] *= 0
            matching_matrix[prior_match_gt_mask, cost_argmin] = 1
        # get foreground mask inside box and center prior
        fg_mask_inboxes = matching_matrix.sum(1) > 0
        valid_mask[valid_mask.clone()] = fg_mask_inboxes

        matched_gt_inds = matching_matrix[fg_mask_inboxes, :].argmax(1)
        matched_pred_ious = (matching_matrix *
                             pairwise_ious).sum(1)[fg_mask_inboxes]
        return matched_pred_ious, matched_gt_inds
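For reference, a hedged sketch of enabling the new option from the config, assuming the modified class replaces the built-in assigner (or is registered under a different name); the threshold value is illustrative:

model = dict(
    train_cfg=dict(
        assigner=dict(
            type='DynamicSoftLabelAssigner',
            topk=13,
            # fall back to CPU assignment for images with more than 100 GTs
            gpu_assign_thr=100)))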

This solution slows down training quite significantly.

Tegala commented 3 months ago

When I use 4x12G GPUs (3070Ti) to train RTMDet-Ins CSPNeXt-tiny, some GPUs have very low memory usage while others are nearly saturated.

echonax07 commented 3 months ago

I wrote my own version of gpu_assign_thr for DynamicSoftLabelAssigner; it solves the out-of-memory error during training because the assignment computations now happen on the CPU, and the results are moved back to the GPU at the end.

This solution slows down training quite significantly.

Certainly, the calculations are being performed on the CPU, which is relatively slow. As an alternative, I implemented a try-except block to handle CUDA out-of-memory (OOM) errors, so the assignment falls back to the CPU only when the GPU actually runs out of memory.
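For reference, a minimal sketch of that fallback, assuming a subclass that wraps the stock assigner; the class name OOMSafeDynamicSoftLabelAssigner and the RuntimeError message check are my own choices, not code from this thread:

import torch
from mmdet.registry import TASK_UTILS
from mmdet.models.task_modules.assigners import DynamicSoftLabelAssigner


@TASK_UTILS.register_module()
class OOMSafeDynamicSoftLabelAssigner(DynamicSoftLabelAssigner):
    """Retry the assignment on CPU only when the GPU runs out of memory."""

    def assign(self, pred_instances, gt_instances,
               gt_instances_ignore=None, **kwargs):
        try:
            return super().assign(pred_instances, gt_instances,
                                  gt_instances_ignore, **kwargs)
        except RuntimeError as err:
            if 'out of memory' not in str(err):
                raise
            torch.cuda.empty_cache()
            device = pred_instances.priors.device
            # Retry the whole assignment on CPU ...
            result = super().assign(pred_instances.to('cpu'),
                                    gt_instances.to('cpu'),
                                    gt_instances_ignore, **kwargs)
            # ... and move the result back to the original device.
            result.gt_inds = result.gt_inds.to(device)
            result.max_overlaps = result.max_overlaps.to(device)
            if result.labels is not None:
                result.labels = result.labels.to(device)
            return result

The config could then point the assigner at type='OOMSafeDynamicSoftLabelAssigner' with no other changes (the name is hypothetical).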

ykaganov commented 1 month ago

I found that you can solve this issue by limiting the number of bounding boxes used per image. In mmyolo/data/transforms.py, add the following class to randomly keep at most a fixed number of boxes each time an image is loaded:

import numpy as np

from mmyolo.registry import TRANSFORMS  # or mmdet.registry.TRANSFORMS in mmdetection


@TRANSFORMS.register_module()
class LimitBBoxes:
    """Randomly keep at most ``max_bboxes`` GT boxes per image."""

    def __init__(self, max_bboxes):
        self.max_bboxes = max_bboxes

    def __call__(self, results):
        num_bboxes = len(results['gt_bboxes'])
        if num_bboxes > self.max_bboxes:
            indices = np.random.choice(num_bboxes, self.max_bboxes, replace=False)
            results['gt_bboxes'] = results['gt_bboxes'][indices]
            if 'gt_ignore_flags' in results:
                results['gt_ignore_flags'] = results['gt_ignore_flags'][indices]
            if 'gt_bboxes_labels' in results:
                results['gt_bboxes_labels'] = results['gt_bboxes_labels'][indices]
            if 'gt_labels' in results:
                results['gt_labels'] = results['gt_labels'][indices]
        return results

Also add this new class to the __init__.py file in the transforms folder, and finally add it to your config like so:

train_pipeline = [
    dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='LimitBBoxes', max_bboxes=10),
........
big-gandalf commented 1 month ago

I found that you can solve this issue by limiting the number of bounding boxes used per image.

Does this transform also reduce the number of masks?

ykaganov commented 1 month ago

@big-gandalf No, in my case I did not use the masks. You can easily add a limit for the masks in a similar way, by replacing the 'gt_bboxes' key with the key for the masks.

big-gandalf commented 1 month ago

@big-gandalf No, in my case I did not use the masks. You can easily add a limit for the masks in a similar way, by replacing the 'gt_bboxes' key with the key for the masks.

Actually, I am a little confused. I am trying to apply your solution to my RTMDet instance segmentation model. In my case I am not interested in the bboxes, and I don't want to reduce the number of masks. Does this solution help in my case?

ykaganov commented 1 month ago

@big-gandalf No, in my case I did not use the masks. You can easily add a limit for the masks in a similar way, by replacing the 'gt_bboxes' key with the key for the masks.

Actually, I am a little confused. I am trying to apply your solution to my RTMDet instance segmentation model. In my case I am not interested in the bboxes, and I don't want to reduce the number of masks. Does this solution help in my case?

If you are using instance segmentation it is the same thing: the code will crash if the number of instances (masks) per image is too high, so you need to apply the same solution and subsample the masks with the same indices.
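For completeness, a hedged sketch of a mask-aware variant of the transform above, which subsamples gt_masks together with the boxes; the class name LimitInstances is my own, and it relies on BitmapMasks/PolygonMasks in mmdet supporting indexing by an integer array:

import numpy as np

from mmdet.registry import TRANSFORMS  # or mmyolo.registry.TRANSFORMS in mmyolo


@TRANSFORMS.register_module()
class LimitInstances:
    """Randomly keep at most ``max_instances`` GT instances per image."""

    def __init__(self, max_instances):
        self.max_instances = max_instances

    def __call__(self, results):
        num = len(results['gt_bboxes'])
        if num > self.max_instances:
            indices = np.random.choice(num, self.max_instances, replace=False)
            for key in ('gt_bboxes', 'gt_bboxes_labels', 'gt_ignore_flags'):
                if key in results:
                    results[key] = results[key][indices]
            if 'gt_masks' in results:
                # BitmapMasks / PolygonMasks support integer-array indexing
                results['gt_masks'] = results['gt_masks'][indices]
        return results

The pipeline entry would then be dict(type='LimitInstances', max_instances=100), with the limit chosen to fit your GPU memory (the value is illustrative).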