mira-murali opened this issue 1 year ago
Thanks for your bug report! We are working on optimizing the memory footprint of RTMDet.
Okay, for now, does that mean none of the suggestions in the FAQ apply to RTMDet?
Try to add '--amp' to enable fp16 training.
I have also noticed this increasing memory trait in the first and second epoch.
Has this problem been solved?
+1 same problem with fp16 training, always OOM in the middle of training.
I don't think it's been solved but I haven't checked the latest updates. fp16 training didn't work for me either. I eventually just ended up using an AWS instance with a higher GPU memory and a lower batch size to be able to train.
Maybe adding "@torch.no_grad()" here can solve your problem(?) https://github.com/open-mmlab/mmdetection/blob/61dd8d518b13c7ee4bdf609595b7e803f3ac0224/mmdet/models/task_modules/assigners/dynamic_soft_label_assigner.py#L66
In my case, this problem is solved by adding "@AvoidCUDAOOM.retry_if_cuda_oom" to the loss_mask_by_feat function, as well as adding a max_mask_to_train limitation to constrain the number of masks fed to the loss module (similar to the YOLACT implementation). I am not sure which modification works.
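For reference, a minimal sketch of the decorator half of this suggestion without editing the mmdet sources, assuming mmdet 3.x; the max_mask_to_train cap would still require modifying loss_mask_by_feat itself, and later comments in this thread suggest the decorator alone may not be enough:

# Monkey-patch RTMDetInsHead.loss_mask_by_feat with mmdet's OOM-retry helper
# from your own training script (import paths assumed for mmdet 3.x).
from mmdet.models.dense_heads import RTMDetInsHead
from mmdet.utils import AvoidCUDAOOM

RTMDetInsHead.loss_mask_by_feat = AvoidCUDAOOM.retry_if_cuda_oom(
    RTMDetInsHead.loss_mask_by_feat)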
I get this error but only on the validation steps. Even if I set the batch size to 4 and use ~30% of GPU RAM (RTX 4090, 24 GB), memory is stable during training, but during the validation steps the GPU memory varies wildly.
I only need one mask per image, so if this is the cause, does anyone know how I can set this to a much lower value for the validation steps?
@SimonGuoNjust could you elaborate on how you included @AvoidCUDAOOM.retry_if_cuda_oom as well as the max_mask_to_train constraint? The former doesn't seem to make a difference for me.
@qwert31639 I added @torch.no_grad() but it also doesn't seem to make much of a difference.
I am trying to run on multiple (24 GB) GPUs using /mmdetection/tools/dist_train.sh, and I notice that one GPU stays around the 14 GB memory mark while the other maxes out at 23 GB and causes the error. Does this have something to do with how PyTorch handles distributed training, or with how mmdetection handles it?
I have the same error of CUDA memory error during validation (training finishes fine, with only ~30% memory occupied) with a single GPU setting
same error
I have also noticed this increasing memory trait in the first and second epoch.
Me too. And when I add the --amp param, nan loss appears in the log.
@SimonGuoNjust could you elaborate on how you included @AvoidCUDAOOM.retry_if_cuda_oom as well as the max_mask_to_train constraint?
You can refer to the implementation of CondInst. Simply put, only a subset of mask predictions, randomly selected from the positive samples, is used to compute the loss. I also referred to max_iou_assigner and added a gpu_assign_thr to dynamic_soft_label_assigner in order to prevent CUDA OOM during label assignment. I think the OOM problem is most likely to occur during label assignment and loss back-propagation. @AvoidCUDAOOM.retry_if_cuda_oom cannot convert the InstanceData-format input to fp16 when OOM occurs, so it won't work.
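A minimal illustration of the CondInst-style cap described here; the hook point inside loss_mask_by_feat is up to you, and the name max_masks_to_train and the default of 300 are placeholders:

import torch

def cap_positive_masks(pos_inds: torch.Tensor, max_masks_to_train: int = 300) -> torch.Tensor:
    # Randomly keep at most `max_masks_to_train` positive indices so the mask-loss
    # tensors stay bounded no matter how many instances an image contains.
    if pos_inds.numel() <= max_masks_to_train:
        return pos_inds
    keep = torch.randperm(pos_inds.numel(), device=pos_inds.device)[:max_masks_to_train]
    return pos_inds[keep]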
Me too. And when I add the --amp param, nan loss appears in the log.
I also encountered this problem. It seems that the loss of RTMDet-Ins may exceed the range of fp16 during training and then becomes nan, so I just turned off the amp mode.
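For anyone who still wants mixed precision: the --amp flag is equivalent to swapping the optimizer wrapper in the config, and mmengine's AmpOptimWrapper supports dynamic loss scaling, which can help with fp16 overflow. A rough sketch (the optimizer settings are illustrative, not the official RTMDet-Ins values); reverting to the plain OptimWrapper disables AMP entirely:

optim_wrapper = dict(
    type='AmpOptimWrapper',
    loss_scale='dynamic',
    optimizer=dict(type='AdamW', lr=0.004, weight_decay=0.05))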
I have two RTMDet-Ins projects that both experience a CUDA OOM error. The smaller project uses ~95% of memory for several hundred iterations, then eventually runs out of memory in the dice loss computation. The larger project runs out after just 10 or fewer iterations, a little earlier, in loss_mask_by_feat.
I tried putting the @AvoidCUDAOOM.retry_if_cuda_oom decorator on loss_mask_by_feat. This seems to have resolved the issue for the smaller project, but not the larger one. All of the following stack traces are for the larger project:
Here is the original error message:
Traceback (most recent call last):
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
runner.train()
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
self.run_iter(data_batch)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
return self.loss(inputs, data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
losses = self.bbox_head.loss(x, batch_data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
losses = self.loss_by_feat(*loss_inputs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 751, in loss_by_feat
loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 632, in loss_mask_by_feat
pos_gt_masks = torch.cat(pos_gt_masks, 0)
RuntimeError: CUDA out of memory. Tried to allocate 2.56 GiB (GPU 0; 14.61 GiB total capacity; 8.73 GiB already allocated; 2.10 GiB free; 11.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
After putting @AvoidCUDAOOM.retry_if_cuda_oom on loss_mask_by_feat:
06/15 16:48:43 - mmengine - WARNING - Attempting to copy inputs of <function RTMDetInsHead.loss_mask_by_feat at 0x7efd15947ca0> to FP16 due to CUDA OOM
...
Traceback (most recent call last):
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 382, in _wrap
ret = record(fn)(*args_)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
runner.train()
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
self.run_iter(data_batch)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
return self.loss(inputs, data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
losses = self.bbox_head.loss(x, batch_data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
losses = self.loss_by_feat(*loss_inputs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 751, in loss_by_feat
loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/utils/memory.py", line 172, in wrapped
output = func(*fp16_args, **fp16_kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 622, in loss_mask_by_feat
pos_mask_logits = self._mask_predict_by_feat_single(
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 586, in _mask_predict_by_feat_single
x = F.conv2d(
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
When I move the decorator up to loss_by_feat, I get the following error:
Traceback (most recent call last):
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 382, in _wrap
ret = record(fn)(*args_)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
runner.train()
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
self.run_iter(data_batch)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
return self.loss(inputs, data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
losses = self.bbox_head.loss(x, batch_data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
losses = self.loss_by_feat(*loss_inputs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/utils/memory.py", line 148, in wrapped
return func(*args, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 720, in loss_by_feat
gt_instances.masks = gt_instances.masks.to_tensor(
AttributeError: 'Tensor' object has no attribute 'to_tensor'
I moved the decorator back to loss_mask_by_feat and commented out the part of retry_if_cuda_oom that does fp16 conversion so it skips straight to using the CPU.
06/15 16:52:23 - mmengine - WARNING - Attempting to copy inputs of <function RTMDetInsHead.loss_mask_by_feat at 0x7f4ebd75cc10> to CPU due to CUDA OOM
06/15 16:52:23 - mmengine - WARNING - Convert outputs to GPU (device=cuda:0)
...
Traceback (most recent call last):
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 382, in _wrap
ret = record(fn)(*args_)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
runner.train()
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
self.run_iter(data_batch)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
return self.loss(inputs, data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
losses = self.bbox_head.loss(x, batch_data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
losses = self.loss_by_feat(*loss_inputs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 751, in loss_by_feat
loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/utils/memory.py", line 192, in wrapped
output = func(*cpu_args, **cpu_kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 622, in loss_mask_by_feat
pos_mask_logits = self._mask_predict_by_feat_single(
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 574, in _mask_predict_by_feat_single
relative_coord = (points - coord).permute(0, 2, 1) / (
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
I returned the retry_if_cuda_oom code back to normal and changed my optimizer type to AmpOptimWrapper. I can see the OOM-triggered conversion to FP16, which didn't fail this time but still hit OOM, leading to moving the inputs to the CPU, which then failed. Training lasted a few hundred iterations this time, which is a lot longer than the ~10 iterations from before. I'm not sure if this is a coincidence or not.
06/15 17:18:40 - mmengine - WARNING - Attempting to copy inputs of <function RTMDetInsHead.loss_mask_by_feat at 0x7fcbf4c61c10> to FP16 due to CUDA OOM
06/15 17:18:41 - mmengine - WARNING - Using FP16 still meet CUDA OOM
06/15 17:18:41 - mmengine - WARNING - Attempting to copy inputs of <function RTMDetInsHead.loss_mask_by_feat at 0x7fcbf4c61c10> to CPU due to CUDA OOM
06/15 17:18:41 - mmengine - WARNING - Convert outputs to GPU (device=cuda:0)
...
Traceback (most recent call last):
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 382, in _wrap
ret = record(fn)(*args_)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/biodock/python-queue/biodock_libs/models/biodock/models/openmmlab/trainers.py", line 164, in main_per_gpu
runner.train()
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 279, in run
self.run_iter(data_batch)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/runner/loops.py", line 302, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 123, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 163, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 92, in forward
return self.loss(inputs, data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 78, in loss
losses = self.bbox_head.loss(x, batch_data_samples)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
losses = self.loss_by_feat(*loss_inputs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 751, in loss_by_feat
loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/utils/memory.py", line 191, in wrapped
output = func(*cpu_args, **cpu_kwargs)
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 622, in loss_mask_by_feat
pos_mask_logits = self._mask_predict_by_feat_single(
File "/opt/conda/envs/mmyolo-jkelle/lib/python3.8/site-packages/mmdet/models/dense_heads/rtmdet_ins_head.py", line 574, in _mask_predict_by_feat_single
relative_coord = (points - coord).permute(0, 2, 1) / (
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
It seems to me the implementation of loss_mask_by_feat is not compatible with AvoidCUDAOOM.retry_if_cuda_oom.
I made a workaround by creating a new type of Sampler that limits the number of positive detections. I basically copied code from the PseudoSampler and RandomSampler classes.
# Imports assumed for mmdet 3.x if this sampler lives in its own module:
import torch
from torch import Tensor

from mmengine.structures import InstanceData
from mmdet.registry import TASK_UTILS
from mmdet.models.task_modules.assigners import AssignResult
from mmdet.models.task_modules.samplers import BaseSampler, SamplingResult


@TASK_UTILS.register_module()
class CapNumPosSampler(BaseSampler):
def __init__(self, max_num_pos: int, **kwargs):
self.max_num_pos = max_num_pos
def _sample_neg(self, **kwargs):
raise NotImplementedError
def random_choice(self, gallery: Tensor, num: int) -> Tensor:
"""Random select some elements from the gallery.
Args:
gallery (Tensor): indices pool.
num (int): expected sample num.
Returns:
Tensor: sampled indices.
"""
assert len(gallery) >= num
is_tensor = isinstance(gallery, torch.Tensor)
assert is_tensor, 'Only support Tensor now, got {}'.format(type(gallery))
if not is_tensor:
if torch.cuda.is_available():
device = torch.cuda.current_device()
else:
device = 'cpu'
gallery = torch.tensor(gallery, dtype=torch.long, device=device)
# This is a temporary fix. We can revert the following code
# when PyTorch fixes the abnormal return of torch.randperm.
# See: https://github.com/open-mmlab/mmdetection/pull/5014
perm = torch.randperm(gallery.numel())[:num].to(device=gallery.device)
rand_inds = gallery[perm]
return rand_inds
def _sample_pos(self, assign_result: AssignResult, num_expected: int) -> Tensor:
"""Randomly sample some positive samples.
Args:
assign_result (:obj:`AssignResult`): Bbox assigning results.
num_expected (int): The number of expected positive samples
Returns:
Tensor or ndarray: sampled indices.
"""
pos_inds = torch.nonzero(assign_result.gt_inds > 0, as_tuple=False)
if pos_inds.numel() != 0:
pos_inds = pos_inds.squeeze(1)
if pos_inds.numel() <= num_expected:
return pos_inds
else:
return self.random_choice(pos_inds, num_expected)
def sample(self, assign_result: AssignResult, pred_instances: InstanceData,
gt_instances: InstanceData, *args, **kwargs):
"""Directly returns the positive and negative indices of samples.
Args:
assign_result (:obj:`AssignResult`): Bbox assigning results.
pred_instances (:obj:`InstanceData`): Instances of model
predictions. It includes ``priors``, and the priors can
be anchors, points, or bboxes predicted by the model,
shape(n, 4).
gt_instances (:obj:`InstanceData`): Ground truth of instance
annotations. It usually includes ``bboxes`` and ``labels``
attributes.
Returns:
:obj:`SamplingResult`: sampler results
"""
gt_bboxes = gt_instances.bboxes
priors = pred_instances.priors
pos_inds = self._sample_pos(assign_result, self.max_num_pos).unique()
neg_inds = torch.nonzero(
assign_result.gt_inds == 0, as_tuple=False).squeeze(-1).unique()
gt_flags = priors.new_zeros(priors.shape[0], dtype=torch.uint8)
sampling_result = SamplingResult(
pos_inds=pos_inds,
neg_inds=neg_inds,
priors=priors,
gt_bboxes=gt_bboxes,
assign_result=assign_result,
gt_flags=gt_flags,
avg_factor_with_neg=False)
return sampling_result
Then in my config:
...
train_cfg=dict(
sampler=dict(type="CapNumPosSampler", max_num_pos=2000),
...
),
...
custom_imports = dict(
imports=[
"cap_num_pos_sampler",
],
allow_failed_imports=False,
)
Then when you run training, make sure to load the custom module:
from mmengine.utils import import_modules_from_strings
import_modules_from_strings(**cfg["custom_imports"])
This bug also exists in mmyolo.
For those who got the OOM error during validation, I've found one problem. During validation, the inference input goes through the val_pipeline, which contains 'Resize', so inference itself is fine. But in post-processing, the output mask is interpolated to the original image size, and then sigmoid and thresholding are applied to get the mask output. Refer to the code snippet below: https://github.com/open-mmlab/mmdetection/blob/f78af7785ada87f1ced75a2313746e4ba3149760/mmdet/models/dense_heads/rtmdet_ins_head.py#L498-L510 This can be extremely memory-costly if your original image has a large resolution, e.g. a 4000x3000 image will have a mask output tensor of 100x4000x3000, which costs over 4 GB of memory! (And that's just a single tensor; there can be several temporary tensors of the same size.) I haven't found an effective solution yet. If you set the 'rescale' parameter to false, the output mask won't be scaled to match the original image size, but this leads to wrong metric calculation. I tried putting the sigmoid before the interpolation, which does save some memory, but not much. I think one solution would be to set 'rescale' to false and, when calculating validation metrics, resize the original image to match the output mask size.
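To make the cost concrete: a single float32 tensor of shape 100x4000x3000 is about 4.5 GiB. As one hedged, illustrative mitigation (not the actual mmdet post-processing code), the predicted masks could be upsampled and thresholded a few instances at a time, so that only a small chunk is ever held at full resolution:

import torch
import torch.nn.functional as F

def upsample_masks_in_chunks(mask_logits: torch.Tensor, out_h: int, out_w: int,
                             chunk: int = 10, thr: float = 0.5) -> torch.Tensor:
    # mask_logits: (num_instances, h, w) low-resolution mask logits.
    # Interpolate + sigmoid + threshold `chunk` instances at a time and keep the
    # result as bool so the peak full-resolution footprint stays small.
    outs = []
    for i in range(0, mask_logits.size(0), chunk):
        part = mask_logits[i:i + chunk].unsqueeze(1)  # (c, 1, h, w)
        part = F.interpolate(part, size=(out_h, out_w),
                             mode='bilinear', align_corners=False)
        outs.append(part.squeeze(1).sigmoid() > thr)
    return torch.cat(outs, dim=0)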
@SimonGuoNjust Can you please share your implementation of gpu_assign_thr in DynamicSoftLabelAssigner? Thanks.
The bug still exists, I got the same error during validation.
I wrote my own version of gpu_assign_thr in DynamicSoftLabelAssigner. It solves the out-of-memory error during training, as the computations now happen on the CPU and are then passed back to the GPU at the end.
# Imports assumed for mmdet 3.x if this modified assigner lives in its own module
# (EPS, INF and center_of_mass are reused from the stock implementation):
from typing import Optional, Tuple

import torch
import torch.nn.functional as F
from torch import Tensor

from mmengine.structures import InstanceData
from mmdet.registry import TASK_UTILS
from mmdet.structures.bbox import BaseBoxes
from mmdet.utils import ConfigType
from mmdet.models.task_modules.assigners import AssignResult, BaseAssigner
from mmdet.models.task_modules.assigners.dynamic_soft_label_assigner import (
    EPS, INF, center_of_mass)


# force=True lets this class override the stock DynamicSoftLabelAssigner registration
# when it is kept in a separate module instead of editing the mmdet source in place.
@TASK_UTILS.register_module(force=True)
class DynamicSoftLabelAssigner(BaseAssigner):
"""Computes matching between predictions and ground truth with dynamic soft
label assignment.
Args:
soft_center_radius (float): Radius of the soft center prior.
Defaults to 3.0.
topk (int): Select top-k predictions to calculate dynamic k
best matches for each gt. Defaults to 13.
iou_weight (float): The scale factor of iou cost. Defaults to 3.0.
iou_calculator (ConfigType): Config of overlaps Calculator.
Defaults to dict(type='BboxOverlaps2D').
"""
def __init__(
self,
soft_center_radius: float = 3.0,
topk: int = 13,
iou_weight: float = 3.0,
gpu_assign_thr: float = -1,
iou_calculator: ConfigType = dict(type='BboxOverlaps2D')):
self.soft_center_radius = soft_center_radius
self.topk = topk
self.iou_weight = iou_weight
# ic(gpu_assign_thr)
self.gpu_assign_thr = gpu_assign_thr
self.iou_calculator = TASK_UTILS.build(iou_calculator)
def assign(self,
pred_instances: InstanceData,
gt_instances: InstanceData,
gt_instances_ignore: Optional[InstanceData] = None,
**kwargs) -> AssignResult:
"""Assign gt to priors.
Args:
pred_instances (:obj:`InstanceData`): Instances of model
predictions. It includes ``priors``, and the priors can
be anchors or points, or the bboxes predicted by the
previous stage, has shape (n, 4). The bboxes predicted by
the current model or stage will be named ``bboxes``,
``labels``, and ``scores``, the same as the ``InstanceData``
in other places.
gt_instances (:obj:`InstanceData`): Ground truth of instance
annotations. It usually includes ``bboxes``, with shape (k, 4),
and ``labels``, with shape (k, ).
gt_instances_ignore (:obj:`InstanceData`, optional): Instances
to be ignored during training. It includes ``bboxes``
attribute data that is ignored during training and testing.
Defaults to None.
Returns:
obj:`AssignResult`: The assigned result.
"""
gt_bboxes = gt_instances.bboxes
gt_labels = gt_instances.labels
num_gt = gt_bboxes.size(0)
decoded_bboxes = pred_instances.bboxes
pred_scores = pred_instances.scores
priors = pred_instances.priors
num_bboxes = decoded_bboxes.size(0)
# ic(gt_bboxes.shape[0])
# ic(self.gpu_assign_thr)
assign_on_cpu = True if (self.gpu_assign_thr > 0) and (
gt_bboxes.shape[0] > self.gpu_assign_thr) else False
# ic(assign_on_cpu)
# compute overlap and assign gt on CPU when number of GT is large
if assign_on_cpu:
# ic('assigning on cpu')
device = priors.device
priors = priors.cpu()
gt_bboxes = gt_bboxes.cpu()
gt_labels = gt_labels.cpu()
decoded_bboxes = decoded_bboxes.cpu()
pred_scores = pred_scores.cpu()
# if gt_bboxes_ignore is not None:
# gt_bboxes_ignore = gt_bboxes_ignore.cpu()
# assign 0 by default
assigned_gt_inds = decoded_bboxes.new_full((num_bboxes, ),
0,
dtype=torch.long)
if num_gt == 0 or num_bboxes == 0:
# No ground truth or boxes, return empty assignment
max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
if num_gt == 0:
# No truth, assign everything to background
assigned_gt_inds[:] = 0
assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
-1,
dtype=torch.long)
if assign_on_cpu:
# num_gt = num_gt.to(device)
assigned_gt_inds = assigned_gt_inds.to(device)
max_overlaps = max_overlaps.to(device)
assigned_labels = assigned_labels.to(device)
return AssignResult(
num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)
prior_center = priors[:, :2]
if isinstance(gt_bboxes, BaseBoxes):
is_in_gts = gt_bboxes.find_inside_points(prior_center)
else:
# Tensor boxes will be treated as horizontal boxes by defaults
lt_ = prior_center[:, None] - gt_bboxes[:, :2]
rb_ = gt_bboxes[:, 2:] - prior_center[:, None]
deltas = torch.cat([lt_, rb_], dim=-1)
is_in_gts = deltas.min(dim=-1).values > 0
valid_mask = is_in_gts.sum(dim=1) > 0
valid_decoded_bbox = decoded_bboxes[valid_mask]
valid_pred_scores = pred_scores[valid_mask]
num_valid = valid_decoded_bbox.size(0)
if num_valid == 0:
# No ground truth or boxes, return empty assignment
max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
-1,
dtype=torch.long)
if assign_on_cpu:
# num_gt = num_gt.to(device)
assigned_gt_inds = assigned_gt_inds.to(device)
max_overlaps = max_overlaps.to(device)
assigned_labels = assigned_labels.to(device)
return AssignResult(
num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)
if hasattr(gt_instances, 'masks'):
gt_center = center_of_mass(gt_instances.masks, eps=EPS)
elif isinstance(gt_bboxes, BaseBoxes):
gt_center = gt_bboxes.centers
else:
# Tensor boxes will be treated as horizontal boxes by defaults
gt_center = (gt_bboxes[:, :2] + gt_bboxes[:, 2:]) / 2.0
valid_prior = priors[valid_mask]
strides = valid_prior[:, 2]
distance = (valid_prior[:, None, :2] - gt_center[None, :, :]
).pow(2).sum(-1).sqrt() / strides[:, None]
soft_center_prior = torch.pow(10, distance - self.soft_center_radius)
pairwise_ious = self.iou_calculator(valid_decoded_bbox, gt_bboxes)
iou_cost = -torch.log(pairwise_ious + EPS) * self.iou_weight
gt_onehot_label = (
F.one_hot(gt_labels.to(torch.int64),
pred_scores.shape[-1]).float().unsqueeze(0).repeat(
num_valid, 1, 1))
valid_pred_scores = valid_pred_scores.unsqueeze(1).repeat(1, num_gt, 1)
soft_label = gt_onehot_label * pairwise_ious[..., None]
scale_factor = soft_label - valid_pred_scores.sigmoid()
soft_cls_cost = F.binary_cross_entropy_with_logits(
valid_pred_scores, soft_label,
reduction='none') * scale_factor.abs().pow(2.0)
soft_cls_cost = soft_cls_cost.sum(dim=-1)
cost_matrix = soft_cls_cost + iou_cost + soft_center_prior
matched_pred_ious, matched_gt_inds = self.dynamic_k_matching(
cost_matrix, pairwise_ious, num_gt, valid_mask)
# convert to AssignResult format
assigned_gt_inds[valid_mask] = matched_gt_inds + 1
assigned_labels = assigned_gt_inds.new_full((num_bboxes, ), -1)
assigned_labels[valid_mask] = gt_labels[matched_gt_inds].long()
max_overlaps = assigned_gt_inds.new_full((num_bboxes, ),
-INF,
dtype=torch.float32)
max_overlaps[valid_mask] = matched_pred_ious
if assign_on_cpu:
# num_gt = num_gt.to(device)
assigned_gt_inds = assigned_gt_inds.to(device)
max_overlaps = max_overlaps.to(device)
assigned_labels = assigned_labels.to(device)
return AssignResult(
num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)
def dynamic_k_matching(self, cost: Tensor, pairwise_ious: Tensor,
num_gt: int,
valid_mask: Tensor) -> Tuple[Tensor, Tensor]:
"""Use IoU and matching cost to calculate the dynamic top-k positive
targets. Same as SimOTA.
Args:
cost (Tensor): Cost matrix.
pairwise_ious (Tensor): Pairwise iou matrix.
num_gt (int): Number of gt.
valid_mask (Tensor): Mask for valid bboxes.
Returns:
tuple: matched ious and gt indexes.
"""
matching_matrix = torch.zeros_like(cost, dtype=torch.uint8)
# select candidate topk ious for dynamic-k calculation
candidate_topk = min(self.topk, pairwise_ious.size(0))
topk_ious, _ = torch.topk(pairwise_ious, candidate_topk, dim=0)
# calculate dynamic k for each gt
dynamic_ks = torch.clamp(topk_ious.sum(0).int(), min=1)
for gt_idx in range(num_gt):
_, pos_idx = torch.topk(
cost[:, gt_idx], k=dynamic_ks[gt_idx], largest=False)
matching_matrix[:, gt_idx][pos_idx] = 1
del topk_ious, dynamic_ks, pos_idx
prior_match_gt_mask = matching_matrix.sum(1) > 1
if prior_match_gt_mask.sum() > 0:
cost_min, cost_argmin = torch.min(
cost[prior_match_gt_mask, :], dim=1)
matching_matrix[prior_match_gt_mask, :] *= 0
matching_matrix[prior_match_gt_mask, cost_argmin] = 1
# get foreground mask inside box and center prior
fg_mask_inboxes = matching_matrix.sum(1) > 0
valid_mask[valid_mask.clone()] = fg_mask_inboxes
matched_gt_inds = matching_matrix[fg_mask_inboxes, :].argmax(1)
matched_pred_ious = (matching_matrix *
pairwise_ious).sum(1)[fg_mask_inboxes]
return matched_pred_ious, matched_gt_inds
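Assuming the modified assigner above is importable (see the custom_imports pattern used for the sampler earlier in this thread), a hedged config sketch that enables the CPU fallback once an image has more than, say, 100 ground truths; the threshold is arbitrary and the rest of train_cfg should stay as in the stock RTMDet-Ins config:

model = dict(
    train_cfg=dict(
        assigner=dict(
            type='DynamicSoftLabelAssigner', topk=13, gpu_assign_thr=100)))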
The bug still exists, I got the same error during validation.
me too
The bug still exists, I got the same error during validation.
I was facing the same issue and I was able to solve it in 2 ways:
1. fast_test takes True as its value; give it False instead.
2. Lower the max_per_img parameter (in my case, 300 => 100).
Only one of these steps was enough for me.
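For the max_per_img part, that knob lives in the model's test_cfg. A sketch based on the stock RTMDet-Ins test settings (treat the surrounding values as illustrative defaults):

model = dict(
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.6),
        max_per_img=100,  # lowered from 300, as suggested above
        mask_thr_binary=0.5))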
The bug still exists, I got the same error during validation.
During the validation phase of training, memory usage increases significantly and eventually leads to an OOM error, which in my case was caused by the excessively large image resolution of my dataset (6240x4160). I solved the problem by reducing the image resolution of the dataset (to 1660x1080).
I wrote my own version of gpu_assign_thr in DynamicSoftLabelAssigner. It solves the out-of-memory error during training, as the computations now happen on the CPU and are then passed back to the GPU at the end. (Code quoted above.)
This solution slows down training quite significantly.
When I use 4x12G (3070Ti) GPUs to train RTMDet-ins-cspnext-tiny, some GPUs have very low memory usage, while others are very saturated.
This solution slows down training quite significantly.
Certainly, the calculations are being performed on the CPU, which is relatively slow. As an alternative, I implemented a try-except block to handle CUDA out-of-memory (OOM) errors, so the assignment runs on the CPU only when the GPU runs out of memory.
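A rough sketch of that try/except idea (the function name and structure here are made up for illustration; the point is catching the CUDA OOM RuntimeError, clearing the cache, and rerunning the assignment on CPU before moving the AssignResult back):

import torch

def assign_with_cpu_fallback(assigner, pred_instances, gt_instances, **kwargs):
    # Try label assignment on the GPU first; fall back to CPU only on CUDA OOM.
    try:
        return assigner.assign(pred_instances, gt_instances, **kwargs)
    except RuntimeError as err:
        if 'out of memory' not in str(err):
            raise
        torch.cuda.empty_cache()
        device = pred_instances.priors.device
        result = assigner.assign(
            pred_instances.to('cpu'), gt_instances.to('cpu'), **kwargs)
        # Move the AssignResult tensors back to the original device.
        result.gt_inds = result.gt_inds.to(device)
        result.max_overlaps = result.max_overlaps.to(device)
        if result.labels is not None:
            result.labels = result.labels.to(device)
        return result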
I found that you can solve this issue by limiting the number of bounding boxes used per image. In mmyolo/data/transforms.py, add the following class to randomly pick at most a given number of boxes each time the image is loaded:
# Imports assumed (numpy plus the TRANSFORMS registry; use mmdet.registry instead of
# mmyolo.registry if you register the transform in mmdetection):
import numpy as np
from mmyolo.registry import TRANSFORMS


@TRANSFORMS.register_module()
class LimitBBoxes:
def __init__(self, max_bboxes):
self.max_bboxes = max_bboxes
def __call__(self, results):
num_bboxes = len(results['gt_bboxes'])
if num_bboxes > self.max_bboxes:
indices = np.random.choice(num_bboxes, self.max_bboxes, replace=False)
results['gt_bboxes'] = results['gt_bboxes'][indices]
if 'gt_ignore_flags' in results:
results['gt_ignore_flags'] = results['gt_ignore_flags'][indices]
if 'gt_bboxes_labels' in results:
results['gt_bboxes_labels'] = results['gt_bboxes_labels'][indices]
if 'gt_labels' in results:
results['gt_labels'] = results['gt_labels'][indices]
return results
Also add this new class to the __init__.py file in the "transforms" folder, and finally add it to the config like so:
train_pipeline = [
dict(type='LoadImageFromFile', backend_args=_base_.backend_args),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='LimitBBoxes', max_bboxes=10),
........
Does this transform also reduce the number of masks?
@big-gandalf No, in my case I did not use the masks. You can easily add a limit for the masks in a similar way, by replacing the 'gt_bboxes' key with the key for the masks.
Actually I am a little confused. I am trying to implement your solution for my RTMDet instance segmentation model. In my case I am not interested in the bboxes, and I don't want to reduce the number of masks. Does this solution help in my case?
If you are using instance segmentation it's the same thing: the code will crash if the number of instances (masks) per image is too high, so you need to apply the same solution.
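For completeness, a hedged variant of the transform above that also subsamples the masks (it assumes mmdet 3.x result keys, where 'gt_masks' is a BitmapMasks/PolygonMasks object that supports index arrays; the class and parameter names here are made up):

import numpy as np
from mmdet.registry import TRANSFORMS  # or mmyolo.registry, depending on where you register it


@TRANSFORMS.register_module()
class LimitInstances:
    """Randomly keep at most `max_instances` boxes/labels/masks per image."""

    def __init__(self, max_instances: int):
        self.max_instances = max_instances

    def __call__(self, results: dict) -> dict:
        num = len(results['gt_bboxes'])
        if num > self.max_instances:
            keep = np.random.choice(num, self.max_instances, replace=False)
            for key in ('gt_bboxes', 'gt_bboxes_labels', 'gt_ignore_flags', 'gt_masks'):
                if key in results:
                    results[key] = results[key][keep]
        return results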
Prerequisite
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
3.x branch https://github.com/open-mmlab/mmdetection/tree/3.x
Environment
Additional installation/environment information
Installed inside a docker container based on the example Dockerfile, but it pulls dev-3.x because I started working on this before it was merged into 3.x. I verified that there haven't been any changes to the specific code snippets that would help with the OOM error.
Reproduces the problem - code sample
Config file run for training. Classes and meta info hidden for privacy:
Additional information
Expected Result
Training without an OOM error.
Dataset
Custom dataset of 100 images for instance segmentation with >200 polygons per image. Resolution of images: 1280 x 720
Hardware
NVIDIA RTX 2060. Also tried training on NVIDIA RTX 3080. Both have a GPU memory of 12 GB.
Additional description/information
Based on reading the FAQ and looking through issues #188 and #1581 (https://github.com/open-mmlab/mmdetection/issues/1581), and given the high number of ground truths per image, I assumed that the problem was that gpu_assign_thr needed to be set so that the assign computation takes place on the CPU instead of the GPU. However, RTMDet uses DynamicSoftLabelAssigner and not MaxIoUAssigner, and DynamicSoftLabelAssigner does not have a configurable gpu_assign_thr parameter. Switching the assigner to MaxIoUAssigner in the config as shown below resulted in the following output:
However, switching to MaxIoUAssigner did not lead to an OOM error for multiple epochs, which biases me to think the problem is the high number of polygons. But inference with the trained model outputs no predictions and, as shown in the log above, throws an error saying that the testing results of the whole dataset are empty. Reading through the issues (#9381), this is sometimes attributed to an incorrect format of the ground truth labels, but since the data has not changed, this doesn't seem plausible.
To summarize:
1. Can gpu_assign_thr be added to DynamicSoftLabelAssigner?
2. Why does MaxIoUAssigner with RTMDet result in no inference results? This seems like a bug.
3. with_cp (suggested in the FAQ for OOM issues) does not exist in CSPNeXt, which is the backbone for RTMDet. Are there plans to add it?
I'm not sure if this is entirely a bug or a feature request, but it seems to be a bit of both.