kemaloksuz / RankSortLoss

Official PyTorch Implementation of Rank & Sort Loss for Object Detection and Instance Segmentation [ICCV2021]
Apache License 2.0

RuntimeError: invalid argument 1: cannot perform reduction function min on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/THC/generic/THCTensorMathReduce.cu:64 #4

Closed trongthuan205 closed 3 years ago

trongthuan205 commented 3 years ago

I trained on my custom dataset with 4 classes. With RPNHead, training worked normally. However, when I set rpn_head to RankBasedRPNHead, I got an error: AssertionError: The num_classes (1) in RankBasedRPNHead of MMDataParallel does not matches the length of CLASSES 4) in CocoDataset.

I used the same config and only changed num_classes from 80 to 4 to fit my dataset.

Thanks.

kemaloksuz commented 3 years ago

While we have not tested the code on custom datasets, it works consistently for both COCO and LVIS, so I expect it to work with custom datasets as well.

From the error, I infer that the RPN is also expected to have 4 classes in your case. However, the RPN yields a single objectness output rather than one output per class (this is the same for both the standard RPN and ours). I think you can check that part of the code.

Could you please share your full configuration file, including everything inherited from the base configs? Then I may be able to comment further.

trongthuan205 commented 3 years ago

Thank you. I have fixed the error above, but now I get a new one:

RuntimeError: invalid argument 1: cannot perform reduction function min on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/THC/generic/THCTensorMathReduce.cu:64

I trained the model in Google Colab Pro.

kemaloksuz commented 3 years ago

I think the error is triggered by a torch.min(tensor) call whose "tensor" argument is empty. However, I cannot tell whether it is related to Rank & Sort Loss, and even if it is, we should be able to see whether the error is triggered in the RPN or the R-CNN. So, could you please provide the call stack where the error occurred (the full error output would probably be enough)? One more question: what batch size are you using?
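A minimal sketch of a guard against this situation, assuming the failing call is a plain `torch.min` over a tensor that can end up with zero elements. The helper name `safe_min` is hypothetical and is not part of the repository.

```python
import torch


def safe_min(t: torch.Tensor):
    """Return the minimum of `t`, or None when `t` has no elements.

    `safe_min` is a hypothetical helper for illustration only.
    """
    # A full reduction such as torch.min has no identity value,
    # so it cannot be evaluated on a tensor with zero elements.
    if t.numel() == 0:
        return None
    return torch.min(t)


fg_logits = torch.tensor([3.0, 1.0, 2.0])  # made-up foreground logits
no_fg_logits = torch.empty(0)              # no foreground examples at all

min_logit = safe_min(fg_logits)
min_of_empty = safe_min(no_fg_logits)
```

Checking `numel()` before the reduction only makes the empty case visible. Why the tensor is empty still has to be found separately.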

trongthuan205 commented 3 years ago

[image: screenshot of the error]

Yes, the error occurs when torch.min is called on an empty tensor. My batch_size is 2.

kemaloksuz commented 3 years ago

OK. I have a guess but just to be sure: Could you please first check the values of pos_idx.sum() and IoU_targets in the following line when you have this error:

https://github.com/kemaloksuz/RankSortLoss/blob/ec5e2d8cf5aba4633ca2f8c9bf23d4528413cb56/mmdet/models/dense_heads/rank_based_rpn_head.py#L156

My guess is this: if pos_idx.sum()>0 and IoU_targets is non-empty but all zeros, then the box regression head is regressing all of these anchors to boxes with 0 IoU, and RS Loss treats them as negatives. If that is the case, you should be able to fix the problem by adding a small epsilon (e.g. 1e-8) to the IoUs in the same line, while also ensuring that the IoUs do not exceed 1.0.

Maybe we need to think of a better solution for all architectures, but "flat_labels[flat_labels==1]=torch.clamp(IoU_targets+1e-8, max = 1.0)" can be a simple fix for you.

Otherwise, once you share the values of pos_idx.sum() and IoU_targets, I can try to comment further.
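The effect of the suggested one-line fix can be tried in isolation. The sketch below uses small hand-made tensors in place of the real `flat_labels` and `IoU_targets`, and it assumes that an entry counts as a positive only when its label is greater than zero.

```python
import torch

# Hand-made stand-ins: 5 anchors, 3 of them marked as positives (label 1).
flat_labels = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0])
# Degenerate case from the thread: every positive has IoU 0 after regression.
IoU_targets = torch.zeros(3)

# Without epsilon, the positive labels are overwritten with 0 and disappear.
without_eps = flat_labels.clone()
without_eps[without_eps == 1] = IoU_targets
num_pos_without = int((without_eps > 0).sum())

# With epsilon (clamped so that no IoU exceeds 1.0), they stay above zero.
with_eps = flat_labels.clone()
with_eps[with_eps == 1] = torch.clamp(IoU_targets + 1e-8, max=1.0)
num_pos_with = int((with_eps > 0).sum())

print(num_pos_without, num_pos_with)
```

In this toy setting the count of positive labels drops to zero without the epsilon and stays at three with it.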

trongthuan205 commented 3 years ago

pos_idx.sum() = tensor(31, device='cuda:0'), and IoU_targets is non-empty but all zeros, as you guessed. I made the edit you suggested, but I still get the same error. Could you give me more recommendations?

kemaloksuz commented 3 years ago

Now there are certainly positives, since we added an epsilon, so it should have worked. That is why I rechecked your screenshot of the error: there, the error occurs at line 23, at the statement "threshold_logit = torch.min(fg_logits)-delta_RS". In our release, however, this statement is at line 17:

https://github.com/kemaloksuz/RankSortLoss/blob/ec5e2d8cf5aba4633ca2f8c9bf23d4528413cb56/mmdet/models/losses/ranking_losses.py#L17

So it seems you added 6 extra lines. Are you sure your modifications to RS Loss do not trigger the error? I recommend first confirming that you have not modified the implementation of RS Loss. Also, the number of positives you get from pos_idx.sum() should equal the number of positives computed in the following line. Please confirm this as well: we have not had any problem with COCO or LVIS without any modification, but after my recommended change these two values have to be equal:

https://github.com/kemaloksuz/RankSortLoss/blob/ec5e2d8cf5aba4633ca2f8c9bf23d4528413cb56/mmdet/models/losses/ranking_losses.py#L13

Furthermore, you have 31 positive examples in your last example, and all of them have 0 IoU after regression (note that they usually have IoU>0.70 before regression, owing to the assignment rule in RPN). This does not make sense to me either, since I do not expect the regressor to move all of these high-quality positives to locations with IoU=0. If it were 1 or 2 of them, that could be acceptable to some extent. This is just my intuition, but it may help you search for the error.
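The consistency check mentioned above (pos_idx.sum() against the number of positives seen by the loss) could be sketched as follows. This assumes that the loss selects foreground entries with a `targets > 0` mask, which is an assumption of this sketch and is not shown in the thread. The function name `count_positives` and the tensors are made up for illustration.

```python
import torch


def count_positives(pos_idx: torch.Tensor, targets: torch.Tensor):
    """Return (positives counted in the head, positives seen by the loss).

    Assumption: the loss treats entries with `targets > 0` as foreground.
    """
    head_pos = int(pos_idx.sum())
    loss_pos = int((targets > 0).sum())
    return head_pos, loss_pos


# Made-up example: 3 positive anchors whose IoU targets are all zero.
pos_idx = torch.tensor([False, True, True, False, True])
zero_targets = torch.zeros(5)
print(count_positives(pos_idx, zero_targets))  # (3, 0): the counts disagree

# Same positives after an epsilon has been added to their IoU targets.
eps_targets = torch.zeros(5)
eps_targets[pos_idx] = 1e-8
print(count_positives(pos_idx, eps_targets))   # (3, 3): the counts agree
```

If the two counts still differ after the epsilon change, the labels reaching the loss are probably not the ones built in the head, which narrows down where to look.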

trongthuan205 commented 3 years ago

Thanks for your response,

I confirm that I didn't edit anything in ranking_losses.py. I only added a few lines to print the values. When I remove them, the statement threshold_logit = torch.min(fg_logits)-delta_RS returns to line 17.

[image]

I will try to debug as you recommended. However, if you find new problems in this code, could you give me more suggestions?

Thank you very much.