Closed by trongthuan205 3 years ago
While we have not tested the code on custom datasets, it works consistently on both COCO and LVIS, so I expect it to work on custom datasets as well.
From the error, I infer that the RPN is also expected to have 4 classes in your case. However, the RPN predicts only a single class, objectness (this is the same for the standard RPN and ours). I suggest checking that part of the code.
Could you please share your full configuration file, including all details inherited from the base configs? Then I may be able to comment further.
Thank you. I have fixed the previous error, but now I get a new one:
RuntimeError: invalid argument 1: cannot perform reduction function min on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/THC/generic/THCTensorMathReduce.cu:64
I trained the model in Google Colab Pro.
I think the error is triggered by a torch.min(tensor) call where the "tensor" argument is empty. However, I cannot tell whether it is related to Rank & Sort Loss and, even if it is, we should be able to see whether the error is triggered in the RPN or in R-CNN. So, can you please provide the call stack where the error occurred (the full error output should be enough)? One more question: what batch size are you using?
Yes, the error occurs when torch.min is called on an empty tensor. My batch_size is 2.
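To make the suspected failure mode concrete, here is a minimal sketch of how an empty foreground selection makes torch.min fail. The tensors and the helper min_or_none are hypothetical and only for illustration; they are not taken from the RS Loss code.

```python
import torch


def min_or_none(t: torch.Tensor):
    """Hypothetical debugging helper: return t.min(), or None if t has no elements."""
    if t.numel() == 0:
        return None
    return torch.min(t)


logits = torch.tensor([0.3, -1.2, 2.0])
targets = torch.zeros(3)           # no target is > 0, so nothing counts as foreground
fg_logits = logits[targets > 0]    # boolean indexing yields an empty tensor

# Calling torch.min without a dim on an empty tensor raises a RuntimeError.
try:
    torch.min(fg_logits)
    raised = False
except RuntimeError:
    raised = True

print("torch.min raised on empty input:", raised)
print("guarded result:", min_or_none(fg_logits))
```

Printing fg_logits.numel() right before the failing call is a quick way to confirm whether this is what happens in a given run.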
OK. I have a guess but just to be sure: Could you please first check the values of pos_idx.sum() and IoU_targets in the following line when you have this error:
My guess is: if pos_idx.sum()>0 and IoU_targets is non-empty but all zeros, it implies that the box regression head regresses all the anchors to boxes with 0 IoU, so RS Loss treats them as negatives. If that's the case, you should be able to fix the problem by adding a small epsilon (e.g. 1e-8) to the IoUs in the same line (but please also ensure that the IoUs do not exceed 1.0):
Maybe we need to think of a better solution for all architectures, but "flat_labels[flat_labels==1]=torch.clamp(IoU_targets+1e-8, max = 1.0)" can be a simple fix for you.
Otherwise, looking at the values (pos_idx.sum() and IoU_targets) you provide, I can try to make further comments.
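The effect of the epsilon fix can be checked on toy data. The sketch below assumes that RS Loss treats every target greater than 0 as a positive (inferred from the remark that zero-IoU examples are considered negatives); the tensor sizes and the variables plain and fixed are made up for illustration.

```python
import torch

# Hypothetical stand-ins for the variables named in the thread.
pos_idx = torch.zeros(100, dtype=torch.bool)
pos_idx[:31] = True                      # 31 positives, as in the reported run
flat_labels = torch.zeros(100)
flat_labels[pos_idx] = 1.0               # positives are marked with label 1
IoU_targets = torch.zeros(31)            # every positive has IoU 0 after regression

# Without epsilon: positives receive target 0 and are no longer counted as positives.
plain = flat_labels.clone()
plain[plain == 1] = IoU_targets
n_plain = int((plain > 0).sum())

# With the suggested fix: targets stay strictly positive and never exceed 1.0.
fixed = flat_labels.clone()
fixed[fixed == 1] = torch.clamp(IoU_targets + 1e-8, max=1.0)
n_fixed = int((fixed > 0).sum())

print(int(pos_idx.sum()), n_plain, n_fixed)
```

If the count after the fix still differs from pos_idx.sum() in the real code, the edited line is probably not the one that produces the targets passed to the loss.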
pos_idx.sum() = tensor(31, device='cuda:0'), and IoU_targets is non-empty but all zeros, as you guessed. I edited the line as you instructed, but I still get the same error. Could you give me more recommendations?
Now there are positives for sure, since we add some epsilon, so it should have worked. That is why I rechecked your screenshot of the error: in the screenshot, the error occurs at line 23, at the command "threshold_logit = torch.min(fg_logits)-delta_RS". However, in our release this command is at line 17, as follows:
So, it seems you added 6 extra lines. Are you sure your modifications to RS Loss do not trigger the error? I recommend first confirming that you have not modified the implementation of RS Loss. Also, the number of positives given by pos_idx.sum() should be equal to the number of positives computed in the following line (please also confirm this one; we have not had any problem with COCO or LVIS without any modification, but after my recommendation these two values have to be equal):
Furthermore, you have 31 positive examples in your last example and all of them have 0 IoU after regression (note that they usually have IoU>0.70 before regression, owing to the assignment rule in the RPN). This also does not make sense to me, since I do not expect the regressor to move all of these high-quality positives to locations with IoU=0. If it were 1 or 2, it could be acceptable to some extent. This is just my intuition, and it may help you search for your error.
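One way to follow up on this is to compare the IoU of the positive anchors with their ground-truth boxes before and after regression. The function below is a minimal sketch that assumes boxes in (x1, y1, x2, y2) format; aligned_iou and the example boxes are hypothetical and not part of the released code.

```python
import torch


def aligned_iou(boxes: torch.Tensor, gts: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """IoU between boxes[i] and gts[i]; both are (N, 4) tensors in (x1, y1, x2, y2)."""
    lt = torch.max(boxes[:, :2], gts[:, :2])     # top-left corner of the intersection
    rb = torch.min(boxes[:, 2:], gts[:, 2:])     # bottom-right corner of the intersection
    wh = (rb - lt).clamp(min=0)                  # zero width/height when boxes do not overlap
    inter = wh[:, 0] * wh[:, 1]
    area_b = (boxes[:, 2] - boxes[:, 0]).clamp(min=0) * (boxes[:, 3] - boxes[:, 1]).clamp(min=0)
    area_g = (gts[:, 2] - gts[:, 0]).clamp(min=0) * (gts[:, 3] - gts[:, 1]).clamp(min=0)
    union = area_b + area_g - inter
    return inter / union.clamp(min=eps)


gts = torch.tensor([[0.0, 0.0, 10.0, 8.0]])
anchors = torch.tensor([[0.0, 0.0, 10.0, 10.0]])     # positive anchor before regression
decoded = torch.tensor([[20.0, 20.0, 30.0, 30.0]])   # a decoded box that drifted away

iou_before = aligned_iou(anchors, gts)
iou_after = aligned_iou(decoded, gts)
print(iou_before, iou_after)
```

If the IoU is high before regression and exactly 0 for every positive afterwards, the decoded boxes themselves (coordinate format, scaling, NaNs or degenerate sizes) are worth inspecting.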
Thanks for your response,
I confirm that I didn't edit anything in ranking_losses.py. I only added a few lines to print the values. After removing them, threshold_logit = torch.min(fg_logits)-delta_RS is back at line 17.
I will try to debug as you recommended. However, if you find new problems in this code, could you give me more suggestions?
Thank you very much.
I trained on my custom dataset with 4 classes. With RPNHead, it worked normally. However, when I set rpn_head to RankBasedRPNHead, I got an error: AssertionError: The num_classes (1) in RankBasedRPNHead of MMDataParallel does not matches the length of CLASSES 4) in CocoDataset. I used the same config and only changed num_classes from 80 to 4 to fit my dataset.
Thanks.