There are a few things you should know before writing a custom loss function.
First: during training, we run inference and backpropagation on the predictions for both the current frame and the global reference frames, because this converges faster and performs better.
Second: because of this, if you look carefully at the first dimension of outputs['pred_logits'], outputs['pred_boxes'], and targets, you will notice that it is either 2 or 5:
If 5: ImageNetVID dataset, with a 1 (current) + 4 (global ref) configuration.
If 2: ImageNetDet dataset, with a 1 (current) + 1 (copy of current) configuration, since no reference frames are available.
Accordingly, for a single iteration you need to calculate the loss over the boxes of up to 5 frames. See the loss_labels() function in loss.py for how this is done in detail.
Also, if you look at the arguments of the loss_labels() function, you'll notice that it takes 'indices' in addition to outputs and targets.
The 'indices' variable is created by self.matcher (which is a Hungarian matcher). The model predicts 300 boxes per frame, while the number of ground-truth (GT) boxes is much smaller (typically 0-5). Therefore, the matcher assigns each GT label to the predicted box that has the highest IoU with the corresponding GT box. This ensures that the trainer calculates losses only on predicted boxes that lie close to ground-truth foreground objects.
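To make the role of 'indices' concrete, here is a minimal sketch of how matched pairs can be turned into per-query class targets. It assumes the DETR-style convention where indices holds one (pred_idx, tgt_idx) pair of index lists per frame and where unmatched queries get a "no object" class. Please check loss_labels() in loss.py for what the repo actually does. The function name build_target_classes, the toy sizes, and the no_object_class value are hypothetical, and plain Python lists stand in for tensors.

```python
def build_target_classes(indices, targets, num_queries, no_object_class):
    """Return one list of class ids per frame, each of length num_queries.

    indices: per frame, a (pred_idx, tgt_idx) pair of equal-length index lists
             (assumed DETR-style matcher output).
    targets: per frame, a dict with a "labels" list of GT class ids.
    """
    target_classes = []
    for (pred_idx, tgt_idx), frame_target in zip(indices, targets):
        # Every query starts as background ("no object").
        frame_classes = [no_object_class] * num_queries
        # Only the matched queries receive a GT label.
        for p, t in zip(pred_idx, tgt_idx):
            frame_classes[p] = frame_target["labels"][t]
        target_classes.append(frame_classes)
    return target_classes


# Toy example: 2 frames and 6 queries per frame (the real model uses 300).
targets = [{"labels": [3, 7]}, {"labels": []}]   # frame 1 has no GT boxes
indices = [([4, 1], [0, 1]), ([], [])]           # query 4 <-> GT 0, query 1 <-> GT 1
classes = build_target_classes(indices, targets, num_queries=6, no_object_class=30)
```

In this toy case only queries 4 and 1 of the first frame get foreground labels. Everything else, including every query of the second frame, stays at the placeholder background id.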
So you may need to make two modifications to get a working loss function:
First, make sure your custom loss function handles multiple frames (this may resolve the size mismatch error).
Second, make sure your custom loss uses the indices argument generated by the Hungarian matcher.
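Both points can be sketched together as follows. This is only an illustration, not the repo's loss: it assumes the same DETR-style indices format as loss_labels() receives, with one (pred_idx, tgt_idx) pair per frame, and it uses a plain L1 box distance as a stand-in for whatever your custom loss computes. The function name matched_l1_box_loss and the toy boxes are hypothetical, and lists stand in for tensors.

```python
def matched_l1_box_loss(pred_boxes, targets, indices):
    """Mean L1 distance over matched (prediction, GT) box pairs of all frames.

    pred_boxes: [num_frames][num_queries][4] predicted boxes.
    targets:    per frame, a dict with a "boxes" list of GT boxes.
    indices:    per frame, a (pred_idx, tgt_idx) pair from the matcher.
    """
    total = 0.0
    num_matched = 0
    # Iterate over the first dimension: 2 or 5 frames depending on the dataset.
    for frame_boxes, frame_target, (pred_idx, tgt_idx) in zip(pred_boxes, targets, indices):
        # Use only the query/GT pairs selected by the matcher.
        for p, t in zip(pred_idx, tgt_idx):
            pred = frame_boxes[p]
            gt = frame_target["boxes"][t]
            total += sum(abs(a - b) for a, b in zip(pred, gt))
            num_matched += 1
    # Guard against frames (or whole batches) without any GT box.
    return total / max(num_matched, 1)


# Toy example: 2 frames, 3 queries per frame, one GT box per frame.
pred_boxes = [
    [[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1], [0.9, 0.9, 0.1, 0.1]],
    [[0.5, 0.5, 0.25, 0.25], [0.3, 0.3, 0.1, 0.1], [0.7, 0.7, 0.1, 0.1]],
]
targets = [{"boxes": [[0.5, 0.5, 0.25, 0.25]]}, {"boxes": [[0.5, 0.5, 0.25, 0.25]]}]
indices = [([0], [0]), ([0], [0])]
loss = matched_l1_box_loss(pred_boxes, targets, indices)
```

Because the loop runs over every frame in the first dimension and indexes predictions through indices, the predicted and GT sides always have the same number of entries, which is the usual source of the size mismatch.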