lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥
Apache License 2.0

Fix for Error Occurring When "use_focal_loss=False" in Postprocessor #429

Open dwchoo opened 1 month ago

dwchoo commented 1 month ago

Fix for Error Occurring When "use_focal_loss=False" in Postprocessor

Modification

scores = F.softmax(logits)[:, :, :-1] -> scores = F.softmax(logits, dim=-1)
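A quick way to see the difference on dummy logits (shapes assumed only for illustration): in recent PyTorch, F.softmax without a dim argument warns and falls back to an implicit dimension (dim 0 for a 3-D input), so with a batch of one image every score collapses to 1.0, which matches the degenerate pre-fix output shown further below.

import torch
import torch.nn.functional as F

logits = torch.randn(1, 300, 80)  # dummy [batch, queries, classes] logits

# Old expression: implicit softmax dim (deprecated; dim 0 for 3-D input) plus dropping the last class column
old = F.softmax(logits)[:, :, :-1]   # with batch size 1 every value is 1.0 -> shape [1, 300, 79]

# Fixed expression: softmax over the class dimension, keeping all classes
new = F.softmax(logits, dim=-1)      # rows sum to 1 -> shape [1, 300, 80]

print(old.unique())     # tensor([1.])
print(new.sum(dim=-1))  # ~1.0 per query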

Example code


# RT-DETR/rtdetr_pytorch
import os 
import sys

import argparse
import numpy as np 

import requests
from PIL import Image

from src.core import YAMLConfig

import torch
import torch.nn as nn 
import torchvision.transforms as transforms

url = 'http://images.cocodataset.org/val2017/000000039769.jpg' 
image = Image.open(requests.get(url, stream=True).raw)

# Define the transform to convert the image to a tensor
transform = transforms.Compose([
    transforms.Resize((640, 640)),  # Resize image to 640x640
    transforms.ToTensor(),  # Convert image to tensor [C, H, W]
])

# Apply the transform to convert the image to a tensor
image_tensor = transform(image)

# Add a batch dimension to convert to [B, C, H, W] tensor (B=1)
image_tensor = image_tensor.unsqueeze(0)

config = 'configs/rtdetr/rtdetr_r50vd_6x_coco.yml'
resume = 'rtdetr_r50vd_6x_coco_from_paddle.pth'

cfg = YAMLConfig(config, resume=resume)

if resume:
    checkpoint = torch.load(resume, map_location='cpu') 
    if 'ema' in checkpoint:
        state = checkpoint['ema']['module']
    else:
        state = checkpoint['model']
else:
    raise AttributeError('only support resume to load model.state_dict by now.')

# NOTE load train mode state -> convert to deploy mode
cfg.model.load_state_dict(state)

class Model(nn.Module):
    def __init__(self, ) -> None:
        super().__init__()
        self.model = cfg.model.deploy()
        self.postprocessor = cfg.postprocessor.deploy()
        print(self.postprocessor.deploy_mode)

    def forward(self, images, orig_target_sizes):
        outputs = self.model(images)
        return self.postprocessor(outputs, orig_target_sizes)

model = Model()

dynamic_axes = {
    'images': {0: 'N', },
    'orig_target_sizes': {0: 'N'}
}

#data = torch.rand(1, 3, 640, 640)
size = torch.tensor([[640, 640]])

# Change postprocessor option
model.postprocessor.use_focal_loss = False
print(f"Postprocessing use_focal_loss :{model.postprocessor.use_focal_loss}")

# Run inference
results = model(image_tensor, size)

print("Results")
print(results)
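As a small follow-up (not part of the original snippet), one might filter the returned tuple by a confidence threshold; the (labels, boxes, scores) ordering matches the outputs printed below, and the 0.5 threshold is arbitrary.

# Optional: keep only confident detections (threshold chosen arbitrarily)
labels, boxes, scores = results
keep = scores[0] > 0.5
print(labels[0][keep])
print(boxes[0][keep])
print(scores[0][keep])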

Results when use_focal_loss=True

Load PResNet50 state_dict
True
Postprocessing use_focal_loss :True
Results
(tensor([[57, 15, 15, 65, 65, 59, 56,  0,  0,  0, 16,  0, 57, 59, 67,  0, 65, 67,
         57, 57, 57, 60, 57, 57, 59,  0, 59, 57, 57, 16,  0, 67, 57, 57, 57, 59,
         13, 57, 57, 59, 57, 17,  0, 57,  0, 57,  0, 59, 65, 57, 57, 59, 57, 59,
         64, 45, 59, 57, 57, 59, 57, 59, 23, 15, 59, 59, 57, 15, 59, 59, 16, 59,
          0, 59, 15, 15, 57, 15, 57, 57, 57, 57, 24,  0, 58, 57, 51, 57, 59, 24,
         59, 66, 65, 15,  0, 59, 57, 59, 57,  0, 57, 15, 57, 65, 57, 59, 22, 57,
          0, 22, 59,  0, 57, 57, 15,  0, 50, 56, 57, 57, 56, 78, 65, 67, 65, 59,
         57, 57, 57, 71,  0, 57, 57, 59, 57, 17, 59, 59, 41, 57, 59,  0, 65, 41,
         59, 15, 57, 57, 57, 57, 59, 45, 57, 57, 53, 57, 62, 15, 59, 26, 59, 17,
         57, 53, 59, 78, 59, 68, 26, 56, 57, 57, 59, 16, 57, 59, 36,  0, 59, 59,
         59, 15, 57,  0,  0, 57, 24, 59, 59, 59, 16, 15, 57, 57, 59, 57, 73, 73,
          0, 16, 57,  0,  8, 59, 57, 66, 57, 57,  0, 28, 66, 58, 57, 41, 57, 15,
         15, 59,  8, 42, 15, 57, 59, 58,  0, 24,  0, 41,  0, 58,  0, 50, 57, 63,
          9,  0,  0, 56, 71, 57, 56, 56, 57, 14, 56, 59, 67, 57, 51,  8, 59, 63,
         59, 57, 60, 59,  0, 56, 15, 26, 59,  0,  0, 66,  0, 77,  0, 79, 65, 57,
         59,  0, 59, 59,  0, 59, 59, 79, 60, 15,  0, 57, 57, 31, 57, 24, 63,  0,
         60, 56, 57, 47, 46, 16,  0, 57, 57, 14, 56, 15]]), tensor([[[1.3779e-01, 5.0425e-01, 6.4013e+02, 6.3495e+02],
         [3.4338e+02, 3.2369e+01, 6.4014e+02, 4.9533e+02],
         [1.3225e+01, 7.2239e+01, 3.1898e+02, 6.2963e+02],
         ...,
         [3.4338e+02, 3.2369e+01, 6.4014e+02, 4.9533e+02],
         [1.4509e+00, 4.4137e+02, 6.4142e+02, 6.3468e+02],
         [1.1322e+01, 7.2261e+01, 3.1721e+02, 6.3119e+02]]],
       grad_fn=<GatherBackward0>), tensor([[0.9703, 0.9600, 0.9576, 0.9507, 0.9238, 0.2160, 0.1883, 0.0745, 0.0685,
         0.0555, 0.0529, 0.0465, 0.0430, 0.0424, 0.0407, 0.0405, 0.0397, 0.0395,
         0.0368, 0.0368, 0.0365, 0.0355, 0.0349, 0.0349, 0.0340, 0.0337, 0.0334,
         0.0333, 0.0333, 0.0333, 0.0328, 0.0324, 0.0320, 0.0318, 0.0310, 0.0308,
         0.0308, 0.0303, 0.0302, 0.0301, 0.0301, 0.0294, 0.0293, 0.0288, 0.0286,
         0.0286, 0.0281, 0.0281, 0.0273, 0.0273, 0.0271, 0.0267, 0.0263, 0.0263,
         0.0263, 0.0262, 0.0260, 0.0259, 0.0255, 0.0254, 0.0253, 0.0252, 0.0250,
         0.0249, 0.0245, 0.0242, 0.0240, 0.0239, 0.0237, 0.0235, 0.0235, 0.0234,
         0.0233, 0.0233, 0.0232, 0.0230, 0.0230, 0.0228, 0.0227, 0.0225, 0.0225,
         0.0221, 0.0220, 0.0219, 0.0219, 0.0218, 0.0216, 0.0216, 0.0216, 0.0216,
         0.0214, 0.0214, 0.0211, 0.0211, 0.0211, 0.0209, 0.0209, 0.0209, 0.0207,
         0.0206, 0.0204, 0.0204, 0.0204, 0.0202, 0.0200, 0.0199, 0.0197, 0.0196,
         0.0195, 0.0194, 0.0194, 0.0191, 0.0190, 0.0190, 0.0190, 0.0189, 0.0188,
         0.0188, 0.0187, 0.0187, 0.0187, 0.0187, 0.0187, 0.0187, 0.0187, 0.0186,
         0.0186, 0.0184, 0.0184, 0.0184, 0.0183, 0.0182, 0.0181, 0.0180, 0.0180,
         0.0180, 0.0179, 0.0179, 0.0178, 0.0178, 0.0175, 0.0175, 0.0174, 0.0171,
         0.0171, 0.0170, 0.0169, 0.0168, 0.0167, 0.0167, 0.0167, 0.0166, 0.0166,
         0.0166, 0.0166, 0.0166, 0.0166, 0.0166, 0.0165, 0.0165, 0.0164, 0.0164,
         0.0163, 0.0163, 0.0163, 0.0163, 0.0162, 0.0162, 0.0162, 0.0162, 0.0161,
         0.0161, 0.0159, 0.0159, 0.0156, 0.0156, 0.0156, 0.0155, 0.0155, 0.0155,
         0.0155, 0.0154, 0.0154, 0.0154, 0.0154, 0.0154, 0.0153, 0.0153, 0.0153,
         0.0152, 0.0152, 0.0151, 0.0151, 0.0151, 0.0150, 0.0150, 0.0149, 0.0149,
         0.0149, 0.0148, 0.0148, 0.0147, 0.0147, 0.0146, 0.0146, 0.0146, 0.0146,
         0.0145, 0.0145, 0.0145, 0.0145, 0.0145, 0.0145, 0.0143, 0.0143, 0.0142,
         0.0141, 0.0141, 0.0140, 0.0139, 0.0139, 0.0139, 0.0138, 0.0138, 0.0138,
         0.0137, 0.0137, 0.0137, 0.0137, 0.0136, 0.0136, 0.0136, 0.0136, 0.0135,
         0.0135, 0.0135, 0.0135, 0.0134, 0.0134, 0.0134, 0.0134, 0.0134, 0.0134,
         0.0133, 0.0133, 0.0133, 0.0133, 0.0133, 0.0132, 0.0132, 0.0132, 0.0131,
         0.0131, 0.0131, 0.0131, 0.0131, 0.0131, 0.0131, 0.0131, 0.0130, 0.0130,
         0.0130, 0.0130, 0.0130, 0.0129, 0.0129, 0.0129, 0.0128, 0.0128, 0.0128,
         0.0128, 0.0128, 0.0127, 0.0127, 0.0127, 0.0127, 0.0127, 0.0126, 0.0126,
         0.0126, 0.0126, 0.0125, 0.0125, 0.0125, 0.0124, 0.0124, 0.0123, 0.0123,
         0.0122, 0.0122, 0.0122, 0.0122, 0.0122, 0.0122, 0.0122, 0.0121, 0.0121,
         0.0121, 0.0121, 0.0121]], grad_fn=<TopkBackward0>))

Results when use_focal_loss=False, BEFORE the fix.

scores = F.softmax(logits)[:, :, :-1]

Load PResNet50 state_dict
True
Postprocessing use_focal_loss :False
Results
(tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), tensor([[[ 40.1145,  97.9214, 175.9573, 157.9796],
         [343.3813,  32.3691, 640.1404, 495.3277],
         [ 13.2251,  72.2391, 318.9842, 629.6276],
         ...,
         [358.3041,  32.8866, 640.3602, 278.9484],
         [  4.8246,  80.3851, 387.4113, 632.5491],
         [442.4472, 290.9601, 493.4579, 386.9547]]], grad_fn=<MulBackward0>), tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
       grad_fn=<MaxBackward0>))

Results when use_focal_loss=False, AFTER the fix.

scores = F.softmax(logits, dim=-1)

NOTE! Installing ujson may make loading annotations faster.
Load PResNet50 state_dict
True
Postprocessing use_focal_loss :False
Results
(tensor([[65, 15, 15, 65, 57, 57, 57, 57, 57, 16, 57, 57, 57, 59, 57, 57, 59, 57,
         57, 57, 57, 59, 59, 16, 59, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57,
         59, 57, 59, 57, 57, 59, 57, 57, 57, 57, 57, 57, 57, 57, 57, 59, 57, 65,
         57, 57, 65, 59, 57, 57, 59, 57,  0, 57, 57, 57, 59, 57, 59, 57, 57, 57,
         57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57,  0,  0, 67, 57,  0,
         57, 57, 57, 59, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 59, 57, 57, 57,
         59, 57, 57, 57, 57, 15, 57,  0, 78, 57, 57,  0, 57,  0, 59, 59, 65, 59,
         57, 57, 57, 57, 59, 57, 59, 57, 59, 57, 57, 57,  0, 57,  0, 57, 57, 57,
         57, 59, 57, 57, 15, 57, 59, 57, 57, 59, 57, 57, 59, 57, 57, 57, 59, 57,
         57,  0, 57, 57, 59, 57, 57, 15, 57, 57, 57, 57, 15, 57, 15, 57, 59,  0,
         57, 57, 57, 57, 57, 57, 59, 59, 57, 57, 65, 57, 57, 57, 65, 15, 57, 57,
         65, 57, 15, 59, 57, 65, 57, 57, 57, 57, 57, 57, 65, 15, 65, 15, 57, 57,
         57, 57, 57, 57, 59, 57, 59, 57, 57, 15, 57, 57, 65, 57, 15, 57, 57, 15,
         15, 57, 57, 15, 15, 67, 59, 15, 57, 65, 65, 57, 57, 65, 57,  0, 15, 57,
         57, 15, 57, 57, 57, 67, 15, 15, 15, 65, 57, 57,  0, 65, 65, 57, 57, 57,
         57,  0, 57, 15, 57, 65, 57, 15, 65, 65, 65, 57, 65, 65, 57, 65, 57, 15,
         57, 57, 57,  0, 57,  0, 15,  0, 57,  0, 57, 15]]), tensor([[[ 40.1145,  97.9214, 175.9573, 157.9796],
         [343.3813,  32.3691, 640.1404, 495.3277],
         [ 13.2251,  72.2391, 318.9842, 629.6276],
         ...,
         [358.3041,  32.8866, 640.3602, 278.9484],
         [  4.8246,  80.3851, 387.4113, 632.5491],
         [442.4472, 290.9601, 493.4579, 386.9547]]], grad_fn=<MulBackward0>), tensor([[0.9751, 0.9755, 0.9719, 0.9657, 0.0399, 0.0453, 0.0517, 0.0488, 0.0585,
         0.1281, 0.0532, 0.0393, 0.0407, 0.2706, 0.0345, 0.0577, 0.0629, 0.1411,
         0.0756, 0.0398, 0.0579, 0.0505, 0.1195, 0.1013, 0.0805, 0.0427, 0.0818,
         0.0345, 0.0832, 0.0823, 0.9806, 0.0546, 0.0323, 0.0569, 0.0329, 0.0868,
         0.0692, 0.0344, 0.0532, 0.1198, 0.1439, 0.0680, 0.0571, 0.0549, 0.0886,
         0.0409, 0.0451, 0.0767, 0.1207, 0.0400, 0.1068, 0.0725, 0.0344, 0.0580,
         0.0335, 0.1349, 0.0371, 0.0661, 0.0643, 0.1191, 0.0979, 0.0340, 0.0661,
         0.0367, 0.1099, 0.0395, 0.0569, 0.0400, 0.0902, 0.1397, 0.1382, 0.0386,
         0.0333, 0.0441, 0.0324, 0.0466, 0.0460, 0.0920, 0.0867, 0.0986, 0.0553,
         0.0724, 0.0441, 0.0615, 0.0688, 0.1670, 0.2506, 0.0833, 0.0847, 0.2086,
         0.0999, 0.1527, 0.0743, 0.0987, 0.0578, 0.0616, 0.0643, 0.1586, 0.0421,
         0.1319, 0.1549, 0.0325, 0.0698, 0.1513, 0.0581, 0.0543, 0.0656, 0.0321,
         0.1277, 0.0488, 0.1109, 0.0659, 0.0873, 0.0758, 0.1007, 0.1404, 0.0304,
         0.0332, 0.0905, 0.2719, 0.0612, 0.0486, 0.0424, 0.0368, 0.0664, 0.0393,
         0.0799, 0.0582, 0.0985, 0.0588, 0.0355, 0.0573, 0.1246, 0.0527, 0.0845,
         0.0427, 0.0336, 0.0351, 0.2501, 0.0322, 0.0756, 0.0756, 0.1015, 0.1412,
         0.0750, 0.1227, 0.0517, 0.1377, 0.0521, 0.0746, 0.1185, 0.0439, 0.0826,
         0.1170, 0.1387, 0.0351, 0.1364, 0.0614, 0.1105, 0.0486, 0.0728, 0.1107,
         0.0653, 0.1629, 0.1362, 0.0907, 0.1179, 0.0737, 0.0323, 0.1059, 0.0805,
         0.0325, 0.0349, 0.0390, 0.0697, 0.0731, 0.0584, 0.0363, 0.0524, 0.0719,
         0.0704, 0.1064, 0.0682, 0.1501, 0.1280, 0.0616, 0.0998, 0.0776, 0.0785,
         0.0458, 0.0498, 0.0353, 0.0917, 0.0324, 0.0726, 0.0374, 0.0518, 0.0429,
         0.0530, 0.1130, 0.0962, 0.0721, 0.0781, 0.0613, 0.0617, 0.0663, 0.1319,
         0.0809, 0.0473, 0.1022, 0.0531, 0.0437, 0.0999, 0.1108, 0.0412, 0.0436,
         0.0560, 0.0475, 0.0375, 0.0333, 0.1421, 0.0588, 0.1480, 0.0391, 0.1145,
         0.0611, 0.0748, 0.0696, 0.0432, 0.0347, 0.0729, 0.0725, 0.0452, 0.0749,
         0.0411, 0.0910, 0.0651, 0.1062, 0.0582, 0.1125, 0.0793, 0.0462, 0.0904,
         0.0958, 0.0360, 0.0879, 0.1399, 0.0637, 0.0378, 0.1588, 0.0576, 0.0706,
         0.0701, 0.0443, 0.0597, 0.0433, 0.0309, 0.1039, 0.0876, 0.0658, 0.0436,
         0.0749, 0.0713, 0.0396, 0.0887, 0.0345, 0.0694, 0.1152, 0.0600, 0.0564,
         0.0328, 0.1529, 0.0331, 0.0392, 0.0602, 0.0658, 0.0398, 0.1068, 0.0388,
         0.0585, 0.0524, 0.0526, 0.0554, 0.0656, 0.1538, 0.0502, 0.0790, 0.0371,
         0.0398, 0.0624, 0.0498, 0.0538, 0.1293, 0.0448, 0.0482, 0.0703, 0.0828,
         0.0764, 0.0698, 0.0303]], grad_fn=<MaxBackward0>))
lyuwenyu commented 1 month ago

You cannot set use_focal_loss=False only in the postprocessor, because use_focal_loss and num_classes are global variables shared across the decoder, matcher, loss, and postprocessor in this version of the codebase.

If you want it to work, you should add a background class, but we do not recommend doing this directly.

If you want to conduct ablation experiments, I suggest configuring these parameters separately in each module.

dwchoo commented 1 month ago

@lyuwenyu Thank you for your clarification. I strongly agree that the RT-DETR model doesn't require a background class, which aligns with its original design philosophy.

From my understanding, the postprocessor isn't directly involved in training; it only interprets the model's output. I believe there's merit in using a softmax function to determine the final class probabilities, and filtering objects by a confidence threshold seems like a sound approach.

Given that the results don't differ significantly with my proposed method, would it be agreeable to modify the code to allow both the sigmoid and softmax functions? This would preserve the original behavior while giving users the flexibility to use softmax in certain scenarios. The modification could look something like this:

if use_sigmoid:
    scores = torch.sigmoid(out_logits)
    ...
else:
    scores = torch.nn.functional.softmax(out_logits, dim=-1)
    ...

This way, we preserve the original behavior while adding the option to use softmax without compromising the model's performance or design principles. What are your thoughts on this approach?
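For context, here is a rough, self-contained sketch of how the two scoring paths could sit side by side in the postprocessor. Variable and parameter names such as out_logits, out_bbox, and num_top_queries follow the snippet above and are assumptions; this illustrates the proposal rather than the repository's exact implementation.

import torch
import torch.nn.functional as F

def select_top_queries(out_logits, out_bbox, use_sigmoid=True, num_top_queries=300):
    # Illustrative scoring/selection step: sigmoid (focal-style) vs. softmax scoring
    if use_sigmoid:
        # Per-class sigmoid scores, top-k over the flattened (query, class) grid
        scores = torch.sigmoid(out_logits)
        num_classes = out_logits.shape[-1]
        scores, index = torch.topk(scores.flatten(1), num_top_queries, dim=-1)
        labels = index % num_classes
        query_index = (index // num_classes).unsqueeze(-1).expand(-1, -1, out_bbox.shape[-1])
        boxes = out_bbox.gather(1, query_index)
    else:
        # Softmax over all classes (the proposed change): one label per query, then top-k queries
        scores = F.softmax(out_logits, dim=-1)
        scores, labels = scores.max(dim=-1)
        boxes = out_bbox
        if scores.shape[1] > num_top_queries:
            scores, index = torch.topk(scores, num_top_queries, dim=-1)
            labels = labels.gather(1, index)
            boxes = boxes.gather(1, index.unsqueeze(-1).expand(-1, -1, boxes.shape[-1]))
    return labels, boxes, scores

# Toy usage with dummy tensors (shapes for illustration only)
labels, boxes, scores = select_top_queries(torch.randn(1, 300, 80), torch.rand(1, 300, 4), use_sigmoid=False)
print(labels.shape, boxes.shape, scores.shape)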

lyuwenyu commented 1 month ago

The modification could look something like this:

Renaming use_focal_loss to use_sigmoid in the postprocessor looks good.

dwchoo commented 1 month ago

@lyuwenyu Like this?

if use_sigmoid:
    scores = torch.sigmoid(out_logits)
    ...
else:
    scores = torch.nn.functional.softmax(out_logits, dim=-1)
    ...
lyuwenyu commented 1 month ago

Yes. But it may include the background class in the top-300 candidates, which can cause problems with COCO eval.

dwchoo commented 1 month ago

For the sake of code consistency and readability, I believe it would be beneficial to keep the use_focal_loss variable. I've noticed several places where the use_focal_loss name is used in conjunction with the sigmoid function (e.g., matcher.py). Similarly, there are places where softmax is used without the [:, :, :-1] slicing operation, which aligns with the method I proposed. To stay consistent with the areas where sigmoid is used, I suggest we keep the use_focal_loss variable name. This preserves the uniformity of the code while implementing the softmax change I suggested earlier.

dwchoo commented 1 month ago

Yes. But it may include the background class in the top-300 candidates, which can cause problems with COCO eval.

But scores = F.softmax(logits)[:, :, :-1] keeps only 79 classes, not 80; there is no 'background' class.

lyuwenyu commented 1 month ago

In the original postprocessor, we handle cases No.0 and No.2 by default (and we believe the No.1 case should not occur).

I think you want to add the No.1 case, but you would need to cover all three situations in the PR.

(image attachment)
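Reading the thread, the three situations appear to be: No.0, focal/sigmoid scoring (the current default); No.1, softmax scoring without a background class (the proposed addition); No.2, softmax scoring with a trailing background column that is dropped by the [:, :, :-1] slice. A hedged sketch of what covering all three might look like (names are assumptions, not the repository's code):

import torch
import torch.nn.functional as F

def score_queries(out_logits, use_focal_loss=True, has_background_class=False):
    # Hypothetical sketch covering the three situations discussed above
    if use_focal_loss:                        # No.0: per-class sigmoid, no background logit
        return torch.sigmoid(out_logits)
    if has_background_class:                  # No.2: softmax including a trailing background column, then drop it
        return F.softmax(out_logits, dim=-1)[:, :, :-1]
    return F.softmax(out_logits, dim=-1)      # No.1: softmax over foreground classes only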
dwchoo commented 1 month ago

Thank you for your kind explanation and clarification.

I appreciate you informing me about the original intention of covering cases No.0 and No.2. This insight is very helpful.

However, I noticed that in your current code, there doesn't seem to be any implementation for adding a background class. Even if the intention was to include case No.2, there appears to be no preprocessor code to perform this task. Given this, should we consider adding preprocessor code to handle case No.2?

From my analysis, the current code structure seems to align more closely with cases No.0 and No.1.

Could you please provide some clarification on how you envision handling case No.2 within the current framework? This would help ensure that our implementation accurately reflects the intended functionality of the model.

Thank you again for your patience and guidance throughout this process. I look forward to your thoughts on this matter.

dwchoo commented 1 month ago

@lyuwenyu After carefully reviewing and considering the code, I have a suggestion I'd like to propose: From my analysis, it appears that the current postprocessor is only considering the No.0 case, where use_focal_loss=True. Given this observation, what are your thoughts on removing the use_focal_loss parameter from the postprocessor altogether, and having it calculate only for the case where use_focal_loss=True? This approach would simplify the postprocessor and align it more closely with its current functionality. I'm interested to hear your perspective on this potential modification.

lyuwenyu commented 1 month ago

If you modify the name use_focal_loss, existing model configs will no longer be compatible. I suggest not making any changes in this codebase for now; if you have specific requirements, you can fork the repo and customize it to your needs.


This approach would simplify the postprocessor and align it more closely with its current functionality.

As you said, the exact meaning is not expressed here, but you can explain it through code comments.

dwchoo commented 1 month ago

@lyuwenyu, thank you for your reply! I agree that making significant changes to the config isn't ideal, as you pointed out. How about we consider removing or disabling the use_focal_loss parameter only in the postprocessor, which wouldn't affect the model training or the overall code structure? Alternatively, we could disable the use of softmax in this specific part.

RT-DETR has garnered significant attention among object detection models and is being used in the Huggingface/transformers package. However, the same postprocessor issue is reflected there. I submitted a PR to address it, but since it mirrors the original code (transformers' post-processing), it's currently under discussion with a member and a contributor (link).

As you mentioned earlier, there's been extensive debate about whether a background class (or void class) is necessary. We haven't reached a conclusion yet, and we're eagerly awaiting your input on this matter.

Your insights would be invaluable in resolving this issue and improving the model's implementation across different platforms. We appreciate your time and expertise in guiding us through this process.