Fix Inconsistency Between Training and Inference Caused by Multi-Scale Input Handling for Rectangular Inputs.

Problem: Inconsistency issues identified between the training and inference phases related to how multi-scale inputs are handled.

Details: During training, if the input image's height does not equal its width, our multi-scale process adjusts the image to be square. However, during inference, the inputs are maintained as rectangles. This inconsistency has led to a substantial degradation in model performance, with a decrease in mean Average Precision (mAP) by approximately 20% on our company's datasets.

Modifications:

Multi-Scale Feature Interpolation: During training with multi-scale enabled, input images will be resized to match sz, whichever is larger in width or height.
HybridEncoder Adjustments: In the HybridEncoder, bilinear interpolation is used for better feature alignment since the scale change is not an integer.

lyuwenyu / RT-DETR

Fix Inconsistency Between Training and Inference Caused by Multi-Scale Input Handling for Rectangular Inputs. #422