Hi, thanks for the wonderful work.
I am curious about why SeqTR is so memory-efficient. As shown in the config file, SeqTR is trained with a batch size of 128 on a single 32GB GPU! However, for object detectors like DETR, the batch size on each GPU is quite limited. Could you please give some insights about this? Thanks in advance.
Image Resolution: For visual grounding (VG), the image resolution ranges from 416 x 416 to 640 x 640. For object detection (OD), the resolution is generally 1333 x 800 or larger.
Model Architecture: DarkNet-53 (from YOLOv3) is used as the backbone in VG, which compares favorably with the ResNet backbones used in OD; see the figure on the ImageNet classification task.
The hidden dimension of the transformer in SeqTR is set to 256 and the FFN dimension to 1024, whereas those of DETR are 512 and 2048, respectively. In addition, we use only 3 decoder layers, compared with 6 in DETR.
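To put rough numbers on the points above, here is a minimal back-of-the-envelope sketch. It uses only the figures quoted in this reply. The helper names (`pixel_count`, `layer_weights`) are made up for illustration and are not part of the SeqTR or DETR code. The weight estimate assumes a standard decoder layer (self-attention, cross-attention, two-layer FFN) and ignores biases and LayerNorm, so treat the results as a ballpark only, not as measured GPU memory.

```python
# Rough comparison using the numbers quoted above; not a memory measurement.

def pixel_count(width, height):
    """Number of input pixels; backbone activations grow roughly with this."""
    return width * height

def layer_weights(d_model, d_ffn):
    """Approximate weight count of one standard decoder layer.

    Assumption: self-attention and cross-attention each have four
    d_model x d_model projections (Q, K, V, output), and the FFN has two
    d_model x d_ffn matrices. Biases and LayerNorm are ignored.
    """
    attention = 2 * 4 * d_model * d_model
    ffn = 2 * d_model * d_ffn
    return attention + ffn

# Input resolution: visual grounding (VG) vs. object detection (OD).
vg_small = pixel_count(416, 416)
vg_large = pixel_count(640, 640)
od = pixel_count(1333, 800)
print(f"OD / VG pixels: {od / vg_large:.2f}x to {od / vg_small:.2f}x")

# Decoder size: 3 layers at 256/1024 vs. 6 layers at 512/2048 (as quoted above).
seqtr_decoder = 3 * layer_weights(256, 1024)
detr_decoder = 6 * layer_weights(512, 2048)
print(f"decoder weights: {seqtr_decoder:,} vs. {detr_decoder:,} "
      f"({detr_decoder / seqtr_decoder:.1f}x)")
```

Under these assumptions, an OD input carries roughly 2.6 to 6.2 times as many pixels as a VG input, and the decoder described for DETR has about 8 times as many weights as the SeqTR one. Both effects push in the same direction, which is consistent with the much larger per-GPU batch size.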