seanzhuh / SeqTR

SeqTR: A Simple yet Universal Network for Visual Grounding
https://arxiv.org/abs/2203.16265

Memory and BatchSize #9

MasterBin-IIAU closed this issue 2 years ago

MasterBin-IIAU commented 2 years ago

Hi, thanks for the wonderful work. I am curious about why SeqTR is so memory-efficient. As shown in the config file, SeqTR is trained with a batch size of 128 on a single 32GB GPU! However, for object detectors like DETR, the batch size on each GPU is quite limited. Could you please give some insights about this? Thanks in advance.

seanzhuh commented 2 years ago
  1. Image resolution: for visual grounding (VG), the input resolution ranges from 416 x 416 to 640 x 640. For object detection (OD), it is generally 1333 x 800 or larger.
  2. Model architecture: VG uses DarkNet-53 (from YOLOv3) as the backbone, which compares favorably with the ResNet used in OD; see the figure on the ImageNet classification task. [Figure: DarkNet]
  3. Transformer size: the hidden dimension of the transformer in SeqTR is 256 and the FFN dimension is 1024, while those of DETR are 512 and 2048, respectively. In addition, we use only 3 decoder layers compared to 6 in DETR.
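
To make points 1 and 3 concrete, here is a rough back-of-envelope sketch in Python. It is not taken from the SeqTR or DETR code. It assumes a single feature map at stride 32 (an assumption for illustration), counts only the attention and FFN weight matrices of one transformer layer, and uses the dimensions exactly as quoted in the reply above. The helper names `num_tokens` and `layer_params` are hypothetical.

```python
import math


def num_tokens(height, width, stride=32):
    """Number of feature-map positions fed to the transformer.

    Assumes one feature map downsampled by `stride` (illustrative assumption).
    """
    return math.ceil(height / stride) * math.ceil(width / stride)


def layer_params(d_model, d_ffn):
    """Approximate weight count of one transformer layer.

    Counts the four attention projections (Q, K, V, output) and the two
    FFN matrices; biases and layer norms are ignored.
    """
    return 4 * d_model * d_model + 2 * d_model * d_ffn


# Point 1: input resolution (largest VG setting vs. the typical OD setting).
vg_tokens = num_tokens(640, 640)    # 20 * 20 positions
od_tokens = num_tokens(800, 1333)   # 25 * 42 positions

# Self-attention builds a tokens x tokens matrix per head, so its memory
# grows with the square of the token count.
attn_ratio = od_tokens ** 2 / vg_tokens ** 2

# Point 3: layer width, using the figures quoted in the reply
# (256 / 1024 for SeqTR, 512 / 2048 for DETR; not verified here).
seqtr_layer = layer_params(256, 1024)
detr_layer = layer_params(512, 2048)

print(f"tokens: VG={vg_tokens}, OD={od_tokens}")
print(f"attention-matrix size ratio (OD / VG): {attn_ratio:.1f}x")
print(f"weights per layer: SeqTR={seqtr_layer}, DETR={detr_layer}")
print(f"per-layer weight ratio (DETR / SeqTR): {detr_layer / seqtr_layer:.1f}x")
```

Under these assumptions the OD input yields about 2.6 times as many tokens and roughly 7 times as many attention-matrix entries, and each layer at the quoted DETR width has about 4 times the weights. Actual memory use also depends on the backbone activations, optimizer state, and mixed precision, which this sketch ignores.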
MasterBin-IIAU commented 2 years ago

Thanks for your reply. That makes sense :)