microsoft / scene_graph_benchmark

image scene graph generation benchmark
MIT License

Slow feature extraction compared to bottom-up-attention #8

Closed vinson2233 closed 3 years ago

vinson2233 commented 3 years ago

Hi, thanks for the great work and open-sourcing this project.

I'm excited to try VinVL since, as written in the paper, it promises faster feature extraction than bottom-up-attention.

I have created my own TSV file using tsv_demo.py and ran tools/test_sg_net.py to do feature extraction. Sadly, the feature extraction runs quite slowly: about 9 seconds per 4 images. I'm using PyTorch 1.7 on Debian 10 with one NVIDIA T4.

I used bottom-up-attention from https://github.com/airsplay/py-bottom-up-attention and https://github.com/peteanderson80/bottom-up-attention while using OSCAR on the same dataset. Those repos give much faster feature extraction on a similar machine (the first needs 2.7 seconds per 8 images, while the original Caffe bottom-up took less than 1 second per image). This contradicts what is written in your paper.

Here are some key config values I'm using while running tools/test_sg_net.py:

TEST:
    IMS_PER_BATCH: 4
    IGNORE_BOX_REGRESSION: True
    SKIP_PERFORMANCE_EVAL: True
    SAVE_PREDICTIONS: True
    SAVE_RESULTS_TO_TSV: True
    TSV_SAVE_SUBSET: ['rect', 'class', 'conf', 'feature']
    GATHER_ON_CPU: True
    OUTPUT_FEATURE: True
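
For reference, the same overrides can be applied programmatically through the yacs config API that maskrcnn_benchmark uses. A minimal sketch (the config-file path is just an example; adjust to your setup):

# Minimal sketch: applying the TEST overrides above via the yacs cfg
# object used by maskrcnn_benchmark. The config path is an example.
from maskrcnn_benchmark.config import cfg

cfg.merge_from_file("sgg_configs/vgattr/vinvl_x152c4.yaml")
cfg.merge_from_list([
    "TEST.IMS_PER_BATCH", 4,
    "TEST.IGNORE_BOX_REGRESSION", True,
    "TEST.SKIP_PERFORMANCE_EVAL", True,
    "TEST.SAVE_PREDICTIONS", True,
    "TEST.SAVE_RESULTS_TO_TSV", True,
    "TEST.TSV_SAVE_SUBSET", ["rect", "class", "conf", "feature"],
    "TEST.GATHER_ON_CPU", True,
    "TEST.OUTPUT_FEATURE", True,
])
cfg.freeze()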

I checked nvidia-smi and it shows my GPU is working.

Is anyone else having this issue?

vinson2233 commented 3 years ago

Here is the more detailed information printed at the beginning of test_sg_net.py:

 DATA_DIR "/home/jupyter/preprocessed_dataset/gambar_tokped"
2021-04-14 11:03:51,401 maskrcnn_benchmark INFO: Using 1 GPUs
2021-04-14 11:03:51,401 maskrcnn_benchmark INFO: AMP_VERBOSE: False
DATALOADER:
  ASPECT_RATIO_GROUPING: True
  NUM_WORKERS: 0
  SIZE_DIVISIBILITY: 0
DATASETS:
  FACTORY_TEST: ('ODTSVDataset',)
  FACTORY_TRAIN: ()
  LABELMAP_FILE: /home/jupyter/scene_graph_benchmark/models/vinvl/VG-SGG-dicts-vgoi6-clipped.json
  TEST: ('/home/jupyter/preprocessed_dataset/moderated_content/config.yaml',)
  TRAIN: ()
DATA_DIR: /home/jupyter/preprocessed_dataset/gambar_tokped
DISTRIBUTED_BACKEND: gloo
DTYPE: float32
INPUT:
  BRIGHTNESS: 0.0
  CONTRAST: 0.0
  HORIZONTAL_FLIP_PROB_TRAIN: 0.5
  HUE: 0.0
  MAX_SIZE_TEST: 1000
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MIN_SIZE_TRAIN: (800,)
  PIXEL_MEAN: [103.53, 116.28, 123.675]
  PIXEL_STD: [1.0, 1.0, 1.0]
  SATURATION: 0.0
  TO_BGR255: True
  VERTICAL_FLIP_PROB_TRAIN: 0.0
MODEL:
  ATTRIBUTE_ON: True
  BACKBONE:
    CONV_BODY: R-152-C4
    FREEZE_CONV_BODY_AT: 2
  CLS_AGNOSTIC_BBOX_REG: False
  DEVICE: cuda
  FBNET:
    ARCH: default
    ARCH_DEF: 
    BN_TYPE: bn
    DET_HEAD_BLOCKS: []
    DET_HEAD_LAST_SCALE: 1.0
    DET_HEAD_STRIDE: 0
    DW_CONV_SKIP_BN: True
    DW_CONV_SKIP_RELU: True
    KPTS_HEAD_BLOCKS: []
    KPTS_HEAD_LAST_SCALE: 0.0
    KPTS_HEAD_STRIDE: 0
    MASK_HEAD_BLOCKS: []
    MASK_HEAD_LAST_SCALE: 0.0
    MASK_HEAD_STRIDE: 0
    RPN_BN_TYPE: 
    RPN_HEAD_BLOCKS: 0
    SCALE_FACTOR: 1.0
    WIDTH_DIVISOR: 1
  FPN:
    USE_GN: False
    USE_RELU: False
  FREQ_PRIOR: visualgenome/label_danfeiX_clipped.freq_prior.npy
  GROUP_NORM:
    DIM_PER_GP: -1
    EPSILON: 1e-05
    NUM_GROUPS: 32
  KEYPOINT_ON: False
  MASK_ON: False
  META_ARCHITECTURE: AttrRCNN
  RELATION_ON: False
  RESNETS:
    BACKBONE_OUT_CHANNELS: 1024
    DEFORMABLE_GROUPS: 1
    NUM_GROUPS: 32
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STAGE_WITH_DCN: (False, False, False, False)
    STEM_FUNC: StemWithFixedBatchNorm
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    TRANS_FUNC: BottleneckWithFixedBatchNorm
    WIDTH_PER_GROUP: 8
    WITH_MODULATED_DCN: False
  RETINANET:
    ANCHOR_SIZES: (32, 64, 128, 256, 512)
    ANCHOR_STRIDES: (8, 16, 32, 64, 128)
    ASPECT_RATIOS: (0.5, 1.0, 2.0)
    BBOX_REG_BETA: 0.11
    BBOX_REG_WEIGHT: 4.0
    BG_IOU_THRESHOLD: 0.4
    FG_IOU_THRESHOLD: 0.5
    INFERENCE_TH: 0.05
    LOSS_ALPHA: 0.25
    LOSS_GAMMA: 2.0
    NMS_TH: 0.4
    NUM_CLASSES: 81
    NUM_CONVS: 4
    OCTAVE: 2.0
    PRE_NMS_TOP_N: 1000
    PRIOR_PROB: 0.01
    SCALES_PER_OCTAVE: 3
    STRADDLE_THRESH: 0
    USE_C5: True
  RETINANET_ON: False
  ROI_ATTRIBUTE_HEAD:
    ATTR_EMD_DIM: 512
    CLS_EMD_DIM: 256
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    LOSS_WEIGHT: 0.5
    MAX_NUM_ATTR_PER_IMG: 100
    MAX_NUM_ATTR_PER_OBJ: 16
    MLP_HEAD_DIM: 1024
    NUM_ATTRIBUTES: 525
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    POSTPROCESS_ATTRIBUTES_THRESHOLD: 0.05
    PREDICTOR: AttributeRCNNPredictor
    SHARE_BOX_FEATURE_EXTRACTOR: True
  ROI_BOX_HEAD:
    CONV_HEAD_DIM: 256
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    FORCE_BOXES: False
    MLP_HEAD_DIM: 1024
    NUM_CLASSES: 1595
    NUM_STACKED_CONVS: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    PREDICTOR: FastRCNNPredictor
    USE_GN: False
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 384
    BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0)
    BG_IOU_THRESHOLD: 0.5
    DETECTIONS_PER_IMG: 100
    FG_IOU_THRESHOLD: 0.5
    MIN_DETECTIONS_PER_IMG: 10
    NMS: 0.5
    NMS_FILTER: 1
    POSITIVE_FRACTION: 0.5
    SCORE_THRESH: 0.2
    USE_FPN: False
  ROI_KEYPOINT_HEAD:
    CONV_LAYERS: (512, 512, 512, 512, 512, 512, 512, 512)
    FEATURE_EXTRACTOR: KeypointRCNNFeatureExtractor
    MLP_HEAD_DIM: 1024
    NUM_CLASSES: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    PREDICTOR: KeypointRCNNPredictor
    RESOLUTION: 14
    SHARE_BOX_FEATURE_EXTRACTOR: True
  ROI_MASK_HEAD:
    CONV_LAYERS: (256, 256, 256, 256)
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    MLP_HEAD_DIM: 1024
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    POSTPROCESS_MASKS: False
    POSTPROCESS_MASKS_THRESHOLD: 0.5
    PREDICTOR: MaskRCNNC4Predictor
    RESOLUTION: 14
    SHARE_BOX_FEATURE_EXTRACTOR: True
    USE_GN: False
  ROI_RELATION_HEAD:
    ALGORITHM: sg_baseline
    BACKBONE_FREEZE_PARAMETER: True
    BATCH_SIZE_PER_IMAGE: 512
    CONCATENATE_PROPOSAL_GT: False
    CONTRASTIVE_LOSS:
      BG_THRESH_HI: 0.5
      BG_THRESH_LO: 0.0
      FG_REL_FRACTION: 0.25
      FG_REL_SIZE_PER_IM: 512
      FG_THRESH: 0.5
      NODE_CONTRASTIVE_MARGIN: 0.2
      NODE_CONTRASTIVE_P_AWARE_MARGIN: 0.2
      NODE_CONTRASTIVE_P_AWARE_WEIGHT: 0.1
      NODE_CONTRASTIVE_SO_AWARE_MARGIN: 0.2
      NODE_CONTRASTIVE_SO_AWARE_WEIGHT: 0.5
      NODE_CONTRASTIVE_WEIGHT: 1.0
      NODE_SAMPLE_SIZE: 128
      USE_BG: True
      USE_FLAG: False
      USE_FREQ_BIAS: True
      USE_NODE_CONTRASTIVE_LOSS: True
      USE_NODE_CONTRASTIVE_P_AWARE_LOSS: True
      USE_NODE_CONTRASTIVE_SO_AWARE_LOSS: True
      USE_SPATIAL_FEAT: False
      USE_SPO_AGNOSTIC_COMPENSATION: False
    CONV_HEAD_DIM: 256
    DETECTOR_BOX_THRESHOLD: 0.0
    DETECTOR_PRE_CALCULATED: False
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIRelationFeatureExtractor
    FILTER_NON_OVERLAP: True
    FORCE_RELATIONS: False
    GRCNN_FEATURE_UPDATE_STEP: 0
    GRCNN_SCORE_UPDATE_STEP: 0
    IMP_FEATURE_UPDATE_STEP: 0
    MLP_HEAD_DIM: 1024
    MODE: sgdet
    MSDN_FEATURE_UPDATE_STEP: 0
    NEURAL_MOTIF:
      DEBUG: False
      DROPOUT: 0.0
      EDGE_LSTM_NUM_LAYERS: 4
      EMBED_DIM: 100
      GLOVE_PATH: glove/
      HIDDEN_DIM: 256
      NUM_OBJS: 64
      OBJ_CLASSES_FN: visualgenome/label_danfeiX_clipped.obj_classes.txt
      OBJ_FEAT_TO_DECODER: False
      OBJ_FEAT_TO_EDGE: False
      OBJ_LSTM_NUM_LAYERS: 2
      ORDER: confidence
      POS_BATCHNORM_MOMENTUM: 0.001
      POS_EMBED_DIM: 128
      REL_CLASSES_FN: visualgenome/label_danfeiX_clipped.rel_classes.txt
      USE_TANH: False
    NUM_CLASSES: 51
    NUM_STACKED_CONVS: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    POSITIVE_FRACTION: 0.25
    POSTPROCESS_METHOD: constrained
    POSTPROCESS_SCORE_THRESH: 1e-05
    POST_RELPN_PREPOSALS: 512
    PREDICTOR: FastRCNNRelationPredictor
    ROI_BOX_HEAD_FREEZE_PARAMETER: True
    RPN_FREEZE_PARAMETER: True
    SEPERATE_SO_FEATURE_EXTRACTOR: False
    SHARE_BOX_FEATURE_EXTRACTOR: True
    SHARE_CONV_BACKBONE: True
    TRIPLETS_PER_IMG: 100
    UPDATE_BOX_REG: False
    USE_BIAS: False
    USE_GN: False
    USE_ONLINE_OBJ_LABELS: False
    USE_RELPN: False
  RPN:
    ANCHOR_SIZES: (32, 64, 128, 256, 512)
    ANCHOR_STRIDE: (16,)
    ASPECT_RATIOS: (0.5, 1.0, 2.0)
    BATCH_SIZE_PER_IMAGE: 256
    BG_IOU_THRESHOLD: 0.3
    FG_IOU_THRESHOLD: 0.7
    FPN_POST_NMS_PER_BATCH: True
    FPN_POST_NMS_TOP_N_TEST: 2000
    FPN_POST_NMS_TOP_N_TRAIN: 2000
    MIN_SIZE: 0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOP_N_TEST: 300
    POST_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 6000
    PRE_NMS_TOP_N_TRAIN: 12000
    RPN_HEAD: SingleConvRPNHead
    STRADDLE_THRESH: 0
    USE_FPN: False
  RPN_ONLY: False
  USE_FREQ_PRIOR: False
  WEIGHT: models/vinvl/vinvl_vg_x152c4.pth
OUTPUT_DIR: ./output/X152C5_test
PATHS_CATALOG: /home/jupyter/scene_graph_benchmark/maskrcnn_benchmark/config/paths_catalog.py
SOLVER:
  BASE_LR: 0.01
  BIAS_LR_FACTOR: 2
  CHECKPOINT_PERIOD: 10000
  GAMMA: 0.1
  IMS_PER_BATCH: 1
  MAX_ITER: 90000
  MOMENTUM: 0.9
  STEPS: (49000, 65000)
  TEST_PERIOD: 0
  WARMUP_FACTOR: 0.3333333333333333
  WARMUP_ITERS: 500
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0
TEST:
  BBOX_AUG:
    ENABLED: False
    H_FLIP: False
    MAX_SIZE: 4000
    SCALES: ()
    SCALE_H_FLIP: False
  DETECTIONS_PER_IMG: 100
  EXPECTED_RESULTS: []
  EXPECTED_RESULTS_SIGMA_TOL: 4
  GATHER_ON_CPU: True
  IGNORE_BOX_REGRESSION: True
  IMS_PER_BATCH: 4
  OUTPUT_ATTRIBUTE_FEATURE: False
  OUTPUT_FEATURE: True
  OUTPUT_RELATION_FEATURE: False
  SAVE_PREDICTIONS: True
  SAVE_RESULTS_TO_TSV: True
  SKIP_PERFORMANCE_EVAL: True
  TSV_SAVE_SUBSET: ['rect', 'class', 'conf', 'feature']
2021-04-14 11:03:51,403 maskrcnn_benchmark INFO: Collecting env info (might take some time)
2021-04-14 11:03:54,161 maskrcnn_benchmark INFO: 
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: version 3.13.4

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.7.1+cu110
[pip3] torchaudio==0.7.2
[pip3] torchvision==0.8.2+cu110
[conda] _pytorch_select           0.1                       cpu_0  
[conda] blas                      1.0                         mkl    conda-forge
[conda] cudatoolkit               11.0.3               h15472ef_8    conda-forge
[conda] libblas                   3.8.0                    21_mkl    conda-forge
[conda] libcblas                  3.8.0                    21_mkl    conda-forge
[conda] liblapack                 3.8.0                    21_mkl    conda-forge
[conda] libmklml                  2019.0.5                      0  
[conda] mkl                       2020.2                      256  
[conda] numpy                     1.19.5                   pypi_0    pypi
[conda] pytorch                   1.7.0           py3.7_cuda11.0.221_cudnn8.0.3_0    pytorch
[conda] torch                     1.7.1+cu110              pypi_0    pypi
[conda] torchaudio                0.7.2                    pypi_0    pypi
[conda] torchvision               0.8.2+cu110              pypi_0    pypi
pzzhang commented 3 years ago

@vinson2233, thank you for providing your feedback.

The default feature extraction config above (MODEL.ROI_HEADS.NMS_FILTER = 1) does not use class-agnostic NMS for inference. If you set MODEL.ROI_HEADS.NMS_FILTER = 2, you will see faster inference. Note that in this case you need to re-finetune the models on the downstream tasks to get no performance drop compared with the original class-aware NMS. All currently released features were extracted with MODEL.ROI_HEADS.NMS_FILTER = 1, and the released models are based on those features. We have a VQA model finetuned on MODEL.ROI_HEADS.NMS_FILTER = 2 features (not released yet), and we verified that there is no VQA performance drop when using them.
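
To illustrate the difference (a generic sketch using torchvision ops, not the exact NMS_FILTER code path in this repo): class-aware NMS only lets a box suppress boxes of the same class, while class-agnostic NMS makes a single pass over all boxes, which is cheaper with ~1.6k object classes.

# Generic illustration of class-aware vs. class-agnostic NMS with
# torchvision ops; not the NMS_FILTER implementation in this repo.
import torch
from torchvision.ops import nms, batched_nms

boxes = torch.rand(1000, 4) * 500         # dummy proposals
boxes[:, 2:] += boxes[:, :2]              # make (x1, y1, x2, y2) valid
scores = torch.rand(1000)
labels = torch.randint(0, 1594, (1000,))  # ~1.6k classes, as in VinVL

# Class-aware NMS (NMS_FILTER = 1 behavior): per-class suppression.
keep_aware = batched_nms(boxes, scores, labels, iou_threshold=0.5)

# Class-agnostic NMS (NMS_FILTER = 2 behavior): one pass over all boxes.
keep_agnostic = nms(boxes, scores, iou_threshold=0.5)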

Here is our benchmark on one Titan-X GPU and on a CPU with a single thread; you can also check the discussion in Appendix G of our VinVL paper: https://arxiv.org/pdf/2101.00529.pdf

[benchmark table image]

From our experience on Titan-X, P40, P100 and V100, the VinVL X152-C4 model is slightly faster than the bottom-up-top-down R101-C4 model during inference, and about 2 times faster during training. Different GPU types and CUDA versions can give very different performance. You may want to turn on torch.backends.cudnn.benchmark (and ignore the first few examples) to see the performance on your GPU.
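
For example, a minimal timing sketch (model and batch are placeholders for the detector and a preprocessed image batch on the GPU):

# Enable cudnn autotuning and time per-batch latency with proper CUDA
# synchronization; `model` and `batch` are placeholders.
import time
import torch

torch.backends.cudnn.benchmark = True   # autotune conv algorithms per input shape

@torch.no_grad()
def time_inference(model, batch, warmup=5, iters=20):
    model.eval()
    for _ in range(warmup):             # warm-up: cudnn selects algorithms here
        model(batch)
    torch.cuda.synchronize()            # drain queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()            # wait for all kernels to finish
    return (time.perf_counter() - start) / iters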

Our claim that "These two replacements make the region feature extraction process much faster than that in [2] without any accuracy drop on VL downstream tasks" assumes the same backbone, i.e., R101-C4 with dilation=1 in the head and class-agnostic NMS. Since the X152-C4 backbone is larger than R101-C4, there is a considerable increase in backbone time complexity; see the comparison of Grid-273 feature extraction between R101-C4 (Vision) and X152-C4 (Vision) in the table above. Even so, overall the VinVL X152-C4 model remains slightly faster than the bottom-up-top-down R101-C4 model at inference and about 2 times faster at training.

pzzhang commented 3 years ago

By the way, if you are in an inference-speed-critical scenario, the large X152-C4 model is indeed not suitable. For such cases, you may want to consider lighter models, such as our miniVLM (https://arxiv.org/pdf/2012.06946.pdf), and/or distill the large VinVL model's knowledge into a small model, as in this knowledge distillation work from my collaborators: https://arxiv.org/pdf/2104.02096.pdf
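
The basic distillation idea is to train a small student to match the large teacher's softened outputs. A generic logit-distillation sketch (Hinton-style; not the exact recipe of the paper above):

# Generic logit distillation: soften teacher/student logits with a
# temperature, penalize their KL divergence, mix in the hard-label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                         # rescale so gradients match the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard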

vinson2233 commented 3 years ago

You are right; I just realized the bottom-up-attention repos I mentioned are using R101-C4. Sorry for the not-apples-to-apples comparison. I have tried torch.backends.cudnn.benchmark and feature extraction seems to become slower (though sometimes it becomes faster). Maybe it's because my images come in different sizes, so the autotuner keeps re-benchmarking for each new shape (see the padding sketch below). I will take a look at the miniVLM and distillation work you mentioned.
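
If the varying sizes are indeed the cause, padding every image to one fixed resolution should let the tuned kernels be reused. A rough sketch (1000 just mirrors my MAX_SIZE_TEST):

# Rough sketch: pad variable-size CxHxW image tensors to one fixed
# resolution so cudnn.benchmark can reuse its tuned kernels.
import torch
import torch.nn.functional as F

def pad_to_fixed(image, height=1000, width=1000):
    pad_h = height - image.shape[-2]    # assumes H <= height after resizing
    pad_w = width - image.shape[-1]     # assumes W <= width after resizing
    # F.pad order for the last two dims: (left, right, top, bottom)
    return F.pad(image, (0, pad_w, 0, pad_h))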

Thanks for the clear answer and for giving me alternatives. :)

vinson2233 commented 3 years ago

Last question from me on this thread: it seems that neither the miniVLM nor the knowledge distillation code has been released to the public yet. Do you know when they will be released?

pzzhang commented 3 years ago

There is no clear timeline for releasing the miniVLM and knowledge distillation code.

vinson2233 commented 3 years ago

Alright, thank you.