Closed vinson2233 closed 3 years ago
More detailed information printed at the beginning of test_sg_net.py
DATA_DIR "/home/jupyter/preprocessed_dataset/gambar_tokped"
2021-04-14 11:03:51,401 maskrcnn_benchmark INFO: Using 1 GPUs
2021-04-14 11:03:51,401 maskrcnn_benchmark INFO: AMP_VERBOSE: False
DATALOADER:
  ASPECT_RATIO_GROUPING: True
  NUM_WORKERS: 0
  SIZE_DIVISIBILITY: 0
DATASETS:
  FACTORY_TEST: ('ODTSVDataset',)
  FACTORY_TRAIN: ()
  LABELMAP_FILE: /home/jupyter/scene_graph_benchmark/models/vinvl/VG-SGG-dicts-vgoi6-clipped.json
  TEST: ('/home/jupyter/preprocessed_dataset/moderated_content/config.yaml',)
  TRAIN: ()
DATA_DIR: /home/jupyter/preprocessed_dataset/gambar_tokped
DISTRIBUTED_BACKEND: gloo
DTYPE: float32
INPUT:
  BRIGHTNESS: 0.0
  CONTRAST: 0.0
  HORIZONTAL_FLIP_PROB_TRAIN: 0.5
  HUE: 0.0
  MAX_SIZE_TEST: 1000
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MIN_SIZE_TRAIN: (800,)
  PIXEL_MEAN: [103.53, 116.28, 123.675]
  PIXEL_STD: [1.0, 1.0, 1.0]
  SATURATION: 0.0
  TO_BGR255: True
  VERTICAL_FLIP_PROB_TRAIN: 0.0
MODEL:
  ATTRIBUTE_ON: True
  BACKBONE:
    CONV_BODY: R-152-C4
    FREEZE_CONV_BODY_AT: 2
  CLS_AGNOSTIC_BBOX_REG: False
  DEVICE: cuda
  FBNET:
    ARCH: default
    ARCH_DEF:
    BN_TYPE: bn
    DET_HEAD_BLOCKS: []
    DET_HEAD_LAST_SCALE: 1.0
    DET_HEAD_STRIDE: 0
    DW_CONV_SKIP_BN: True
    DW_CONV_SKIP_RELU: True
    KPTS_HEAD_BLOCKS: []
    KPTS_HEAD_LAST_SCALE: 0.0
    KPTS_HEAD_STRIDE: 0
    MASK_HEAD_BLOCKS: []
    MASK_HEAD_LAST_SCALE: 0.0
    MASK_HEAD_STRIDE: 0
    RPN_BN_TYPE:
    RPN_HEAD_BLOCKS: 0
    SCALE_FACTOR: 1.0
    WIDTH_DIVISOR: 1
  FPN:
    USE_GN: False
    USE_RELU: False
  FREQ_PRIOR: visualgenome/label_danfeiX_clipped.freq_prior.npy
  GROUP_NORM:
    DIM_PER_GP: -1
    EPSILON: 1e-05
    NUM_GROUPS: 32
  KEYPOINT_ON: False
  MASK_ON: False
  META_ARCHITECTURE: AttrRCNN
  RELATION_ON: False
  RESNETS:
    BACKBONE_OUT_CHANNELS: 1024
    DEFORMABLE_GROUPS: 1
    NUM_GROUPS: 32
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STAGE_WITH_DCN: (False, False, False, False)
    STEM_FUNC: StemWithFixedBatchNorm
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    TRANS_FUNC: BottleneckWithFixedBatchNorm
    WIDTH_PER_GROUP: 8
    WITH_MODULATED_DCN: False
  RETINANET:
    ANCHOR_SIZES: (32, 64, 128, 256, 512)
    ANCHOR_STRIDES: (8, 16, 32, 64, 128)
    ASPECT_RATIOS: (0.5, 1.0, 2.0)
    BBOX_REG_BETA: 0.11
    BBOX_REG_WEIGHT: 4.0
    BG_IOU_THRESHOLD: 0.4
    FG_IOU_THRESHOLD: 0.5
    INFERENCE_TH: 0.05
    LOSS_ALPHA: 0.25
    LOSS_GAMMA: 2.0
    NMS_TH: 0.4
    NUM_CLASSES: 81
    NUM_CONVS: 4
    OCTAVE: 2.0
    PRE_NMS_TOP_N: 1000
    PRIOR_PROB: 0.01
    SCALES_PER_OCTAVE: 3
    STRADDLE_THRESH: 0
    USE_C5: True
  RETINANET_ON: False
  ROI_ATTRIBUTE_HEAD:
    ATTR_EMD_DIM: 512
    CLS_EMD_DIM: 256
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    LOSS_WEIGHT: 0.5
    MAX_NUM_ATTR_PER_IMG: 100
    MAX_NUM_ATTR_PER_OBJ: 16
    MLP_HEAD_DIM: 1024
    NUM_ATTRIBUTES: 525
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    POSTPROCESS_ATTRIBUTES_THRESHOLD: 0.05
    PREDICTOR: AttributeRCNNPredictor
    SHARE_BOX_FEATURE_EXTRACTOR: True
  ROI_BOX_HEAD:
    CONV_HEAD_DIM: 256
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    FORCE_BOXES: False
    MLP_HEAD_DIM: 1024
    NUM_CLASSES: 1595
    NUM_STACKED_CONVS: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    PREDICTOR: FastRCNNPredictor
    USE_GN: False
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 384
    BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0)
    BG_IOU_THRESHOLD: 0.5
    DETECTIONS_PER_IMG: 100
    FG_IOU_THRESHOLD: 0.5
    MIN_DETECTIONS_PER_IMG: 10
    NMS: 0.5
    NMS_FILTER: 1
    POSITIVE_FRACTION: 0.5
    SCORE_THRESH: 0.2
    USE_FPN: False
  ROI_KEYPOINT_HEAD:
    CONV_LAYERS: (512, 512, 512, 512, 512, 512, 512, 512)
    FEATURE_EXTRACTOR: KeypointRCNNFeatureExtractor
    MLP_HEAD_DIM: 1024
    NUM_CLASSES: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    PREDICTOR: KeypointRCNNPredictor
    RESOLUTION: 14
    SHARE_BOX_FEATURE_EXTRACTOR: True
  ROI_MASK_HEAD:
    CONV_LAYERS: (256, 256, 256, 256)
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    MLP_HEAD_DIM: 1024
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    POSTPROCESS_MASKS: False
    POSTPROCESS_MASKS_THRESHOLD: 0.5
    PREDICTOR: MaskRCNNC4Predictor
    RESOLUTION: 14
    SHARE_BOX_FEATURE_EXTRACTOR: True
    USE_GN: False
  ROI_RELATION_HEAD:
    ALGORITHM: sg_baseline
    BACKBONE_FREEZE_PARAMETER: True
    BATCH_SIZE_PER_IMAGE: 512
    CONCATENATE_PROPOSAL_GT: False
    CONTRASTIVE_LOSS:
      BG_THRESH_HI: 0.5
      BG_THRESH_LO: 0.0
      FG_REL_FRACTION: 0.25
      FG_REL_SIZE_PER_IM: 512
      FG_THRESH: 0.5
      NODE_CONTRASTIVE_MARGIN: 0.2
      NODE_CONTRASTIVE_P_AWARE_MARGIN: 0.2
      NODE_CONTRASTIVE_P_AWARE_WEIGHT: 0.1
      NODE_CONTRASTIVE_SO_AWARE_MARGIN: 0.2
      NODE_CONTRASTIVE_SO_AWARE_WEIGHT: 0.5
      NODE_CONTRASTIVE_WEIGHT: 1.0
      NODE_SAMPLE_SIZE: 128
      USE_BG: True
      USE_FLAG: False
      USE_FREQ_BIAS: True
      USE_NODE_CONTRASTIVE_LOSS: True
      USE_NODE_CONTRASTIVE_P_AWARE_LOSS: True
      USE_NODE_CONTRASTIVE_SO_AWARE_LOSS: True
      USE_SPATIAL_FEAT: False
      USE_SPO_AGNOSTIC_COMPENSATION: False
    CONV_HEAD_DIM: 256
    DETECTOR_BOX_THRESHOLD: 0.0
    DETECTOR_PRE_CALCULATED: False
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIRelationFeatureExtractor
    FILTER_NON_OVERLAP: True
    FORCE_RELATIONS: False
    GRCNN_FEATURE_UPDATE_STEP: 0
    GRCNN_SCORE_UPDATE_STEP: 0
    IMP_FEATURE_UPDATE_STEP: 0
    MLP_HEAD_DIM: 1024
    MODE: sgdet
    MSDN_FEATURE_UPDATE_STEP: 0
    NEURAL_MOTIF:
      DEBUG: False
      DROPOUT: 0.0
      EDGE_LSTM_NUM_LAYERS: 4
      EMBED_DIM: 100
      GLOVE_PATH: glove/
      HIDDEN_DIM: 256
      NUM_OBJS: 64
      OBJ_CLASSES_FN: visualgenome/label_danfeiX_clipped.obj_classes.txt
      OBJ_FEAT_TO_DECODER: False
      OBJ_FEAT_TO_EDGE: False
      OBJ_LSTM_NUM_LAYERS: 2
      ORDER: confidence
      POS_BATCHNORM_MOMENTUM: 0.001
      POS_EMBED_DIM: 128
      REL_CLASSES_FN: visualgenome/label_danfeiX_clipped.rel_classes.txt
      USE_TANH: False
    NUM_CLASSES: 51
    NUM_STACKED_CONVS: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    POSITIVE_FRACTION: 0.25
    POSTPROCESS_METHOD: constrained
    POSTPROCESS_SCORE_THRESH: 1e-05
    POST_RELPN_PREPOSALS: 512
    PREDICTOR: FastRCNNRelationPredictor
    ROI_BOX_HEAD_FREEZE_PARAMETER: True
    RPN_FREEZE_PARAMETER: True
    SEPERATE_SO_FEATURE_EXTRACTOR: False
    SHARE_BOX_FEATURE_EXTRACTOR: True
    SHARE_CONV_BACKBONE: True
    TRIPLETS_PER_IMG: 100
    UPDATE_BOX_REG: False
    USE_BIAS: False
    USE_GN: False
    USE_ONLINE_OBJ_LABELS: False
    USE_RELPN: False
  RPN:
    ANCHOR_SIZES: (32, 64, 128, 256, 512)
    ANCHOR_STRIDE: (16,)
    ASPECT_RATIOS: (0.5, 1.0, 2.0)
    BATCH_SIZE_PER_IMAGE: 256
    BG_IOU_THRESHOLD: 0.3
    FG_IOU_THRESHOLD: 0.7
    FPN_POST_NMS_PER_BATCH: True
    FPN_POST_NMS_TOP_N_TEST: 2000
    FPN_POST_NMS_TOP_N_TRAIN: 2000
    MIN_SIZE: 0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOP_N_TEST: 300
    POST_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 6000
    PRE_NMS_TOP_N_TRAIN: 12000
    RPN_HEAD: SingleConvRPNHead
    STRADDLE_THRESH: 0
    USE_FPN: False
  RPN_ONLY: False
  USE_FREQ_PRIOR: False
  WEIGHT: models/vinvl/vinvl_vg_x152c4.pth
OUTPUT_DIR: ./output/X152C5_test
PATHS_CATALOG: /home/jupyter/scene_graph_benchmark/maskrcnn_benchmark/config/paths_catalog.py
SOLVER:
  BASE_LR: 0.01
  BIAS_LR_FACTOR: 2
  CHECKPOINT_PERIOD: 10000
  GAMMA: 0.1
  IMS_PER_BATCH: 1
  MAX_ITER: 90000
  MOMENTUM: 0.9
  STEPS: (49000, 65000)
  TEST_PERIOD: 0
  WARMUP_FACTOR: 0.3333333333333333
  WARMUP_ITERS: 500
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0
TEST:
  BBOX_AUG:
    ENABLED: False
    H_FLIP: False
    MAX_SIZE: 4000
    SCALES: ()
    SCALE_H_FLIP: False
  DETECTIONS_PER_IMG: 100
  EXPECTED_RESULTS: []
  EXPECTED_RESULTS_SIGMA_TOL: 4
  GATHER_ON_CPU: True
  IGNORE_BOX_REGRESSION: True
  IMS_PER_BATCH: 4
  OUTPUT_ATTRIBUTE_FEATURE: False
  OUTPUT_FEATURE: True
  OUTPUT_RELATION_FEATURE: False
  SAVE_PREDICTIONS: True
  SAVE_RESULTS_TO_TSV: True
  SKIP_PERFORMANCE_EVAL: True
  TSV_SAVE_SUBSET: ['rect', 'class', 'conf', 'feature']
2021-04-14 11:03:51,403 maskrcnn_benchmark INFO: Collecting env info (might take some time)
2021-04-14 11:03:54,161 maskrcnn_benchmark INFO:
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: version 3.13.4
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.7.1+cu110
[pip3] torchaudio==0.7.2
[pip3] torchvision==0.8.2+cu110
[conda] _pytorch_select 0.1 cpu_0
[conda] blas 1.0 mkl conda-forge
[conda] cudatoolkit 11.0.3 h15472ef_8 conda-forge
[conda] libblas 3.8.0 21_mkl conda-forge
[conda] libcblas 3.8.0 21_mkl conda-forge
[conda] liblapack 3.8.0 21_mkl conda-forge
[conda] libmklml 2019.0.5 0
[conda] mkl 2020.2 256
[conda] numpy 1.19.5 pypi_0 pypi
[conda] pytorch 1.7.0 py3.7_cuda11.0.221_cudnn8.0.3_0 pytorch
[conda] torch 1.7.1+cu110 pypi_0 pypi
[conda] torchaudio 0.7.2 pypi_0 pypi
[conda] torchvision 0.8.2+cu110 pypi_0 pypi
@vinson2233, thank you for providing your feedback.
The default feature extraction config above (MODEL.ROI_HEADS.NMS_FILTER = 1) does not use class-agnostic NMS for inference. If you set MODEL.ROI_HEADS.NMS_FILTER = 2, you will see faster inference. PS: in that case, you need to re-finetune the models on the downstream tasks to "get no performance drop" compared with the original class-aware NMS. All currently released features are extracted with MODEL.ROI_HEADS.NMS_FILTER = 1, and the released models are based on these features. We have some VQA finetuned models with MODEL.ROI_HEADS.NMS_FILTER = 2, which are not released yet. We verified that there is no VQA performance drop when using features extracted with MODEL.ROI_HEADS.NMS_FILTER = 2.
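To illustrate the general difference between the two NMS variants mentioned above, here is a minimal pure-Python sketch. It is not the repository's implementation: the helper names (`iou`, `nms`, `class_aware_nms`, `class_agnostic_nms`) and the toy boxes are made up for illustration, and it assumes that MODEL.ROI_HEADS.NMS_FILTER = 2 corresponds to the class-agnostic variant.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], thresh: float) -> List[int]:
    """Greedy NMS: keep a box unless it overlaps an already kept, higher-scoring box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep


def class_aware_nms(boxes, scores, labels, thresh) -> List[int]:
    """One NMS pass per class; boxes of different classes never suppress each other."""
    keep: List[int] = []
    for c in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        kept = nms([boxes[i] for i in idx], [scores[i] for i in idx], thresh)
        keep.extend(idx[k] for k in kept)
    return sorted(keep, key=lambda i: scores[i], reverse=True)


def class_agnostic_nms(boxes, scores, thresh) -> List[int]:
    """A single NMS pass over all boxes, ignoring class labels."""
    return nms(boxes, scores, thresh)


# Toy data (hypothetical): boxes 0 and 1 overlap heavily but have different labels.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
labels = [0, 1, 0]

print(class_aware_nms(boxes, scores, labels, 0.5))  # [0, 1, 2]
print(class_agnostic_nms(boxes, scores, 0.5))       # [0, 2]
```

In this sketch the class-aware variant runs one suppression pass per class, while the class-agnostic variant runs a single pass. With MODEL.ROI_BOX_HEAD.NUM_CLASSES = 1595 in the config above, that difference is one plausible reason the class-agnostic setting is cheaper, although the actual speedup depends on the implementation.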
Here is our benchmark on 1 Titan-X GPU and on CPU with a single thread. You can also check the discussions in Appendix G in our VinVL paper https://arxiv.org/pdf/2101.00529.pdf
In our experience on Titan-X, P40, P100, and V100 GPUs, the VinVL X152-C4 model is slightly faster than the bottom-up-top-down R101-C4 model during inference, and is about 2 times faster during training. Different types of GPUs and CUDA versions sometimes give very different performance. You may want to turn on torch.backends.cudnn.benchmark (and ignore the first few examples) to see the performance on your GPU.
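A minimal sketch of how such a measurement could be set up, discarding the first few calls as warm-up. The helper `time_per_call` and its parameters are hypothetical, and the PyTorch lines are left as comments (with placeholder `model` and `images` names) so the block runs without a GPU.

```python
import time
from typing import Callable, Optional


def time_per_call(
    fn: Callable[[], object],
    n_warmup: int = 5,
    n_timed: int = 20,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Average seconds per call of fn, ignoring the first n_warmup calls."""
    for _ in range(n_warmup):
        fn()  # warm-up calls are executed but not timed
    if sync is not None:
        sync()  # wait for queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(n_timed):
        fn()
    if sync is not None:
        sync()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / n_timed


# Stand-in CPU workload so the sketch is runnable as-is.
avg = time_per_call(lambda: sum(range(10_000)))
print(f"{avg * 1e3:.3f} ms per call")

# With PyTorch on a GPU, the same helper could be used roughly like this
# (model and images are placeholders, not names from the repository):
#   import torch
#   torch.backends.cudnn.benchmark = True
#   avg = time_per_call(lambda: model(images), sync=torch.cuda.synchronize)
```

The explicit synchronization matters on a GPU because CUDA kernels are launched asynchronously, so wall-clock timing without it can under-report the real cost.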
Our claim that "These two replacements make the region feature extraction process much faster than that in [2] without any accuracy drop on VL downstream tasks" is based on the same backbone, i.e., R101-C4 with dilation=1 in the head and class-agnostic NMS. Since the X152-C4 backbone is larger than the R101-C4 backbone, there is a considerable increase in time complexity in the backbone; see the comparison of Grid-273 feature extraction between R101-C4 (Vision) and X152-C4 (Vision) in the table above. Overall, the VinVL X152-C4 model is slightly faster than the bottom-up-top-down R101-C4 model during inference, and is about 2 times faster during training.
By the way, if you are in an inference-speed-critical scenario, the large X152-C4 model is indeed not suitable. For such cases, you may want to consider lighter models, such as our miniVLM https://arxiv.org/pdf/2012.06946.pdf, and/or distill the VinVL large model's knowledge into small models, as in this knowledge distillation work from my collaborators https://arxiv.org/pdf/2104.02096.pdf.
You are right, I just realized that the bottom-up-attention repos I mentioned use R101-C4. Sorry for the apples-to-oranges comparison. I tried torch.backends.cudnn.benchmark, and feature extraction seems to become slower (although sometimes it becomes faster). Maybe that's because my images come in different sizes. I will take a look at the miniVLM and distillation work that you mentioned.
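One way to check the image-size hypothesis is to count how many distinct input shapes the network would see, since cuDNN's autotuner picks algorithms per input configuration and varying shapes can trigger repeated tuning. The sketch below uses a simplified shorter-side resize rule with MIN_SIZE_TEST = 600 and MAX_SIZE_TEST = 1000 from the config above. The function `resized_shape` and the sample sizes are assumptions for illustration, not code from the repository, whose rounding may differ.

```python
from typing import Iterable, Set, Tuple


def resized_shape(
    width: int, height: int, min_size: int = 600, max_size: int = 1000
) -> Tuple[int, int]:
    """(height, width) after scaling the shorter side to min_size,
    capped so the longer side does not exceed max_size."""
    short, long_side = min(width, height), max(width, height)
    scale = min_size / short
    if long_side * scale > max_size:
        scale = max_size / long_side
    return (round(height * scale), round(width * scale))


def distinct_shapes(sizes: Iterable[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Set of resized (height, width) shapes for a list of (width, height) images."""
    return {resized_shape(w, h) for w, h in sizes}


# Hypothetical original image sizes as (width, height).
sizes = [(800, 600), (1600, 1200), (2000, 1000), (600, 800)]
shapes = distinct_shapes(sizes)
print(len(shapes), sorted(shapes))
```

If the number of distinct shapes is large relative to the number of images, benchmark mode may spend more time tuning than it saves.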
Thanks for the clear answer and for giving me alternatives. :)
One last question from me on this thread: it seems that neither the miniVLM nor the Knowledge Distillation code has been released to the public yet. Do you know when it will be released?
There is no clear timeline for releasing miniVLM and Knowledge Distillation code.
Alright, thank you.
Hi, thanks for the great work and for open-sourcing this project.
I'm excited to try VinVL since, as written in the paper, it promises faster feature extraction than bottom-up-attention.
![image](https://user-images.githubusercontent.com/33550590/114685196-dbba8f00-9d3b-11eb-96dc-c90ded53954b.png)
I created my own TSV file using tsv_demo.py and ran tools/test_sg_net.py to do feature extraction. Unfortunately, the feature extraction runs quite slowly. I'm using PyTorch 1.7 on Debian 10 with one Nvidia T4, and feature extraction takes 9 seconds per 4 images. While using OSCAR on the same dataset, I used bottom-up-attention from https://github.com/airsplay/py-bottom-up-attention and https://github.com/peteanderson80/bottom-up-attention. These repos give much faster feature extraction on a similar machine (the first repo needs 2.7 seconds per 8 images, while the original Caffe bottom-up-attention takes less than 1 second per image). This contradicts what is written in your paper.
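To put the timings above on a common per-image basis, here is a quick calculation that uses only the numbers quoted in this post (the helper name `seconds_per_image` is just for illustration):

```python
def seconds_per_image(total_seconds: float, num_images: int) -> float:
    """Average wall-clock seconds spent per image."""
    return total_seconds / num_images


vinvl = seconds_per_image(9.0, 4)   # VinVL X152-C4 via tools/test_sg_net.py
py_bua = seconds_per_image(2.7, 8)  # py-bottom-up-attention
slowdown = vinvl / py_bua

print(f"VinVL: {vinvl:.4f} s/image")                    # 2.2500 s/image
print(f"py-bottom-up-attention: {py_bua:.4f} s/image")  # 0.3375 s/image
print(f"ratio: {slowdown:.2f}x")                        # 6.67x
```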
Here's some of the key config that I'm using while running tools/test_sg_net.py
I checked nvidia-smi, and it shows that my GPU is working.
Is anyone else having this issue?