mlpc-ucsd / TESTR

(CVPR 2022) Text Spotting Transformers
Apache License 2.0

problem in reproduce ICDAR2015 #24

Open HotaekHan opened 1 year ago

HotaekHan commented 1 year ago

Hello, and thanks for your amazing work :)

I tried to reproduce the ICDAR 2015 results from the paper, but I can't reach the reported numbers even when starting from the pre-trained weights.

I didn't change any code: I downloaded the dataset and the pre-trained weights, then fine-tuned from the pre-trained checkpoint. However, the total loss stays above 30 and does not look like it is converging.
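For what it's worth, here is the minimal check I ran to confirm the checkpoint file itself is readable and its keys look sane (just a sketch; it assumes the usual detectron2 `.pth` layout where the weights sit under a `"model"` key):

```python
import torch

# Load the pre-trained checkpoint on CPU and peek at its contents.
ckpt = torch.load("weights/TESTR/pretrain_testr_R_50_polygon.pth", map_location="cpu")

# detectron2 checkpoints usually wrap the state dict under a "model" key;
# fall back to the raw object if this one is a bare state dict.
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"{len(state)} tensors in checkpoint")
for name in list(state)[:5]:  # first few parameter names and shapes
    print(name, tuple(state[name].shape))
```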

My training log is below.

[07/25 14:21:07] detectron2 INFO: Rank of current process: 0. World size: 8
[07/25 14:21:11] detectron2 INFO: Environment info:


sys.platform                     linux
Python                           3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
numpy                            1.23.4
detectron2                       0.6 @/usr/local/lib/python3.8/dist-packages/detectron2
Compiler                         GCC 9.4
CUDA compiler                    CUDA 11.3
detectron2 arch flags            8.6
DETECTRON2_ENV_MODULE
PyTorch                          1.12.1+cu113 @/usr/local/lib/python3.8/dist-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0,1,2,3,4,5,6,7              Tesla T4 (arch=7.5)
Driver version                   450.80.02
CUDA_HOME                        /usr/local/cuda
Pillow                           9.2.0
torchvision                      0.13.1+cu113 @/usr/local/lib/python3.8/dist-packages/torchvision
torchvision arch flags           3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.1.2


PyTorch built with:
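(Side note: detectron2 can regenerate this environment block directly, which makes comparing setups easier:)

```python
from detectron2.utils.collect_env import collect_env_info

# Prints the same environment table as above (platform, CUDA, torch, etc.).
print(collect_env_info())
```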

[07/25 14:21:11] detectron2 INFO: Command line arguments: Namespace(config_file='configs/TESTR/ICDAR15/TESTR_R_50_Polygon.yaml', dist_url='tcp://127.0.0.1:59588', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False)
[07/25 14:21:11] detectron2 INFO: Contents of args.config_file=configs/TESTR/ICDAR15/TESTR_R_50_Polygon.yaml:
_BASE_: "Base-ICDAR15-Polygon.yaml"
MODEL:
  WEIGHTS: "weights/TESTR/pretrain_testr_R_50_polygon.pth"
  RESNETS:
    DEPTH: 50
  TRANSFORMER:
    NUM_FEATURE_LEVELS: 4
    INFERENCE_TH_TEST: 0.3
    ENC_LAYERS: 6
    DEC_LAYERS: 6
    DIM_FEEDFORWARD: 1024
    HIDDEN_DIM: 256
    DROPOUT: 0.1
    NHEADS: 8
    NUM_QUERIES: 100
    ENC_N_POINTS: 4
    DEC_N_POINTS: 4
SOLVER:
  IMS_PER_BATCH: 8
  BASE_LR: 1e-5
  LR_BACKBONE: 1e-6
  WARMUP_ITERS: 0
  STEPS: (200000,)
  MAX_ITER: 200000
  CHECKPOINT_PERIOD: 10000
TEST:
  EVAL_PERIOD: 10000
OUTPUT_DIR: "output/TESTR/icdar15/TESTR_R_50_Polygon"
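For a quick sanity check that the merged config is what I expect, I load it the same way the training script does (a sketch assuming the AdelaiDet-style `adet.config.get_cfg` that this repo builds on):

```python
from adet.config import get_cfg  # TESTR extends AdelaiDet's config

cfg = get_cfg()
cfg.merge_from_file("configs/TESTR/ICDAR15/TESTR_R_50_Polygon.yaml")

# These should match the YAML above (after merging the _BASE_ file).
print(cfg.MODEL.WEIGHTS)         # weights/TESTR/pretrain_testr_R_50_polygon.pth
print(cfg.SOLVER.IMS_PER_BATCH)  # 8
print(cfg.SOLVER.BASE_LR)        # 1e-05
```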

[07/25 14:21:11] detectron2 INFO: Running with full config:
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

[07/25 14:21:11] detectron2 INFO: Full config saved to output/TESTR/icdar15/TESTR_R_50_Polygon/config.yaml
[07/25 14:21:11] d2.utils.env INFO: Using a generated random seed 11819301
[07/25 14:21:13] d2.engine.defaults INFO: Model: TransformerDetector(
  (testr): TESTR(
    (backbone): Joiner(
      (0): MaskedBackbone( (backbone): ResNet stem + res2/res3/res4/res5 BottleneckBlocks, all convs with FrozenBatchNorm2d )
      (1): PositionalEncoding2D()
    )
    (text_pos_embed): PositionalEncoding1D()
    (transformer): DeformableTransformer(
      (encoder): DeformableTransformerEncoder( 6 x DeformableTransformerEncoderLayer( MSDeformAttn(256), FFN 256->1024->256, Dropout(0.1), LayerNorm ) )
      (decoder): DeformableCompositeTransformerDecoder( 6 x DeformableCompositeTransformerDecoderLayer( MSDeformAttn cross-attention plus intra/inter MultiheadAttention for both the control-point and text branches, FFN 256->1024->256, Dropout(0.1), LayerNorm ) )
      (enc_output): Linear(256 -> 256) + LayerNorm
      (pos_trans): Linear(256 -> 256) + LayerNorm
      (bbox_class_embed): Linear(256 -> 1)
      (bbox_embed): MLP(256 -> 256 -> 256 -> 4)
    )
    (ctrl_point_class): 6 x Linear(256 -> 1)
    (ctrl_point_coord): 6 x MLP(256 -> 256 -> 256 -> 2)
    (bbox_coord): MLP(256 -> 256 -> 256 -> 4)
    (bbox_class): Linear(256 -> 1)
    (text_class): Linear(256 -> 97)
    (ctrl_point_embed): Embedding(16, 256)
    (text_embed): Embedding(25, 256)
    (input_proj): 4 x ( Conv2d(512/1024/2048/2048 -> 256) + GroupNorm(32, 256) )
  )
  (criterion): SetCriterion( (enc_matcher): BoxHungarianMatcher() (dec_matcher): CtrlPointHungarianMatcher() )
) ... (full layer-by-layer printout snipped)
[07/25 14:21:13] d2.data.dataset_mapper INFO: [DatasetMapper] Augmentations used in training: [RandomCrop(crop_type='relative_range', crop_size=[0.1, 0.1]), ResizeShortestEdge(short_edge_length=(800, 832, 864, 896, 1000, 1200, 1400), max_size=2333, sample_style='choice'), RandomFlip()]
[07/25 14:21:13] adet.data.dataset_mapper INFO: Rebuilding the augmentations. The previous augmentations will be overridden.
[07/25 14:21:13] adet.data.detection_utils INFO: Augmentations used in training: [ResizeShortestEdge(short_edge_length=(800, 832, 864, 896, 1000, 1200, 1400), max_size=2333, sample_style='choice')]
[07/25 14:21:13] adet.data.dataset_mapper INFO: Cropping used in training: RandomCropWithInstance(crop_type='relative_range', crop_size=[0.1, 0.1], crop_instance=False)
[07/25 14:21:13] adet.data.datasets.text INFO: Loaded 1000 images in COCO format from datasets/icdar2015/train_poly.json
[07/25 14:21:13] d2.data.build INFO: Removed 21 images with no usable annotations. 979 images left.
[07/25 14:21:13] d2.data.build INFO: Distribution of instances among all 1 categories:
| category | #instances |
|:--------:|:----------:|
|   text   | 4468       |


[07/25 14:21:13] d2.data.build INFO: Using training sampler TrainingSampler
[07/25 14:21:13] d2.data.common INFO: Serializing the dataset using: <class 'detectron2.data.common._TorchSerializedList'>
[07/25 14:21:13] d2.data.common INFO: Serializing 979 elements to byte tensors and concatenating them all ...
[07/25 14:21:13] d2.data.common INFO: Serialized dataset takes 1.64 MiB
[07/25 14:21:13] d2.checkpoint.detection_checkpoint INFO: [DetectionCheckpointer] Loading from weights/TESTR/pretrain_testr_R_50_polygon.pth ...
[07/25 14:21:13] fvcore.common.checkpoint INFO: [Checkpointer] Loading from weights/TESTR/pretrain_testr_R_50_polygon.pth ...
[07/25 14:21:14] adet.trainer INFO: Starting training from iteration 0
[07/25 17:20:06] d2.utils.events INFO: eta: 2 days, 13:01:22 iter: 9359 total_loss: 44.08 loss_ce: 0.783 loss_ctrl_points: 2.31 loss_texts: 3.764 loss_ce_0: 0.8143 loss_ctrl_points_0: 2.423 loss_texts_0: 3.801 loss_ce_1: 0.8142 loss_ctrl_points_1: 2.4 loss_texts_1: 3.759 loss_ce_2: 0.8032 loss_ctrl_points_2: 2.351 loss_texts_2: 3.756 loss_ce_3: 0.7866 loss_ctrl_points_3: 2.334 loss_texts_3: 3.758 loss_ce_4: 0.7786 loss_ctrl_points_4: 2.311 loss_texts_4: 3.77 loss_ce_enc: 0.8066 loss_bbox_enc: 0.3008 loss_giou_enc: 0.7569 time: 1.1431 last_time: 0.8115 data_time: 0.0088 last_data_time: 0.0066 lr: 1e-05 max_mem: 12183M
[07/25 17:20:28] d2.utils.events INFO: eta: 2 days, 13:02:11 iter: 9379 total_loss: 42.63 loss_ce: 0.7653 loss_ctrl_points: 2.407 loss_texts: 3.758 loss_ce_0: 0.8062 loss_ctrl_points_0: 2.635 loss_texts_0: 3.792 loss_ce_1: 0.7863 loss_ctrl_points_1: 2.568 loss_texts_1: 3.736 loss_ce_2: 0.7788 loss_ctrl_points_2: 2.537 loss_texts_2: 3.737 loss_ce_3: 0.77 loss_ctrl_points_3: 2.508 loss_texts_3: 3.748 loss_ce_4: 0.7641 loss_ctrl_points_4: 2.456 loss_texts_4: 3.748 loss_ce_enc: 0.7962 loss_bbox_enc: 0.2918 loss_giou_enc: 0.73 time: 1.1431 last_time: 0.9134 data_time: 0.0084 last_data_time: 0.0075 lr: 1e-05 max_mem: 12183M
[07/25 17:20:51] d2.utils.events INFO: eta: 2 days, 13:05:45 iter: 9399 total_loss: 44.09 loss_ce: 0.7944 loss_ctrl_points: 2.32 loss_texts: 3.633 loss_ce_0: 0.8154 loss_ctrl_points_0: 2.634 loss_texts_0: 3.668 loss_ce_1: 0.802 loss_ctrl_points_1: 2.506 loss_texts_1: 3.633 loss_ce_2: 0.8023 loss_ctrl_points_2: 2.369 loss_texts_2: 3.626 loss_ce_3: 0.7987 loss_ctrl_points_3: 2.281 loss_texts_3: 3.624 loss_ce_4: 0.7966 loss_ctrl_points_4: 2.309 loss_texts_4: 3.62 loss_ce_enc: 0.8003 loss_bbox_enc: 0.2937 loss_giou_enc: 0.7454 time: 1.1431 last_time: 1.1894 data_time: 0.0081 last_data_time: 0.0227 lr: 1e-05 max
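To track convergence more easily than grepping the console, I also read the `metrics.json` that detectron2 writes into the output directory (a minimal sketch; the path matches my OUTPUT_DIR above):

```python
import json

# detectron2's JSONWriter appends one JSON object per logging step.
path = "output/TESTR/icdar15/TESTR_R_50_Polygon/metrics.json"
with open(path) as f:
    records = [json.loads(line) for line in f if line.strip()]

# Print the last few readings to see whether total_loss is actually dropping.
for r in records[-5:]:
    if "total_loss" in r:
        print(r["iteration"], r["total_loss"], r.get("loss_texts"))
```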