hikopensource / DAVAR-Lab-OCR

OCR toolbox from Davar-Lab
Apache License 2.0

Problem with testing VSR #109

Open cs-yeung76 opened 2 years ago

cs-yeung76 commented 2 years ago

I was testing VSR on the two DocBank test images provided in the repository (78.tar_1604.08865.gz_main_0_ori.jpg and 133.tar_1607.04116.gz_nonlinear-xxx_6_ori.jpg), and encountered the following exception:

Exception has occurred: ValueError
too many values to unpack (expected 4)
  File "/my/path/to/DAVAR-Lab-OCR/davarocr/davar_layout/models/embedding/bertgrid_embedding.py", line 74, in forward
    batch_b, _, batch_h, batch_w = img.size()
  File "/my/path/to/DAVAR-Lab-OCR/davarocr/davar_layout/models/vsr/vsr.py", line 418, in simple_test
    bertgrid = self.bertgrid_embedding(img, gt_bboxes[0], gt_texts[0])
  File "/my/path/to/DAVAR-Lab-OCR/davarocr/davar_common/apis/test.py", line 55, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/my/path/to/DAVAR-Lab-OCR/tools/test.py", line 240, in main
    outputs = single_gpu_test(model, data_loader, args.show, args.show_dir,
  File "/my/path/to/DAVAR-Lab-OCR/tools/test.py", line 271, in <module>
    main()

It seems that this particular img has size [1, 1, 3, 800, 608], which I think is due to the img being wrongly nested in a list. gt_bboxes[0] seems to have the same problem. My workaround was to force img = img[0] and gt_bboxes = gt_bboxes[0], which removed the exception and let the test script run to completion. However, the output looks as follows: [attached result visualisations: 78.tar_1604.08865.gz_main_0_ori, 133.tar_1607.04116.gz_nonlinear-xxx_6_ori]
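Concretely, the hack amounted to stripping one level of nesting just before the embedding call shown in the traceback (a rough sketch of the workaround, not a proper fix):

    # Quick workaround sketch (inside simple_test, before the bertgrid call):
    # img arrives as a 5-D tensor [1, 1, 3, 800, 608] and gt_bboxes is doubly
    # nested, so drop one level from each before building the grid.
    img = img[0]              # [1, 1, 3, H, W] -> [1, 3, H, W]
    gt_bboxes = gt_bboxes[0]
    bertgrid = self.bertgrid_embedding(img, gt_bboxes[0], gt_texts[0])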

Is this behaviour expected? I expected a much cleaner visualisation, and I thought that, for example, the nested boxes in the figure of the 78.tar image would be resolved by your relation module. If it is not expected, what other information can I provide to help identify the issue?

Thank you for your excellent work, btw.

cs-yeung76 commented 2 years ago

Just a follow-up: the ValueError is resolved after a system reboot (I have no idea why - perhaps it is CUDA related, because PyTorch failed to detect CUDA before the reboot).

The visualisation issue persists, though - could the model be (mistakenly?) visualising content_ann rather than content_ann2? I looked into the JSON files in the demo/text_layout/datalist/DocBank/Annos folder, and the visualised content here matches the bboxes described in content_ann rather than the coarser content_ann2. If it helps, I am running the trained model downloaded per your instructions from https://one.hikvision.com/#/link/76YItjTJkFNFMC0VNEK9; the model is named docbank_x101-eb65a9b1.pth.

volcano1995 commented 2 years ago

Hi, thank you for your attention to our work. Indeed, our visualization is of content_ann. Since VSR uses multi-modal features, content_ann provides token-level annotations, which are used to extract the semantic features, while content_ann2 holds the layout-level ground-truth annotations. You can download the model from the link we provided and use it for training.
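Schematically, one image entry in the demo datalist separates the two levels roughly like this (dummy values, trimmed; see the JSON files under demo/text_layout/datalist for the full schema):

    # Rough shape of a datalist entry (illustrative, not a verbatim excerpt).
    entry = {
        "Images/sample_page.jpg": {
            "height": 1000,
            "width": 750,
            # token level: one box and one text string per token, used to build
            # the grid embedding / semantic features
            "content_ann": {
                "bboxes": [[10, 20, 60, 35], [65, 20, 120, 35]],
                "texts": ["An", "example"],
                "labels": [[0], [0]],
            },
            # layout level: one box per region, used as detection ground truth
            "content_ann2": {
                "bboxes": [[8, 15, 700, 120]],
                "labels": [[1]],   # e.g. a paragraph / table / figure category id
            },
        }
    }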

cs-yeung76 commented 2 years ago

Hi @volcano1995, thank you for your response! May I ask whether it is feasible to perform layout-level inference on the DocBank dataset? DocBank has more labels than PubLayNet, some of which are desirable in my research (e.g. captions). If the current model docbank_x101-eb65a9b1.pth downloaded from your link can only produce token-level visualisations, do you provide another version that infers layouts rather than tokens for the DocBank dataset?

If not, and we have to train the model ourselves, I wonder if you could offer some insights on how to do so, as I am still struggling to train such a model. I changed the docbank_x101.py config file according to publaynet_x101.py as follows:

Do these alterations suffice, or is there something else I have to change as well?

I did try to run the altered training script, which apparently did not work. Below are some of the warnings, excerpted:

2022-08-23 12:40:22,906 - davarocr - INFO - Set random seed to 42, deterministic: False
INFO:davarocr:Set random seed to 42, deterministic: False
fatal: not a git repository (or any of the parent directories): .git
2022-08-23 12:40:27,741 - davarocr - INFO - load checkpoint from /path/to/DAVAR-Lab-OCR/demo/text_layout/VSR/common/mask_rcnn_x101_64x4d_fpn_1x_coco_20200201-9352eb0d_with_semantic-0a3fbddb.pth
INFO:davarocr:load checkpoint from /path/to/DAVAR-Lab-OCR/demo/text_layout/VSR/common/mask_rcnn_x101_64x4d_fpn_1x_coco_20200201-9352eb0d_with_semantic-0a3fbddb.pth
2022-08-23 12:40:27,742 - davarocr - INFO - Use load_from_local loader
INFO:davarocr:Use load_from_local loader
2022-08-23 12:40:27,982 - davarocr - WARNING - The model and loaded state dict do not match exactly

size mismatch for backbone_semantic.conv1.weight: copying a param with shape torch.Size([64, 3, 7, 7]) from checkpoint, the shape in current model is torch.Size([64, 64, 7, 7]).
size mismatch for rpn_head.rpn_cls.weight: copying a param with shape torch.Size([3, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([7, 256, 1, 1]).
size mismatch for rpn_head.rpn_cls.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([7]).
size mismatch for rpn_head.rpn_reg.weight: copying a param with shape torch.Size([12, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([28, 256, 1, 1]).
size mismatch for rpn_head.rpn_reg.bias: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([28]).
unexpected key in source state_dict: neck_semantic.lateral_convs.0.conv.weight, neck_semantic.lateral_convs.0.conv.bias, ...
missing keys in source state_dict: bertgrid_embedding.embedding.weight, multimodal_feat_merge.alpha_proj.0.weight, ...

These warnings point to tensor shape mismatches in the backbone, neck and head.

Beyond the above, a KeyError is raised when loading the DocBank layout annotations:

File "/path/to/DAVAR-Lab-OCR/davarocr/davar_layout/datasets/pipelines/mm_layout_loading.py", line 212, in _load_polymasks
    ori_masks = results['ann_info_2']['segboxes']
KeyError: 'segboxes'

It seems that the PubLayNet dataset has an extra segboxes item in ann_info_2, encoding polygon segmentation information, which DocBank does not provide. Since DocBank lacks this information, I guess we have to omit the mask_roi layers (see the sketch below for how I skipped the mask loading) - would that cause a large accuracy loss, and if so, are there ways to compensate?
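To get past the loader, I effectively skipped the polygon masks whenever the key is absent, along these lines (a rough sketch of the hack in _load_polymasks, not a proper fix):

    # Stopgap sketch in _load_polymasks: DocBank's content_ann2 carries no
    # 'segboxes', so skip polygon-mask ground truth entirely in that case.
    ann2 = results['ann_info_2']
    if 'segboxes' not in ann2:
        return results        # no mask annotations available for this dataset
    ori_masks = ann2['segboxes']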

After omitting the segboxes input and the mask layers, the model seems to need an enormous amount of GPU memory to train. I am using a single 10 GB RTX 3080 and reduced samples_per_gpu to 1 (which should give a batch size of samples_per_gpu * gpu_num = 1 image, I hope), but I keep hitting CUDA out-of-memory errors in the ResNeXt semantic backbone. Would you be able to give an estimate of the GPU memory required for training?
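For reference, the memory-saving knobs I am experimenting with are the standard mmdetection-style config options (I have not verified that davarocr's trainer honours the fp16 hook, so treat this as a sketch):

    # Config-level options I am trying in order to fit training into 10 GB
    # (mmdet-style; values are guesses and fp16 support here is unverified).
    fp16 = dict(loss_scale=512.)                       # mixed-precision training
    data = dict(samples_per_gpu=1, workers_per_gpu=1)  # smallest possible batch
    # plus a smaller input resolution in the train pipeline, e.g.
    # dict(type='Resize', img_scale=(800, 608), keep_ratio=True)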

All in all, shedding some light on how to train a layout-level DocBank model would be much appreciated.

Also, as a side question: I notice that you used the BERTgrid embedding for DocBank rather than the CharGrid + SentGrid embedding described in your manuscript; why? Do they perform equally well?

UPDATE: I did not solve the above, but training could be started by changing the depth of the ResNeXt backbone from 101 to 50 (which will inevitably cost some accuracy, I presume). This brought up a new problem with evaluation: it seems that only the F1 and acc metrics are implemented for DocBank, whereas layout-level evaluation requires the bbox metric as used for PubLayNet. Is there an easy way to evaluate DocBank with the bbox metric, given that the PubLayNet implementation is based on COCO?
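For what it's worth, the route I am considering is to dump the layout ground truth and the detections into COCO format and call pycocotools directly (the file names below are placeholders for files I would build from content_ann2 and the model outputs):

    # Sketch: COCO-style bbox evaluation for DocBank layouts via pycocotools.
    # 'docbank_gt_coco.json' and 'docbank_dets.json' are hypothetical files
    # converted from the layout ground truth and the predictions ([x, y, w, h]).
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO('docbank_gt_coco.json')
    coco_dt = coco_gt.loadRes('docbank_dets.json')  # [{image_id, category_id, bbox, score}, ...]

    coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()                           # prints mAP, AP50, AP75, ...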