cs-yeung76 opened this issue 2 years ago
Just a follow-up: the ValueError problem is resolved after a system reboot (I have no idea why; perhaps it is CUDA-related, because PyTorch failed to detect CUDA before the reboot).
The visualisation issue persists though: could the model be (falsely?) recognising `content_ann` rather than `content_ann2`? I looked into the JSON files in the `demo/text_layout/datalist/DocBank/Annos` folder, and the visualised content seems to match the bboxes described in `content_ann` rather than the coarser `content_ann2`. If helpful, I am running the trained model downloaded per your instruction at https://one.hikvision.com/#/link/76YItjTJkFNFMC0VNEK9; the model is named `docbank_x101-eb65a9b1.pth`.
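For reference, this is roughly how I compared the two annotation levels in one of the Annos files (a minimal sketch; the file name is a placeholder and the `bboxes` key inside each block is an assumption on my part):

```python
import json

# Hypothetical inspection of one DocBank datalist entry: compare the number
# of boxes in the token-level block vs. the layout-level block.
with open("demo/text_layout/datalist/DocBank/Annos/some_page.json") as f:
    anno = json.load(f)

token_level = anno.get("content_ann", {})    # token-level annotations
layout_level = anno.get("content_ann2", {})  # layout-level annotations

print("content_ann  boxes:", len(token_level.get("bboxes", [])))
print("content_ann2 boxes:", len(layout_level.get("bboxes", [])))
```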
Hi, thank you for your attention to our work. Indeed, our visualization is for `content_ann`. Since VSR uses multi-modal features, the annotations in `content_ann` provide token-level annotations, which are used to extract semantic features, while `content_ann2` contains the layout-level ground-truth annotations. You can download the model from the link we provided for training.
Hi @volcano1995, thank you for your response! May I ask whether it is feasible to perform layout-level inference on the DocBank dataset? DocBank has more labels than PubLayNet, some of which are desirable in my research (e.g. captions). If the current model `docbank_x101-eb65a9b1.pth` downloaded from your link can only perform token-level visualisation, do you provide another version that infers layouts rather than tokens for the DocBank dataset?
If not, and we have to train the model ourselves, I wonder if you could offer some insights on how to do so, as I am still struggling to train such a model. I changed the `docbank_x101.py` config file according to `publaynet_x101.py` as follows (a rough sketch of the resulting config is shown after this list):

- `line_roi_extractor` and `line_gcn_head` are replaced with `rpn_head` and `roi_head` from `publaynet_x101.py`, with all `num_classes` changed to 13.
- `mask_roi_extractor` and `mask_head` are also copied from `publaynet_x101.py` into `docbank_x101.py`, with `num_classes=13`.
- `train_cfg` and `test_cfg` are mirrored from `publaynet_x101.py`.
- In `train_pipeline`, the `MMLALoadAnnotations` dict is populated with three items, `with_bbox_2`, `with_poly_mask_2` and `with_label_2`, all set to `True`; `'gt_bboxes_2', 'gt_labels_2', 'gt_masks_2'` are appended to the `DavarCollect` keys.
- The `avg_f1`/`F1-score` metrics are changed to `bbox_mAP`/`bbox`.

Do these alterations suffice, or is there something else I have to change as well?
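For illustration, a minimal sketch of how those edits could look in `docbank_x101.py` (head types and most field values here are generic mmdetection placeholders rather than the exact VSR definitions, so treat it as a sketch only):

```python
# Sketch only: the heads grafted from publaynet_x101.py, with num_classes
# switched to DocBank's 13 categories. Omitted fields are marked with "...".
model = dict(
    rpn_head=dict(
        type='RPNHead',          # copied from publaynet_x101.py
        in_channels=256,
        feat_channels=256,
        # anchor generator / loss settings as in publaynet_x101.py ...
    ),
    roi_head=dict(
        type='StandardRoIHead',  # generic mmdetection name, used as a placeholder
        bbox_head=dict(num_classes=13),   # 13 DocBank layout labels
        mask_head=dict(num_classes=13),
        # bbox_roi_extractor / mask_roi_extractor as in publaynet_x101.py ...
    ),
)

train_pipeline = [
    # ...
    dict(type='MMLALoadAnnotations',
         with_bbox=True, with_label=True,
         with_bbox_2=True, with_poly_mask_2=True, with_label_2=True),
    # ...
    dict(type='DavarCollect',
         keys=['img', 'gt_bboxes', 'gt_labels',
               'gt_bboxes_2', 'gt_labels_2', 'gt_masks_2']),
]

evaluation = dict(metric='bbox', save_best='bbox_mAP')
```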
I did try to run the altered training script, which apparently didn't work. Below are some of the warnings, excerpted:
```
2022-08-23 12:40:22,906 - davarocr - INFO - Set random seed to 42, deterministic: False
INFO:davarocr:Set random seed to 42, deterministic: False
fatal: not a git repository (or any of the parent directories): .git
2022-08-23 12:40:27,741 - davarocr - INFO - load checkpoint from /path/to/DAVAR-Lab-OCR/demo/text_layout/VSR/common/mask_rcnn_x101_64x4d_fpn_1x_coco_20200201-9352eb0d_with_semantic-0a3fbddb.pth
INFO:davarocr:load checkpoint from /path/to/DAVAR-Lab-OCR/demo/text_layout/VSR/common/mask_rcnn_x101_64x4d_fpn_1x_coco_20200201-9352eb0d_with_semantic-0a3fbddb.pth
2022-08-23 12:40:27,742 - davarocr - INFO - Use load_from_local loader
INFO:davarocr:Use load_from_local loader
2022-08-23 12:40:27,982 - davarocr - WARNING - The model and loaded state dict do not match exactly
size mismatch for backbone_semantic.conv1.weight: copying a param with shape torch.Size([64, 3, 7, 7]) from checkpoint, the shape in current model is torch.Size([64, 64, 7, 7]).
size mismatch for rpn_head.rpn_cls.weight: copying a param with shape torch.Size([3, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([7, 256, 1, 1]).
size mismatch for rpn_head.rpn_cls.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([7]).
size mismatch for rpn_head.rpn_reg.weight: copying a param with shape torch.Size([12, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([28, 256, 1, 1]).
size mismatch for rpn_head.rpn_reg.bias: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([28]).
unexpected key in source state_dict: neck_semantic.lateral_convs.0.conv.weight, neck_semantic.lateral_convs.0.conv.bias, ...
missing keys in source state_dict: bertgrid_embedding.embedding.weight, multimodal_feat_merge.alpha_proj.0.weight, ...
```
which entails tensor shape mismatches in the backbone, neck and heads.
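These are warnings rather than hard errors, but in case explicit control over the pretrained weights helps, here is a sketch of loading only the tensors that still match the modified model (plain PyTorch; how mmcv's own checkpoint loader handles this internally may differ):

```python
import torch


def load_matching_weights(model, ckpt_path):
    """Load only the checkpoint tensors whose names and shapes still match
    the (modified) model; everything else is skipped instead of warning."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state_dict = checkpoint.get("state_dict", checkpoint)

    model_state = model.state_dict()
    filtered = {
        k: v for k, v in state_dict.items()
        if k in model_state and v.shape == model_state[k].shape
    }
    model.load_state_dict(filtered, strict=False)
    print(f"kept {len(filtered)} / {len(state_dict)} checkpoint tensors")
```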
Beyond the above, a `KeyError` is raised when loading the DocBank layout annotations:

```
File "/path/to/DAVAR-Lab-OCR/davarocr/davar_layout/datasets/pipelines/mm_layout_loading.py", line 212, in _load_polymasks
    ori_masks = results['ann_info_2']['segboxes']
KeyError: 'segboxes'
```
It seems that the PubLayNet dataset has an extra `segboxes` item in `ann_info_2` that DocBank does not, which encodes polygon segmentation information. As DocBank does not provide this information, I guess we have to omit the `mask_roi` layers; would that cause a large accuracy loss, and if so, are there ways to compensate?
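One alternative to dropping the mask branch entirely (just an idea on my side, not something taken from the repo) would be to synthesise rectangular polygons from the layout bboxes so that a `segboxes` entry exists:

```python
def bbox_to_segbox(bbox):
    """Turn an axis-aligned [x1, y1, x2, y2] box into a rectangular polygon
    [x1, y1, x2, y1, x2, y2, x1, y2] usable as a degenerate segmentation."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2, y1, x2, y2, x1, y2]


# Hypothetical use before _load_polymasks runs; the 'bboxes' key inside
# ann_info_2 is an assumption on my part:
# results['ann_info_2']['segboxes'] = [
#     [bbox_to_segbox(b)] for b in results['ann_info_2']['bboxes']
# ]
```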
After omitting the `segboxes` input and the `mask` layers, the model takes a huge amount of GPU memory to train. I am using a single 10 GB RTX 3080 and reduced `samples_per_gpu` to 1 (which leads to a batch size of `samples_per_gpu * gpu_num = 1` image, I hope), but I keep encountering CUDA out-of-memory errors in the ResNeXt semantic backbone. Would you be able to give an estimate of the GPU memory required to train?
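For what it's worth, the memory-saving knobs I have been looking at are the standard mmdetection ones; whether they play well with VSR's extra semantic branch is an assumption on my part:

```python
# Possible memory savers in the config (standard mmdetection/mmcv options;
# I have not verified how they interact with the VSR semantic backbone):
fp16 = dict(loss_scale=512.)                 # mixed-precision training

model = dict(
    backbone=dict(with_cp=True),             # gradient checkpointing in ResNeXt
    backbone_semantic=dict(with_cp=True),    # assuming the semantic branch
)                                            # accepts the same flag

data = dict(samples_per_gpu=1, workers_per_gpu=1)
```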
All in all, shedding some light on how to train a layout-level DocBank model would be much appreciated.
Also, as a side question, I notice that you used BERTgrid embedding for DocBank rather than the CharGrid+SentGrid embedding described in your manuscript; why? Do the two perform equally well?
UPDATE: I didn't solve the above, but training could be started by changing the depth of the ResNeXt backbone from 101 to 50 (which I presume will inevitably cost some accuracy). This brought a new problem with evaluation: it seems that only the F1 and accuracy metrics are implemented for DocBank, while layout-level evaluation requires the bbox metric as in PubLayNet. Is there an easy way to evaluate DocBank with the bbox metric, given that the PubLayNet implementation is based on COCO?
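If nothing built-in exists, my current plan is to convert the DocBank layout ground truth and the detector output to COCO format and run pycocotools directly, roughly like this (a sketch; the actual conversion into `gt_coco_dict` / `det_results` still needs writing):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def evaluate_bbox(gt_coco_dict, det_results):
    """Run COCO-style bbox evaluation.

    gt_coco_dict: DocBank layout GT converted to COCO's
        images / annotations / categories structure.
    det_results: list of {"image_id", "category_id",
        "bbox": [x, y, w, h], "score"} dicts from the detector.
    """
    coco_gt = COCO()                 # build an in-memory COCO object
    coco_gt.dataset = gt_coco_dict
    coco_gt.createIndex()

    coco_dt = coco_gt.loadRes(det_results)
    coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()            # prints AP / bbox_mAP-style numbers
    return coco_eval.stats
```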
I was testing VSR on the two DocBank test images you provide in the repository (`78.tar_1604.08865.gz_main_0_ori.jpg` and `133.tar_1607.04116.gz_nonlinear-xxx_6_ori.jpg`), and encountered the following exception:

It seems that this particular `img` is of size `[1, 1, 3, 800, 608]`, which is, I think, due to the `img` being falsely nested in a list. `gt_bboxes[0]` seems to have the same problem. My workaround was to force `img = img[0]` and `gt_bboxes = gt_bboxes[0]`, which lifted the exception, and the test script executed successfully. However, the output result image looks as follows:

Is this behaviour expected? I imagined a much cleaner visualisation, and I thought that, e.g., the nested boxes in the figure of the 78.tar image would be resolved by your relational module. If not expected, what other info may I provide to help you identify the issue?
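For completeness, the workaround amounted to nothing more than the following (placed just before the forward call in the test script; the exact location is from memory, so treat it as illustrative):

```python
# Drop the spurious outer nesting before running the model:
img = img[0]              # [1, 1, 3, 800, 608] -> [1, 3, 800, 608]
gt_bboxes = gt_bboxes[0]  # unwrap the matching extra level of nesting
```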
Thank you for your excellent work, btw.