ibm-aur-nlp / PubTabNet


The y coordinate value of cell bbox seems to be inaccurate #25

Open qyhou opened 2 years ago

qyhou commented 2 years ago

Thank you for providing the large-scale dataset.

While converting the HTML annotations into a split structure, I found that the y coordinates of the cell bboxes seem to be inaccurate.

E.g. PMC5842743_009_00, which is an 11x6 table.

A03 line: [2, 65, 19, 76], [31, 65, 46, 76], [68, 65, 82, 76], [110, 65, 133, 76], [165, 65, 176, 76], [211, 65, 228, 76]

A04 line: [2, 78, 20, 89], [31, 78, 46, 89], [71, 75, 79, 90], [118, 75, 125, 90], [167, 75, 174, 90], [216, 75, 223, 90]

The y1 of the upper cells is greater than the y0 of some lower cells (76 > 75), so the two rows overlap vertically.

I randomly checked 100 tables in the training set and found that 37 of them show this problem.
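A minimal sketch of the kind of check described above, using the two rows quoted from PMC5842743_009_00. It assumes each bbox is `[x0, y0, x1, y1]` and that the cells have already been grouped into non-empty rows from top to bottom; the function name `overlapping_row_pairs` is hypothetical and not part of the PubTabNet tooling.

```python
from typing import List, Sequence

BBox = Sequence[int]  # assumed layout: [x0, y0, x1, y1]


def overlapping_row_pairs(rows: List[List[BBox]]) -> List[int]:
    """Return every index i where row i vertically overlaps row i + 1.

    Assumes rows are ordered top to bottom and each row is non-empty.
    """
    overlaps = []
    for i in range(len(rows) - 1):
        upper_bottom = max(box[3] for box in rows[i])   # largest y1 in the upper row
        lower_top = min(box[1] for box in rows[i + 1])  # smallest y0 in the lower row
        if upper_bottom > lower_top:
            overlaps.append(i)
    return overlaps


# The two rows reported for PMC5842743_009_00.
row_a03 = [[2, 65, 19, 76], [31, 65, 46, 76], [68, 65, 82, 76],
           [110, 65, 133, 76], [165, 65, 176, 76], [211, 65, 228, 76]]
row_a04 = [[2, 78, 20, 89], [31, 78, 46, 89], [71, 75, 79, 90],
           [118, 75, 125, 90], [167, 75, 174, 90], [216, 75, 223, 90]]

print(overlapping_row_pairs([row_a03, row_a04]))  # [0], since 76 > 75
```

The comparison is strict, so rows that merely touch (y1 of the upper row equal to y0 of the lower row) are not counted as overlapping.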

Thanks