ibm-aur-nlp / PubLayNet

a dirty bbox data in val.json? #17

Closed jianqiaomo closed 4 years ago

jianqiaomo commented 4 years ago

Hi! I found a strange bbox in val.json: ['annotations'][914]['bbox'] = [50.73, 89.65, 498.14, 0.0]

Does the bbox mean "x_left, y_up, w, h"? But why is there a "0.0"? Thanks!

phamquiluan commented 4 years ago

Same problem. I just decided to ignore it when loading annotations :smile:

zhxgj commented 4 years ago

Thank you all for reporting this issue. Let me fix it asap.

zhxgj commented 4 years ago

I have removed the image 342203 from the dev annotation. Please download the revised dev annotation json below. dev.json.zip

jianqiaomo commented 4 years ago

Thank you very much! In fact, I also found some other data:

[annotations][915]['bbox'] = [50.73, 89.65, 498.14, 0.0]
[annotations][2511]['bbox'] = [308.61, 100.31, 240.1, 0.0]
[annotations][18686]['bbox'] = [94.53, 284.82, 406.21, 0.0]
[annotations][19318]['bbox'] = [50.73, 89.65, 498.14, 0.0]
[annotations][20288]['bbox'] = [51.04, 151.81, 238.11, 0.0]
[annotations][20442]['bbox'] = [51.04, 320.02, 493.24, 0.0]
[annotations][21049]['bbox'] = [51.04, 292.81, 493.24, 0.0]
[annotations][21171]['bbox'] = [51.04, 112.11, 493.24, 0.0]
[annotations][34791]['bbox'] = [306.14, 215.29, 246.9, 0.0]
[annotations][35141]['bbox'] = [42.52, 74.9, 246.35, 0.0]
[annotations][36214]['bbox'] = [308.66, 98.04, 240.2, 0.0]
[annotations][43200]['bbox'] = [50.73, 416.2, 498.14, 0.0]
[annotations][47565]['bbox'] = [164.41, 111.79, 510.57, 0.0]
[annotations][47595]['bbox'] = [101.73, 111.18, 391.81, 0.0]
[annotations][64952]['bbox'] = [308.66, 89.65, 240.2, 0.0]
[annotations][65220]['bbox'] = [50.73, 423.2, 498.13, 0.0]
[annotations][77743]['bbox'] = [286.3, 121.93, 70.87, 0.0]

[annotations][26038]['bbox'] = [134.76, 68.22, 0.0, 0.0]
[annotations][90133]['bbox'] = [47.03, 651.98, 515.54, 0.0]
[annotations][150521]['bbox'] = [50.35, 64.0, 510.5, 0.0]
[annotations][150531]['bbox'] = [50.35, 64.0, 510.5, 0.0]
[annotations][167204]['bbox'] = [62.0, 562.79, 496.0, 0.0]
[annotations][243124]['bbox'] = [74.43, 595.22, 472.95, 0.0]
[annotations][272739]['bbox'] = [46.49, 675.45, 504.0, 0.0]
[annotations][402514]['bbox'] = [60.0, 501.7, 498.0, 0.0]
[annotations][423968]['bbox'] = [52.16, 605.14, 490.89, 0.0]
[annotations][782584]['bbox'] = [49.04, 624.14, 478.21, 0.0]
[annotations][1127619]['bbox'] = [54.42, 93.83, 471.4, 0.0]
[annotations][1202730]['bbox'] = [56.69, 495.57, 479.9, 0.0]
[annotations][1211904]['bbox'] = [48.0, 652.92, 498.0, 0.0]
[annotations][1307067]['bbox'] = [43.6, 566.58, 508.85, 0.0]
[annotations][1344996]['bbox'] = [58.88, 567.97, 478.94, 0.0]
[annotations][1393908]['bbox'] = [40.72, 666.86, 513.84, 0.0]
[annotations][1580676]['bbox'] = [46.49, 573.8, 504.03, 0.0]
[annotations][1580688]['bbox'] = [46.52, 639.81, 504.0, 0.0]
[annotations][1589957]['bbox'] = [36.85, 663.77, 521.57, 0.0]
[annotations][1598952]['bbox'] = [51.02, 596.88, 495.35, 0.0]
[annotations][1695215]['bbox'] = [46.0, 681.92, 508.0, 0.0]
[annotations][1695382]['bbox'] = [58.0, 600.79, 508.0, 0.0]
[annotations][1715796]['bbox'] = [40.5, 623.0, 504.0, 0.0]
[annotations][1854303]['bbox'] = [34.11, 577.96, 507.96, 0.0]
[annotations][1861141]['bbox'] = [39.0, 659.75, 516.0, 0.0]
[annotations][1870418]['bbox'] = [55.14, 119.28, 0.0, 94.14]
[annotations][1908434]['bbox'] = [51.68, 512.36, 240.0, 0.0]
[annotations][1908449]['bbox'] = [51.68, 597.36, 240.0, 0.0]
[annotations][1917572]['bbox'] = [43.94, 498.96, 249.19, 0.0]
[annotations][2026931]['bbox'] = [59.84, 656.54, 478.21, 0.0]
[annotations][2047182]['bbox'] = [109.29, 643.41, 42.79, 0.0]
[annotations][2047186]['bbox'] = [102.83, 640.93, 44.42, 0.0]
[annotations][2047210]['bbox'] = [110.92, 649.15, 43.36, 0.0]
[annotations][2047214]['bbox'] = [113.75, 649.24, 42.48, 0.0]
[annotations][2072330]['bbox'] = [46.49, 530.77, 504.0, 0.0]
[annotations][2072567]['bbox'] = [65.2, 619.89, 464.88, 0.0]
[annotations][2149187]['bbox'] = [46.49, 655.1, 504.0, 0.0]
[annotations][2189686]['bbox'] = [69.73, 517.11, 471.69, 0.0]
[annotations][2189768]['bbox'] = [69.73, 581.68, 471.69, 0.0]
[annotations][2347748]['bbox'] = [48.31, 88.75, 490.17, 0.0]
[annotations][2347767]['bbox'] = [48.31, 577.75, 490.17, 0.0]
[annotations][2376011]['bbox'] = [44.79, 665.66, 504.0, 0.0]
[annotations][2422653]['bbox'] = [58.59, 541.51, 478.94, 0.0]
[annotations][2474524]['bbox'] = [30.74, 630.11, 521.44, 0.0]
[annotations][2474528]['bbox'] = [30.74, 647.27, 521.44, 0.0]
[annotations][2613360]['bbox'] = [69.73, 545.46, 471.69, 0.0]
[annotations][2613531]['bbox'] = [69.73, 597.61, 471.69, 0.0]
[annotations][2613617]['bbox'] = [69.73, 628.79, 471.69, 0.0]
[annotations][2665702]['bbox'] = [40.72, 609.86, 513.84, 0.0]
[annotations][2866969]['bbox'] = [53.86, 651.87, 480.19, 0.0]
[annotations][2866979]['bbox'] = [53.86, 652.44, 480.19, 0.0]
[annotations][2866991]['bbox'] = [53.86, 651.59, 480.19, 0.0]
[annotations][2883891]['bbox'] = [60.95, 369.02, 480.47, 0.0]
[annotations][2938869]['bbox'] = [566.78, 94.05, 0.0, 166.24]
[annotations][2947003]['bbox'] = [45.0, 630.51, 504.0, 0.0]
[annotations][2947011]['bbox'] = [45.0, 610.51, 504.0, 0.0]
[annotations][3042870]['bbox'] = [57.09, 508.95, 488.69, 0.0]
[annotations][3148454]['bbox'] = [51.68, 592.36, 240.0, 0.0]
[annotations][3165283]['bbox'] = [54.0, 604.58, 504.0, 0.0]
[annotations][3165441]['bbox'] = [54.0, 637.26, 498.0, 0.0]

However, we can filter them out when we load the annotations. So I think they won't have a great impact on us.
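
For anyone who wants to reproduce a list like the one above, here is a minimal sketch. It assumes the annotation file is a dict with an `annotations` list whose entries carry a `bbox` of the form [x, y, width, height] and, as in COCO-style files, an `id` field. The function name `find_degenerate_bboxes` and the tiny `sample` dict are made up for illustration; in practice you would pass the result of `json.load` on the annotation file.

```python
def find_degenerate_bboxes(coco):
    """Return (list index, annotation id, bbox) for every bbox
    whose width or height is zero or negative."""
    bad = []
    for idx, ann in enumerate(coco["annotations"]):
        _x, _y, w, h = ann["bbox"]  # assumed layout: [x, y, width, height]
        if w <= 0 or h <= 0:
            bad.append((idx, ann.get("id"), ann["bbox"]))
    return bad


# Hypothetical stand-in for a loaded annotation file.
sample = {
    "annotations": [
        {"id": 1, "image_id": 10, "bbox": [50.73, 89.65, 498.14, 0.0]},
        {"id": 2, "image_id": 10, "bbox": [50.0, 100.0, 200.0, 30.0]},
        {"id": 3, "image_id": 11, "bbox": [134.76, 68.22, 0.0, 0.0]},
    ]
}

for idx, ann_id, bbox in find_degenerate_bboxes(sample):
    print(f"[annotations][{idx}] (id={ann_id}) bbox = {bbox}")
```

Reporting both the list index and the annotation id avoids ambiguity about which of the two a number refers to.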

zhxgj commented 4 years ago

Thanks for identifying these problematic bboxes. When you filter them out, I would recommend removing the entire page from the dataset rather than just the bboxes. If you ignore only the bbox but keep the page, the page may contain wrong or missing annotations.
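
A minimal sketch of page-level filtering, under the assumption that the file follows the usual COCO-style layout: an `images` list with `id` fields and an `annotations` list whose entries have `image_id` and `bbox` = [x, y, width, height]. The function name `drop_pages_with_degenerate_bboxes` and the `sample` data are hypothetical.

```python
def drop_pages_with_degenerate_bboxes(coco):
    """Remove every page (image) that has at least one bbox with zero or
    negative width or height, along with all annotations on that page."""
    bad_image_ids = {
        ann["image_id"]
        for ann in coco["annotations"]
        if ann["bbox"][2] <= 0 or ann["bbox"][3] <= 0
    }
    cleaned = dict(coco)  # shallow copy; other top-level keys are carried over
    cleaned["images"] = [
        img for img in coco["images"] if img["id"] not in bad_image_ids
    ]
    cleaned["annotations"] = [
        ann for ann in coco["annotations"] if ann["image_id"] not in bad_image_ids
    ]
    return cleaned, bad_image_ids


# Hypothetical stand-in for a loaded annotation file.
sample = {
    "images": [{"id": 10}, {"id": 11}],
    "annotations": [
        {"id": 1, "image_id": 10, "bbox": [50.73, 89.65, 498.14, 0.0]},
        {"id": 2, "image_id": 10, "bbox": [50.0, 100.0, 200.0, 30.0]},
        {"id": 3, "image_id": 11, "bbox": [60.0, 70.0, 120.0, 40.0]},
    ],
}

cleaned, dropped = drop_pages_with_degenerate_bboxes(sample)
print(f"dropped pages: {sorted(dropped)}")
print(f"remaining annotations: {len(cleaned['annotations'])}")
```

Note that annotation 2 is removed even though its own bbox is valid, because it sits on a page that is no longer trusted.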