ibm-aur-nlp / PubLayNet

Mismatch between image and PDF #33

Open gagaein opened 3 years ago

gagaein commented 3 years ago

Firstly, thank you for this useful dataset. I have downloaded PubLayNet in both image and PDF form, but I noticed that the image and the PDF of the same page are NOT the same size. For example, the PDF page is 600.05 x 792, but the JPG image is 602 x 792. So the annotations should be slightly different for these two file types. How can I solve this problem? Thank you again!

ajjimeno commented 3 years ago

Hi gagaein, we prepared the data set to identify the layout from images directly. We do not have the data to resize the images to match the original PDF page. I am wondering if a scaling factor between the image and the PDF page could be estimated and then the annotations scaled back to the PDF page.
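
A minimal sketch of that scaling idea, not an official PubLayNet utility. It assumes the annotations are COCO-style `[x, y, width, height]` boxes in image pixels, that the image covers the whole PDF page with no cropping or offset, and that the PDF page size in points has already been read with a PDF library of your choice. The function name `scale_bbox_to_pdf` and the `flip_y` option are hypothetical, introduced only for illustration.

```python
def scale_bbox_to_pdf(bbox, image_size, pdf_size, flip_y=False):
    """Map a box from image pixels to PDF page units.

    bbox:       [x, y, width, height] in image pixels (assumed COCO-style).
    image_size: (width, height) of the image in pixels.
    pdf_size:   (width, height) of the PDF page, e.g. in points.
    flip_y:     if True, convert from a top-left origin (images) to a
                bottom-left origin (the default PDF coordinate system).
    """
    x, y, w, h = bbox
    img_w, img_h = image_size
    pdf_w, pdf_h = pdf_size

    # Separate factors per axis, since the two axes may not scale equally.
    sx = pdf_w / img_w
    sy = pdf_h / img_h

    new_x, new_y = x * sx, y * sy
    new_w, new_h = w * sx, h * sy

    if flip_y:
        # The box's bottom edge in image space becomes its lower-left y in PDF space.
        new_y = pdf_h - (new_y + new_h)

    return [new_x, new_y, new_w, new_h]


if __name__ == "__main__":
    # Illustrative sizes only: a 602 x 792 pixel image and a 600.05 x 792 point page.
    print(scale_bbox_to_pdf([100, 200, 300, 50], (602, 792), (600.05, 792)))
```

Polygon segmentations could be handled the same way by multiplying each x coordinate by `sx` and each y coordinate by `sy`. If the image was cropped or padded relative to the page, a pure scale will not be enough and an offset would also have to be estimated.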