ibm-aur-nlp / PubLayNet

Do you supply text-line OCR information for PubLayNet? #19

Open phamquiluan opened 4 years ago

phamquiluan commented 4 years ago

Since the dataset is extracted from PDF and XML, I am wondering whether you also extracted text-line and OCR information. If not, could that be done in the future?

zhxgj commented 4 years ago

We did not save the content of the text lines because we focused on layout. It is certainly doable, even with the PDFs alone. You can use PDFMiner to get the text lines and their contents from the PDFs, then convert the PDFs to images to train/test an OCR model.
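A minimal sketch of the PDFMiner route, assuming the `pdfminer.six` package is installed. The function names and the `scale` parameter are illustrative and not part of PubLayNet. PDFMiner reports boxes in PDF points with the origin at the bottom-left of the page, so the helper flips the y-axis to match image coordinates; the right `scale` depends on how the page images were rendered.

```python
from typing import Iterator, Tuple

BBox = Tuple[float, float, float, float]


def pdf_to_image_bbox(bbox: BBox, page_height: float, scale: float = 1.0) -> BBox:
    """Convert (x0, y0, x1, y1) with a bottom-left origin to a top-left origin.

    `scale` (pixels per PDF point) is an assumption; set it to match the
    resolution at which the page was rendered to an image.
    """
    x0, y0, x1, y1 = bbox
    return (x0 * scale, (page_height - y1) * scale,
            x1 * scale, (page_height - y0) * scale)


def iter_text_lines(pdf_path: str) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_index, bbox_in_image_coords, text) for every text line."""
    # Imported lazily so the helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_index, page in enumerate(extract_pages(pdf_path)):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, curves, and other non-text objects
            for line in element:
                if isinstance(line, LTTextLine):
                    bbox = pdf_to_image_bbox(line.bbox, page.height)
                    yield page_index, bbox, line.get_text().strip()
```

Usage would look like `for page, bbox, text in iter_text_lines("paper.pdf"): ...`, where `paper.pdf` is a placeholder path.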

phamquiluan commented 4 years ago

@zhxgj I think it would be great if you extracted text-line and OCR information for this dataset. It would enable an "end-to-end OCR system" trained on a single unified dataset, with results that contain both layout and OCR information.

zhxgj commented 4 years ago

@phamquiluan That is an interesting idea. Thanks for sharing. I am not an expert on OCR. What is the best way to provide the information for OCR? I am thinking of something like adding, for each title/text/list element, a list of the text lines in its content.
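To make the proposal concrete, here is one possible shape for such an annotation, sketched in Python. It assumes COCO-style annotations with `[x, y, width, height]` boxes; the `text_lines` field and the `attach_text_lines` helper are hypothetical and do not exist in the released dataset, and the sample values are made up.

```python
import json
from typing import Dict, List, Sequence, Tuple


def attach_text_lines(
    annotation: Dict,
    lines: Sequence[Tuple[Sequence[float], str]],
) -> Dict:
    """Return a copy of a COCO-style annotation with a `text_lines` list.

    Each entry stores the line's own [x, y, width, height] box and its text.
    The field name `text_lines` is an assumption made for this sketch.
    """
    enriched = dict(annotation)
    enriched["text_lines"] = [
        {"bbox": list(bbox), "text": text} for bbox, text in lines
    ]
    return enriched


# Made-up example values, for illustration only.
example = attach_text_lines(
    {"id": 1, "image_id": 42, "category_id": 1, "bbox": [50.0, 80.0, 300.0, 24.0]},
    [
        ((50.0, 80.0, 300.0, 12.0), "First line of the paragraph"),
        ((50.0, 92.0, 280.0, 12.0), "second line of the paragraph."),
    ],
)
print(json.dumps(example, indent=2))
```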

phamquiluan commented 4 years ago

@zhxgj If your PDF files are of good quality (exported from LaTeX, MS Word, etc.), you can use pdftotext to extract the content and the layout information (which includes text-line information and more). Assuming you are on Ubuntu or another Linux distribution, you can read the manual via man pdftotext; pay particular attention to the '-bbox-layout' option.
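A rough sketch of how the '-bbox-layout' output could be consumed from Python. It assumes the Poppler build of pdftotext is on the PATH and that its output is XHTML with nested page, flow, block, line and word elements carrying xMin/yMin/xMax/yMax attributes; check this against your installed version. The function names and the `SAMPLE` string are illustrative, not real output from a PubLayNet file.

```python
import subprocess
import xml.etree.ElementTree as ET
from typing import Dict, List


def run_pdftotext_bbox(pdf_path: str) -> str:
    """Run `pdftotext -bbox-layout` and return its output ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-bbox-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def parse_lines(xhtml: str) -> List[Dict]:
    """Collect one record per <line>: page index, bounding box, joined words."""
    lines = []
    page_index = -1
    for elem in ET.fromstring(xhtml).iter():  # iterates in document order
        name = local_name(elem.tag)
        if name == "page":
            page_index += 1
        elif name == "line":
            words = [w.text or "" for w in elem.iter() if local_name(w.tag) == "word"]
            bbox = tuple(float(elem.get(k)) for k in ("xMin", "yMin", "xMax", "yMax"))
            lines.append({"page": page_index, "bbox": bbox, "text": " ".join(words)})
    return lines


# Hand-written sample in the assumed output shape.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body><doc>
<page width="612.000000" height="792.000000">
<flow><block xMin="72.0" yMin="80.0" xMax="300.0" yMax="92.0">
<line xMin="72.0" yMin="80.0" xMax="300.0" yMax="92.0">
<word xMin="72.0" yMin="80.0" xMax="110.0" yMax="92.0">Hello</word>
<word xMin="115.0" yMin="80.0" xMax="300.0" yMax="92.0">world</word>
</line></block></flow>
</page></doc></body></html>"""

sample_lines = parse_lines(SAMPLE)
```

On a real file this would be `parse_lines(run_pdftotext_bbox("paper.pdf"))`, with `paper.pdf` as a placeholder path.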

Also, I see that some regions are not labeled as text in the PubLayNet dataset (e.g. page numbers, headers, footers?). Since those are also text and have content, please label them as well if you intend to provide text-line information :pray:

As for the hierarchy of those boxes (text, list, paragraph): with very complex and unpredictable layouts, multiple components can conflict with each other (for example, a picture that contains a text line). I prefer a "flat" dataset, where a text line does not need to belong to a list or a paragraph. If someone needs that information, it can be recovered with a rule-based method (checking IoU or something similar).
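The rule-based recovery could look roughly like the sketch below. One caveat: plain IoU between a thin text line and a large paragraph box is small even when the line lies fully inside it, so this sketch assigns lines by coverage (intersection area divided by the line's own area) and keeps `iou` only for comparison. All names and the 0.5 threshold are assumptions; boxes are `(x0, y0, x1, y1)`.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a: Box, b: Box) -> float:
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)  # area() clamps negative extents to zero


def iou(a: Box, b: Box) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def coverage(line: Box, region: Box) -> float:
    """Fraction of the line's area that falls inside the region."""
    line_area = area(line)
    return intersection(line, region) / line_area if line_area > 0 else 0.0


def assign_lines(
    lines: Sequence[Box], regions: Sequence[Box], min_coverage: float = 0.5
) -> List[Optional[int]]:
    """For each line, return the index of the best-covering region or None."""
    assignments: List[Optional[int]] = []
    for line in lines:
        scores = [coverage(line, region) for region in regions]
        best = max(range(len(regions)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= min_coverage:
            assignments.append(best)
        else:
            assignments.append(None)  # e.g. a page number outside all regions
    return assignments
```

Lines that end up with `None` are the unlabeled cases mentioned above (page numbers, headers, footers), which a flat dataset can still keep.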

Balajanovski commented 4 years ago

If you want to do OCR with PubLayNet, I suggest using the output bounding boxes to crop the corresponding regions out of your PDF pages and feeding them to an OCR engine such as Tesseract.
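A minimal sketch of that crop-then-OCR approach, assuming Pillow and `pytesseract` are installed (plus the Tesseract binary itself) and that boxes are COCO-style `[x, y, width, height]` in image pixels. The helper names and the `pad` margin are illustrative choices.

```python
import math
from typing import Dict, List, Sequence, Tuple


def coco_bbox_to_crop_box(
    bbox: Sequence[float], image_size: Tuple[int, int], pad: int = 2
) -> Tuple[int, int, int, int]:
    """Turn [x, y, w, h] into a (left, upper, right, lower) box clamped to the image.

    `pad` adds a small margin so characters touching the box edge are not cut.
    """
    x, y, w, h = bbox
    width, height = image_size
    left = max(0, math.floor(x) - pad)
    upper = max(0, math.floor(y) - pad)
    right = min(width, math.ceil(x + w) + pad)
    lower = min(height, math.ceil(y + h) + pad)
    return left, upper, right, lower


def ocr_regions(image_path: str, annotations: List[Dict]) -> List[Dict]:
    """Crop each annotated region from the page image and run Tesseract on it."""
    # Imported lazily so the helper above works without these packages.
    from PIL import Image
    import pytesseract

    results = []
    with Image.open(image_path) as page:
        for ann in annotations:
            box = coco_bbox_to_crop_box(ann["bbox"], page.size)
            text = pytesseract.image_to_string(page.crop(box))
            results.append({"id": ann.get("id"), "bbox": ann["bbox"], "text": text.strip()})
    return results
```

Note that this yields OCR output per region, not ground-truth text, so recognition errors carry over into whatever is trained on it.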

menglin0320 commented 1 year ago

Hi, it's been years since this question was asked. Did you find a good solution?