No result on image based pdfs

ismail-mebsout / Parsing-PDFs-using-YOLOV3

Parsing pdf tables using YOLOV3


No result on image based pdfs #1

Closed sandipan1 closed 4 years ago

sandipan1 commented 4 years ago

I don't get any output when the PDF consists of scanned images. Camelot also emits this warning: UserWarning: page-1 is image-based, camelot only works on text-based pages. [stream.py:443]

Is there any way to make it work on image-based PDFs too?

ismail-mebsout commented 4 years ago

For an image-based PDF, you first have to run it through the Python package ocrmypdf (https://ocrmypdf.readthedocs.io/en/latest/) to add a text layer, and then run Camelot with YOLO on the OCR output.
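A minimal sketch of that two-step flow, assuming the `ocrmypdf` command-line tool and the `camelot` package are installed. The helper names `build_ocr_command` and `ocr_then_extract` are hypothetical, and the repo's YOLO table-detection step is left out here; only the OCR pass and a plain Camelot `stream` extraction are shown.

```python
import subprocess


def build_ocr_command(src_pdf, dst_pdf):
    """Build the ocrmypdf CLI call that adds a text layer to a scanned PDF."""
    # --skip-text leaves pages that already contain text untouched.
    return ["ocrmypdf", "--skip-text", src_pdf, dst_pdf]


def ocr_then_extract(src_pdf, dst_pdf, pages="1"):
    """OCR the PDF, then run Camelot on the OCR output (hypothetical helper)."""
    # Step 1: write a searchable copy of the scanned PDF.
    subprocess.run(build_ocr_command(src_pdf, dst_pdf), check=True)

    # Step 2: Camelot can now read the text layer. Imported lazily so the
    # command builder above works even where camelot is not installed.
    import camelot

    tables = camelot.read_pdf(dst_pdf, pages=pages, flavor="stream")
    # Each detected table exposes its contents as a pandas DataFrame.
    return [tables[i].df for i in range(len(tables))]
```

Calling `ocr_then_extract("scanned.pdf", "scanned_ocr.pdf")` (file names are placeholders) should return one DataFrame per table Camelot finds on the requested pages. Extraction quality depends on how well the OCR step recognizes the text.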

sandipan1 commented 4 years ago

Thanks, ocrmypdf works well on typed image-based PDFs. However, I did not get good results on handwritten text. Any suggestions for that?

ismail-mebsout commented 4 years ago

You can try this repo (https://github.com/Breta01/handwriting-ocr), but I'm not sure you will be able to extract the tables with it.

sandipan1 commented 4 years ago

Thanks, I'll check it out. I am also trying to detect empty and filled checkboxes and circles. Do you think a pipeline for checkboxes could be built along the lines of your current YOLOv3 pipeline for table detection?

ismail-mebsout commented 4 years ago

Yes, you could train a YOLO model on data labeled with 4 classes: empty checkbox, empty circle, filled checkbox and filled circle. The model should be capable of detecting these 4 classes.
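As a rough illustration of what the labeled data could look like, here is a sketch that writes one annotation line in the Darknet/YOLO text format (class id followed by the box center and size, all normalized to the image dimensions). The class names, their order, and the helper names are assumptions made for this example, not something taken from the repo.

```python
# Hypothetical class list; the index in this list is the YOLO class id.
CLASSES = ["empty_checkbox", "empty_circle", "filled_checkbox", "filled_circle"]


def to_yolo_label(class_name, box, image_size):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    class_id = CLASSES.index(class_name)
    # YOLO expects the box center and size as fractions of the image size.
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def yolo_v3_filters(num_classes, anchors_per_scale=3):
    """Filter count of the conv layer before each YOLOv3 detection layer."""
    # Each anchor predicts 4 box coordinates, 1 objectness score and the classes.
    return (num_classes + 5) * anchors_per_scale
```

For 4 classes, `yolo_v3_filters(4)` gives 27, which is the value the Darknet YOLOv3 config needs in the convolutional layer preceding each detection layer when the class count is changed.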

sandipan1 commented 4 years ago

Thanks for the suggestion.
