sandipan1 closed this issue 4 years ago
In the case of an image-based PDF, you will have to use the Python package ocrmypdf (https://ocrmypdf.readthedocs.io/en/latest/) first, and then run Camelot with YOLO on the output.
Thanks, ocrmypdf works well on typed image-based PDFs. However, I did not get good results on handwritten text. Any suggestions on this?
You can try this repo (https://github.com/Breta01/handwriting-ocr) but I'm not sure you will be able to get the tables.
Thanks, I'll check it out. Also, I am trying to detect empty and filled checkboxes and circles. Do you think a pipeline similar to your current YOLOv3-based table detection pipeline could be built for checkboxes?
Yes, you could train a YOLO model on data labeled with 4 classes: empty checkbox, filled checkbox, empty circle, and filled circle. The model should be capable of detecting all 4 classes.
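As a sketch, a Darknet-style setup for those four classes might look like the following (the file names and paths are illustrative, not from this thread):

```
# obj.names — one label per line, in the order used when annotating
empty_checkbox
filled_checkbox
empty_circle
filled_circle

# obj.data — points Darknet at the 4-class dataset
classes = 4
train  = data/train.txt
valid  = data/valid.txt
names  = data/obj.names
backup = backup/
```

In `yolov3.cfg` you would also set `classes=4` in each `[yolo]` section and `filters=27` (i.e. (classes + 5) × 3) in the `[convolutional]` layer just before it.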
Thanks for the suggestion.
I am not able to see any output if the PDF originally contains scanned images. There is also a UserWarning:
UserWarning: page-1 is image-based, camelot only works on text-based pages. [stream.py:443]
Is there any way I can make it work on image-based PDFs too?