omkar-kumbhar closed 6 years ago
It's page 1 on that file (from the 000001.png).
You can use qpdf --pages to split pages out of a file, or my pikepdf project if you want to do it programmatically. Both are dependencies of ocrmypdf so they should be available.
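As a rough sketch of the qpdf route, the command can be assembled and run from Python. The helper names build_split_command and split_pages and the file names are made up for illustration; the `--empty --pages INPUT RANGE -- OUTPUT` form is qpdf's page-selection syntax.

```python
import subprocess


def build_split_command(src, page_range, dst):
    """Return the qpdf argv that copies `page_range` of `src` into `dst`."""
    # --empty starts from a blank document; --pages ... -- selects the pages.
    return ["qpdf", "--empty", "--pages", src, page_range, "--", dst]


def split_pages(src, page_range, dst):
    """Run qpdf; raises CalledProcessError if qpdf exits with a non-zero code."""
    subprocess.run(build_split_command(src, page_range, dst), check=True)
```

For example, split_pages("input.pdf", "1", "page1.pdf") would write only page 1 of a hypothetical input.pdf to page1.pdf.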
If you want me to sign an NDA, then we'll need a contract for services as well. I realize this may be inconvenient, but I draw a line here: if I am going to take on legal obligations to someone and expose myself to legal risks, then I need to be compensated. And I'm quite happy to do so – this program wouldn't be half as good as it is today without such contributions.
Thanks a lot for the reply.
Let's see what I can do about it. I shall mail you if there are more specific requirements.
Hey there J,
I had a couple of PDFs that were segfaulting at specific pages. I think this is still an unresolved issue with Tesseract. Please find the log below.
Task enters queue = 'ocrmypdf._pipeline.select_image_layer'
DEBUG - 1: convert
DEBUG - 1: convert done
Completed Task = 'ocrmypdf._pipeline.select_image_layer'
DEBUG - ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.ocr.png', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.text', 'pdf', 'txt']
WARNING - 1: [tesseract] unsure about page orientation
WARNING - 1: [tesseract] lots of diacritics - possibly poor OCR
ERROR - 1: [tesseract] contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
DEBUG -
Original exception:
ERROR - Error occurred while running this command: (Command '['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.ocr.png', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.text', 'pdf', 'txt']' died with <Signals.SIGSEGV: 11>.)
Because of the segfault, I am unable to process the other pages, which are perfectly legible. I have made a crude implementation that splits the segfaulting PDF, runs OCRmyPDF on each page, and then merges the resulting PDFs. This takes a lot of time.
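The per-page approach could look roughly like the sketch below. It assumes the PDF has already been split into single-page files and that ocrmypdf exits with a non-zero code when Tesseract crashes on a page. The function name, the `.ocr.pdf` suffix, and the injectable `run` parameter are illustrative choices, not part of OCRmyPDF.

```python
import subprocess


def ocr_pages_with_fallback(page_files, run=subprocess.run):
    """OCR each single-page PDF; keep the un-OCRed page if ocrmypdf fails.

    Returns the list of files to merge, in page order.
    """
    files_to_merge = []
    for page_file in page_files:
        ocr_file = page_file + ".ocr.pdf"
        result = run(["ocrmypdf", "-l", "deu", page_file, ocr_file])
        # A non-zero exit code is treated as "no usable output", so the
        # original page (without a text layer) is kept in its place.
        files_to_merge.append(ocr_file if result.returncode == 0 else page_file)
    return files_to_merge
```

The returned list can then be merged back into one document, for instance with qpdf --pages. Each page still pays the start-up cost of a separate ocrmypdf run, which is why this is slow.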
Assuming that the ocrmypdf._pipeline.ocr_tesseract_textonly_pdf task raises an exception identifying the specific image where it failed, could there be a page number option for re-running OCR on only those pages that do not have an issue?
Something like: ocrmypdf --page 1-30,32-34
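To make the proposed syntax concrete, here is a minimal sketch of how a spec like 1-30,32-34 could be expanded into page numbers. The function parse_page_ranges is hypothetical and only illustrates the request; it is not an existing ocrmypdf function.

```python
def parse_page_ranges(spec):
    """Expand a spec such as "1-30,32-34" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
        else:
            start = end = int(part)
        # Pages are 1-based and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

With the spec above, page 31 (the one that crashes in this example) is simply left out of the result.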
PS. Sorry, I cannot share the PDFs, but I did read a previous thread where you mentioned an NDA, which might help in such cases. If you can suggest a workable solution from the log I shared, that's fine; otherwise we can work something out.
Thanks and keep up the good work.