Can there be a page number option? Tesseract segfaults on specific pages.

omkar-kumbhar commented 6 years ago

Hey there J,

I had a couple of PDFs that were segfaulting at specific pages. I think this is still an unresolved issue with Tesseract. Please find the log below.

Task enters queue = 'ocrmypdf._pipeline.select_image_layer'
DEBUG - 1: convert
DEBUG - 1: convert done
Completed Task = 'ocrmypdf._pipeline.select_image_layer'
DEBUG - ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.ocr.png', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.text', 'pdf', 'txt']
WARNING - 1: [tesseract] unsure about page orientation
WARNING - 1: [tesseract] lots of diacritics - possibly poor OCR
ERROR - 1: [tesseract] contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
DEBUG -

Original exception:

Exception #1
  'subprocess.CalledProcessError(Command '['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.ocr.png', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.text', 'pdf', 'txt']' died with <Signals.SIGSEGV: 11>.)' raised in ...
   Task = def ocrmypdf._pipeline.ocr_tesseract_textonly_pdf(...):
   Job  = [[.../000001.ocr.png] -> [.../000001.text.pdf, .../000001.text.txt], <LoggingProxy>, <ocrmypdf._jobcontext.JobContext>]

Traceback (most recent call last):
  File "/home/ansible/anaconda3/lib/python3.6/site-packages/ruffus/task.py", line 748, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/home/ansible/anaconda3/lib/python3.6/site-packages/ruffus/task.py", line 566, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/ansible/anaconda3/lib/python3.6/site-packages/ocrmypdf/_pipeline.py", line 727, in ocr_tesseract_textonly_pdf
    log=log)
  File "/home/ansible/anaconda3/lib/python3.6/site-packages/ocrmypdf/exec/tesseract.py", line 359, in generate_pdf
    raise e from e
  File "/home/ansible/anaconda3/lib/python3.6/site-packages/ocrmypdf/exec/tesseract.py", line 345, in generate_pdf
    timeout=timeout)
  File "/home/ansible/anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/home/ansible/anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.ocr.png', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.text', 'pdf', 'txt']' died with <Signals.SIGSEGV: 11>.

ERROR - Error occurred while running this command: (Command '['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.ocr.png', '/tmp/com.github.ocrmypdf.v1rhtscn/000001.text', 'pdf', 'txt']' died with <Signals.SIGSEGV: 11>.)

Because of the segfault, I am unable to process the other pages, which are perfectly legible. I have made a crude implementation that splits the segfaulting PDF, runs OCRmyPDF on each page, and then merges the resulting PDFs. This takes a lot of time.

Assuming that the ocrmypdf._pipeline.ocr_tesseract_textonly_pdf task raises an exception identifying the specific image where it failed, could there be a page number option for re-running OCR on only the pages that do not have an issue?

Something like: ocrmypdf --page 1-30,32-34
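
To make the proposed syntax concrete, here is a minimal sketch of how a spec such as 1-30,32-34 could be expanded into page numbers. The --page option is only a proposal in this thread, and parse_page_ranges is a hypothetical helper name, not part of ocrmypdf.

```python
def parse_page_ranges(spec):
    """Expand a spec such as '1-30,32-34' into a sorted list of 1-based page numbers.

    Hypothetical helper for illustration; not an ocrmypdf function.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as '1-3,,5'
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

With the example above, page 31 would be the only page skipped among pages 1 through 34.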

PS. Sorry, I cannot share the PDFs, but I did read a previous thread where you mentioned an NDA, which might help in such cases. If you can suggest a workable solution from the log I shared, that's fine; otherwise we can work something out.

Thanks and keep up the good work.

jbarlow83 commented 6 years ago

It's page 1 of that file (the page number comes from the temporary file name, 000001.ocr.png).

You can use qpdf --pages to split pages out of a file, or my pikepdf project if you want to do it programmatically. Both are dependencies of ocrmypdf so they should be available.

If you want me to sign an NDA, then we'll need a contract for services as well. I realize this may be inconvenient, but I draw a line here: if I am going to take on legal obligations to someone and expose myself to legal risks, then I need to be compensated. And I'm quite happy to do so – this program wouldn't be half as good as it is today without such contributions.

omkar-kumbhar commented 6 years ago

Thanks a lot for the reply.

Let's see what I can do about it. I shall email you if there are more specific requirements.

ocrmypdf / OCRmyPDF

Can there be a page number option? Tesseract segfaults on specific pages. #302