gkovacs / pdfocr

Adds text to PDF files using the cuneiform OCR software
MIT License
325 stars, 49 forks

pdftk error: Unexpected Exception in open_reader() #21

Open shivams opened 9 years ago

shivams commented 9 years ago

For some PDF files, pdftk throws this error:

Error: Unexpected Exception in open_reader()
Unhandled Java Exception:

This bug has been reported on the pdftk Launchpad tracker: https://bugs.launchpad.net/ubuntu/+source/pdftk/+bug/774052

The bug does not appear to have been fixed, and because of it pdfocr.rb also fails in many cases. I have a temporary workaround:

pdftk fails completely on certain PDFs. However, if we read such a PDF with another tool and write it out again, pdftk reads the regenerated file just fine. For example, Ghostscript can recreate the PDF like this:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=newfile.pdf myfile.pdf

Now pdftk will read the newly created PDF file just fine.
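
Since pdfocr.rb is a Ruby script, the workaround could be wired in roughly as follows. This is only a sketch, not the actual pdfocr.rb code: the helper names (`gs_rewrite_command`, `pdftk_readable?`, `readable_copy`) are hypothetical, and it assumes that `pdftk <file> dump_data` exits with a non-zero status when pdftk cannot open the file and that the Ghostscript binary is named `gs` (on Windows it is usually `gswin64c`).

```ruby
require "open3"

# Build the Ghostscript command shown above as an argument array, so that
# file names are passed to gs directly without going through a shell.
def gs_rewrite_command(input_pdf, output_pdf)
  ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
   "-sOutputFile=#{output_pdf}", input_pdf]
end

# Probe whether pdftk can open the file at all.
# Assumption: pdftk exits non-zero when open_reader() fails.
def pdftk_readable?(pdf)
  _stdout, _stderr, status = Open3.capture3("pdftk", pdf, "dump_data")
  status.success?
end

# Return a path that pdftk can read: the original file if it already works,
# otherwise a copy regenerated by Ghostscript inside workdir.
def readable_copy(pdf, workdir)
  return pdf if pdftk_readable?(pdf)

  rewritten = File.join(workdir, "rewritten.pdf")
  ok = system(*gs_rewrite_command(pdf, rewritten))
  raise "ghostscript could not rewrite #{pdf}" unless ok && File.exist?(rewritten)
  rewritten
end
```

A caller would then hand `readable_copy(input, tmpdir)` to pdftk instead of the original path, so files that pdftk already handles are left untouched.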

It would be great if someone is willing to apply this workaround. Otherwise, I will make the changes myself and send a pull request.

PS: A sample file that pdftk fails to read is available here: https://www.jstage.jst.go.jp/article/jsmec/45/3/45_3_730/_pdf

mcdlee commented 9 years ago

I hit a similar error on Windows when the path to the PDF file contained non-Latin characters, such as Chinese. Moving the PDF file to a path without Chinese characters made it work.

ahmad-elkomey commented 4 years ago

> I hit a similar error on Windows when the path to the PDF file contained non-Latin characters, such as Chinese. Moving the PDF file to a path without Chinese characters made it work.

Thanks! That is a very useful comment. The path I had a problem with contained whitespace. I moved the files to another path that doesn't contain whitespace.

mkyildiz01 commented 3 years ago

> I hit a similar error on Windows when the path to the PDF file contained non-Latin characters, such as Chinese. Moving the PDF file to a path without Chinese characters made it work.

When I changed the path, I could also combine my files. Thank you!
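
The path workaround described in the comments above could be automated along these lines. This is a sketch under stated assumptions, not part of pdfocr.rb: `risky_path?` and `with_safe_path` are hypothetical helper names, and the approach only helps if the system temporary directory itself has a path without whitespace or non-ASCII characters.

```ruby
require "fileutils"
require "tmpdir"

# True if the path contains whitespace or any non-ASCII character,
# the two cases reported in this thread as triggering the pdftk error.
def risky_path?(path)
  path.match?(/\s/) || !path.ascii_only?
end

# Yield a path that is safe to pass to pdftk. If the original path looks
# risky, the file is first copied to a plain name in a temporary directory,
# which is removed again once the block returns.
def with_safe_path(pdf)
  return yield(pdf) unless risky_path?(File.expand_path(pdf))

  Dir.mktmpdir("pdfocr") do |dir|
    safe = File.join(dir, "input.pdf")
    FileUtils.cp(pdf, safe)
    yield(safe)
  end
end
```

For example, `with_safe_path(input) { |path| system("pdftk", path, "dump_data") }` would run pdftk against the temporary copy whenever the original path contains spaces or Chinese characters.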