gkovacs / pdfocr

Adds text to PDF files using the cuneiform OCR software
MIT License

Cropped pages after run of pdfocr #42

Open InCaseItWorks opened 3 years ago

InCaseItWorks commented 3 years ago

Hello

There seems to be a problem with the final step of the pdfocr script: running pdfocr produces a heavily cropped PDF file in which most of each page is missing.

Actual Result: Cropped PDF file

Expected Result: PDF file with the original page dimensions

Description: I'm running the command in a script like so: pdfocr -i $FILENAME.tmp.pdf -l deu -w . -k -o $FILENAME.pdf

With the -k option turned on, the "merged.pdf" file is kept in the working directory ("pdfocr"), and it is still perfectly fine: correct size, OCRed text, and all. Only the final PDF is heavily cropped.

Comparing the metadata of the final file with that of "merged.pdf", dumped with "pdftk merged.pdf dump_data" (and likewise for the final file), shows that the page dimensions differ.
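The comparison described above can be automated. The following is only a minimal sketch, not part of pdfocr: the helper names page_dimensions and changed_pages are made up for illustration, and the code assumes that "pdftk FILE dump_data" prints one "PageMediaNumber: N" line followed by a "PageMediaDimensions: W H" line per page, which may vary between pdftk versions.

```ruby
require "open3"

# Hypothetical helper: parse the text printed by `pdftk FILE dump_data`
# and return a hash of page number => [width, height] in PDF points.
# Assumes the "PageMediaNumber" / "PageMediaDimensions" field layout.
def page_dimensions(dump_text)
  dims = {}
  page = nil
  dump_text.each_line do |line|
    key, value = line.strip.split(": ", 2)
    case key
    when "PageMediaNumber"
      page = value.to_i
    when "PageMediaDimensions"
      dims[page] = value.split.map(&:to_f) if page
    end
  end
  dims
end

# Hypothetical helper: page numbers whose dimensions differ between two dumps.
def changed_pages(before_text, after_text)
  before = page_dimensions(before_text)
  after = page_dimensions(after_text)
  (before.keys | after.keys).sort.reject { |n| before[n] == after[n] }
end

# Command-line use: ruby compare_dims.rb merged.pdf final.pdf
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  dumps = ARGV.map do |path|
    out, status = Open3.capture2("pdftk", path, "dump_data")
    abort "pdftk failed on #{path}" unless status.success?
    out
  end
  pages = changed_pages(*dumps)
  puts pages.empty? ? "page dimensions match" : "pages that differ: #{pages.join(', ')}"
end
```

Run against "merged.pdf" and the final file, this should list the pages whose dimensions changed in the last step, provided the assumed field names match what the installed pdftk prints.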

Commenting out line 374 in "pdfocr.rb" prevents the final file from being created and the metadata from being updated, so everything up to that point seems to work properly. The line is:

sh "pdftk", tmp+'/merged.pdf', "update_info", tmp+'/pdfinfo.txt', "output", outfile

Unfortunately, I don't 'speak' Ruby, so I'm not comfortable editing the pdfocr script myself. For now I work around the problem by deleting the final file and moving "merged.pdf" into its place.
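For anyone who wants to script the same workaround, here is a rough sketch. It is an illustration only, not a fix for pdfocr itself: the helper name use_merged_pdf is hypothetical, and both paths are supplied by the caller, because the exact location of "merged.pdf" depends on the -w option.

```ruby
require "fileutils"

# Hypothetical helper: replace the cropped final PDF with the intact
# merged.pdf that pdfocr leaves behind when it is run with -k.
def use_merged_pdf(merged_path, outfile)
  raise ArgumentError, "#{merged_path} not found" unless File.file?(merged_path)

  FileUtils.rm_f(outfile)            # delete the cropped result, if present
  FileUtils.mv(merged_path, outfile) # move merged.pdf into its place
  outfile
end
```

Note that a file produced this way does not carry the metadata that the pdftk "update_info" step would have written.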

My System: Ubuntu 20.10, pdfocr 0.1.4, ruby 2.7.1p83, pdftk 3.1.1

If there's any further information I can provide, please let me know.