rescribe / bookpipeline

Tools to process books in a cloud based pipeline system
https://rescribe.xyz/bookpipeline/
GNU General Public License v3.0

PDF specification #2

Open aethralis opened 6 months ago

aethralis commented 6 months ago

I have some issues with the PDF that rescribe creates. I'm using the latest version (1.2.0) and am having trouble importing the produced text into R with pdftools. The error message is:

PDF error (142084): Unknown operator 'Inf'
PDF error (142084): Too few (0) args to 'Tz' operator

When repairing the PDF with Ghostscript (with the options -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress), it gives the following advice:

The following errors were encountered at least once while processing this file:
    missing white space after number
    error executing PDF token

Please notify the author of the software that produced this file that it does not conform to Adobe's published PDF specification.

After repairing the file with gs, the import into R with pdftools works fine.
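One plausible reading of the two pdftools errors above is that a non-numeric token such as `Inf` was written where the `Tz` (horizontal text scaling) operator expects a number. This is an assumption, not something confirmed in this thread. The following Python sketch shows how such a case could be spotted in an already-decompressed content stream. The function name `find_bad_tz_operands` is hypothetical, and the whitespace tokenizer is deliberately naive (it does not understand PDF string literals).

```python
import re

# A PDF numeric operand: optional sign, digits with an optional fraction.
_NUMBER = re.compile(rb"[+-]?(\d+\.?\d*|\.\d+)")


def find_bad_tz_operands(stream: bytes):
    """Return (offset, operand) pairs for each Tz whose preceding token is not a number.

    Assumes `stream` is an uncompressed page content stream and that
    splitting on whitespace is good enough for a quick check.
    """
    tokens = [(m.start(), m.group()) for m in re.finditer(rb"\S+", stream)]
    bad = []
    for i, (pos, tok) in enumerate(tokens):
        if tok != b"Tz":
            continue
        # Tz with nothing before it has no operand at all.
        operand = tokens[i - 1][1] if i > 0 else None
        if operand is None or not _NUMBER.fullmatch(operand):
            bad.append((pos, operand))
    return bad


sample = b"BT /f 12 Tf Inf Tz (a) Tj 95.5 Tz ET"
for offset, operand in find_bad_tz_operands(sample):
    print(offset, operand)
```

A stricter check would parse the stream with a real PDF library. The point here is only that `Inf Tz` leaves `Tz` without a valid argument, which would be consistent with both reported errors.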

nickjwhite commented 6 months ago

Thanks for finding this and sending such a helpful writeup! I think I've found and fixed the issue, though it only happens with some inputs, and I haven't found one which reproduces your issue. Are you able to test a build for me? If you use Linux, this is a test build I just made: https://rescribe.xyz/tmp/rescribe-fixissue2-v1 - If you'd rather some other OS, let me know and I can make a test for you (or you can build one yourself from the fixpdf-issue2 branch).

aethralis commented 6 months ago

Thank you for addressing this, I really appreciate it!

I tested the new build: 1) If I run rescribe-fixissue2-v1 without flags, I get the following error:

2024/03/12 16:17:51 No getgbook found [tried getgbook], google book downloading will be disabled, either set -gbookcmd on the command line or use the official build which includes an embedded getgbook.
Error: Training files rescribev9_fast.traineddata or /tmp/tesseract3617539103/tessdata/rescribev9_fast.traineddata could not be opened.
Set the `-t` flag with path to a tesseract .traineddata file.
2) When following the suggestion (and downloading lat.traineddata): ./rescribe-fixissue2-v1 -t lat.traineddata test/

The resulting PDF imports into R fine, but if I repair it again with gs (just to see whether it gives any suggestions), I get the following message:

The following warnings were encountered at least once while processing this file:
    File has Embedded files which could not be preserved

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> �� <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

Thanks again!

nickjwhite commented 5 months ago

Thanks for checking and reporting that the original issue was fixed! I have just released v1.3.0, which includes the fix, and includes the embedded training data as other proper releases do (well done figuring out how to get that test build to work without it, by the way!)

Regarding the new issue you found, File has Embedded files which could not be preserved, I can't reproduce that on my end (yet). When I open a test PDF I created with gs test.pdf, it just shows each page without complaining. I'm not very familiar with Ghostscript; can you give more clues as to how to reproduce this, please? It's possible this only occurs with some created PDFs, so if you are able to attach an example PDF which has the issue, that would be helpful too.

aethralis commented 5 months ago

Thanks again! When viewing the file with gs test.pdf it does indeed not give any errors, but when using gs -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -o out.pdf test.pdf I still get the warnings. These are not showstoppers, but it may be worth having a look at what causes them.
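To make the reproduction above easier to script, here is a minimal Python sketch that runs the same gs command and pulls out the items listed under "The following warnings/errors were encountered". The helper names (`build_gs_repair_command`, `collect_diagnostics`, `run_repair`) are hypothetical. The parser assumes the output layout quoted earlier in this thread, and it assumes a `gs` binary is on the PATH.

```python
import subprocess


def build_gs_repair_command(src, dst):
    """Build the gs invocation quoted in this thread."""
    return ["gs", "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/prepress", "-o", dst, src]


def collect_diagnostics(output):
    """Return the unique items listed under 'The following ... processing this file:' headers."""
    found = []
    in_list = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("****"):
            # Drop the '****' marker gs puts in front of some lines.
            line = line.lstrip("*").strip()
        if not line:
            in_list = False  # a blank line ends the current list
            continue
        if line.startswith("The following") and line.endswith("processing this file:"):
            in_list = True
            continue
        if in_list and line not in found:
            found.append(line)
    return found


def run_repair(src, dst):
    """Run gs on src, write dst, and return the listed warnings and errors."""
    proc = subprocess.run(
        build_gs_repair_command(src, dst),
        capture_output=True,
        text=True,
        errors="replace",  # tolerate undecodable bytes such as a garbled producer string
    )
    # Check both streams, since the sketch does not assume which one gs uses.
    return collect_diagnostics(proc.stdout + "\n" + proc.stderr)
```

Calling `run_repair("test.pdf", "out.pdf")` on a PDF produced by rescribe should then return a list such as the "File has Embedded files which could not be preserved" warning, if the file triggers it.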