kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

One PDF where random strings are dropped (depending on filename length?) #124

Open kcstrong opened 3 years ago

kcstrong commented 3 years ago

We're currently testing pdfalto. Specifically, we're converting a lot of PDFs to HTML via the XML output of pdfalto (as we were not quite satisfied with the result of any of the pdftohtml tools we tested). Most of the results are excellent, though we are finding some issues. We're using pdfalto v0.5 compiled on Windows via Cygwin as per the instructions.

In this case, different strings are being dropped from one PDF (the only one so far), always beginning on or around page 45, and which strings are dropped appears to depend on the length of the filename. I stumbled upon this by accident. For example, strings are dropped from foo.pdf. If I rename the file to foo-99.pdf, different strings are dropped. If I rename it to bar.pdf, the same strings are dropped as with foo.pdf. I've renamed and processed the same file at least ten times, and each result differs from every other except where the filenames were the same length.
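For anyone trying to reproduce this, the renaming test could be scripted roughly as follows. This is only a sketch under several assumptions: that `pdfalto` is on the PATH and accepts `pdfalto <PDF-file> <xml-file>`, and that the text of the output lives in the `CONTENT` attribute of ALTO `String` elements. The helper names (`extract_strings`, `group_by_name_length`, `run_trials`) are made up for illustration. It compares only the extracted strings rather than the raw XML, since the XML may also record the input filename.

```python
from __future__ import annotations

import hashlib
import shutil
import subprocess
import sys
import tempfile
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def extract_strings(alto_xml: str | bytes) -> list[str]:
    """Return the CONTENT of every ALTO String element, in document order."""
    root = ET.fromstring(alto_xml)
    strings = []
    for element in root.iter():
        # Drop the "{namespace}" prefix, if any, before comparing tag names.
        if element.tag.rsplit("}", 1)[-1] == "String":
            strings.append(element.get("CONTENT", ""))
    return strings


def group_by_name_length(digests: dict[str, str]) -> dict[int, set[str]]:
    """Map each filename length to the set of distinct text digests seen."""
    groups: dict[int, set[str]] = defaultdict(set)
    for name, digest in digests.items():
        groups[len(name)].add(digest)
    return dict(groups)


def run_trials(source_pdf: Path, names: list[str], pdfalto: str = "pdfalto") -> dict[str, str]:
    """Copy the PDF under each name, convert it, and digest the extracted text."""
    digests = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            pdf_copy = Path(tmp) / name
            xml_out = pdf_copy.with_suffix(".xml")
            shutil.copyfile(source_pdf, pdf_copy)
            # Assumed invocation: pdfalto <PDF-file> <xml-file>
            subprocess.run([pdfalto, str(pdf_copy), str(xml_out)], check=True)
            text = "\n".join(extract_strings(xml_out.read_bytes()))
            digests[name] = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digests


if __name__ == "__main__":
    # Two names of length 7 and two of length 10, as in the report above.
    trial_names = ["foo.pdf", "bar.pdf", "foo-99.pdf", "bar-99.pdf"]
    results = run_trials(Path(sys.argv[1]), trial_names)
    for length, seen in sorted(group_by_name_length(results).items()):
        print(f"name length {length}: {len(seen)} distinct output(s)")
```

If the behaviour described above holds, each name length should show one distinct output, and the digests should differ between lengths. Diffing the lists returned by `extract_strings` for two runs would show which strings went missing.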

To whoever wants to test this: I can send you the file, but I am bound by policy to protect our copyright. If there's a way I can send you the file privately, that would be preferable.

Thanks

kermitt2 commented 3 years ago

Hello @kcstrong !

Thanks a lot for the tests and reporting the issues.

I would be happy to try to reproduce the problem with your file and investigate it. You can send the file to my private address, which you can find here (the first email listed).