kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

Randomly omitted characters #107

Open de-code opened 3 years ago

de-code commented 3 years ago

I was trying to track down why running GROBID locally produced different results compared to running it via Docker. In the end, it seems that the output of pdfalto itself can change randomly from run to run.

Take 262469v1.pdf as an example (I attached the exact version I was using). 262469v1 is from the bioRxiv 10k test dataset (please do not use it for training purposes).

262469v1 is one of the documents with spacing issues, but I would still expect pdfalto to produce the same output for it on every run.

When I run the following command more than once:

docker run --rm \
  -v $PWD/data:/data \
  lfoppiano/grobid:0.6.1 \
  "/opt/grobid/grobid-home/pdf2xml/lin-64/pdfalto" \
  -noImageInline -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 \
  "/data/pdf/262469v1.pdf" \
  "/data/pdf/lxml-docker-0.6.1-direct/262469v1.lxml"

The result is not exactly the same from one run to the next. Some characters appear to be randomly omitted.
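To check whether a given command is deterministic, something like the following sketch could help. It runs a command several times, hashes the output file after each run, and prints how many distinct hashes it saw. The function name `count_distinct_outputs` and its argument order are my own invention for illustration, not part of pdfalto or GROBID.

```shell
# Hypothetical helper (not part of pdfalto or GROBID).
# Usage: count_distinct_outputs RUNS OUTPUT_FILE COMMAND [ARGS...]
# Runs COMMAND the given number of times, hashes OUTPUT_FILE after each run,
# and prints the number of distinct hashes (1 means every run matched).
# The body is a subshell "( ... )" so the variables stay local to the function.
count_distinct_outputs() (
  runs="$1"
  output_file="$2"
  shift 2
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Run the command under test; its console output is not needed here.
    "$@" >/dev/null 2>&1
    # Hash the file the command just wrote.
    md5sum < "$output_file"
    i=$((i + 1))
  done | sort -u | wc -l | tr -d ' '
)

# Example call, using the command from this issue (commented out because it
# needs Docker and the attached PDF):
#
# count_distinct_outputs 5 "$PWD/data/pdf/lxml-docker-0.6.1-direct/262469v1.lxml" \
#   docker run --rm \
#     -v "$PWD/data:/data" \
#     lfoppiano/grobid:0.6.1 \
#     "/opt/grobid/grobid-home/pdf2xml/lin-64/pdfalto" \
#     -noImageInline -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 \
#     "/data/pdf/262469v1.pdf" \
#     "/data/pdf/lxml-docker-0.6.1-direct/262469v1.lxml"
```

A printed value greater than 1 would mean that at least two runs produced different files.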

The md5sum of the pdfalto binary is 871e22e83833f773dae2b2f5e70df8ae (Linux x64).

262469v1.pdf.gz 262469v1_0.6.1_run_1_formatted.lxml.gz 262469v1_0.6.1_run_2_formatted.lxml.gz

(I formatted the results using xmllint)
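For locating the omitted characters, a rough sketch along these lines might be easier to read than a full XML diff. It relies on ALTO storing the text of each token in a `CONTENT` attribute, extracts those values from both runs, and diffs the two lists. The function names `alto_contents` and `diff_alto_runs` are hypothetical, and the simple `grep` pattern assumes the attribute is preceded by a space and contains no literal double quote.

```shell
# Hypothetical helpers (not part of pdfalto).

# Print the CONTENT attribute values of an ALTO file, one per line.
# The leading space in the pattern avoids matching attributes such as
# SUBS_CONTENT; this is a text-level approximation, not an XML parser.
alto_contents() {
  grep -o ' CONTENT="[^"]*"' "$1" | sed 's/^ CONTENT="//; s/"$//'
}

# Diff the text tokens of two runs. Prints nothing and exits 0 when the
# tokens match; otherwise prints diff output and exits with diff's status.
# The body is a subshell "( ... )" so the variables stay local to the function.
diff_alto_runs() (
  a="$(mktemp)"
  b="$(mktemp)"
  alto_contents "$1" > "$a"
  alto_contents "$2" > "$b"
  diff "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  exit "$status"
)

# Example call on the two attached (decompressed) files:
#
# diff_alto_runs 262469v1_0.6.1_run_1_formatted.lxml 262469v1_0.6.1_run_2_formatted.lxml
```

Lines prefixed with `<` come from the first run and lines prefixed with `>` from the second, so a token that lost a character shows up as a changed pair.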

lfoppiano commented 3 years ago

I noticed the same problem, but I thought it did not occur on Linux. This seems to prove otherwise.

Maybe related: #95

kermitt2 commented 3 years ago

Thank you for the error case and #108. The error was introduced with the processing of line numbers. This is a priority for my next iteration on pdfalto.