Open de-code opened 3 years ago
I noticed the same problem, but I thought it did not occur on Linux. This report seems to prove otherwise.
Maybe related: #95
Thank you for the error case and #108 - the error was introduced with the processing of line numbers... this is a priority on my next iteration on pdfalto.
I was trying to track down why running GROBID locally produced different results compared to running it via Docker. In the end it seems that the output of `pdfalto` can change randomly.

For example, given 262469v1.pdf (I attached the exact version I was using). `262469v1` is from the biorxiv 10k test dataset (please do not use it for training purposes). `262469v1` is one of the documents with spacing issues. But I would still expect it to produce the same results.

When I run the following command:
Then the result doesn't seem to be exactly the same. There are some characters that appear to be randomly omitted.
`md5sum` for `pdfalto` is `871e22e83833f773dae2b2f5e70df8ae` (Linux x64).

Attachments: 262469v1.pdf.gz 262469v1_0.6.1_run_1_formatted.lxml.gz 262469v1_0.6.1_run_2_formatted.lxml.gz (I formatted the results using `xmllint`)
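For anyone trying to reproduce this, here is a minimal sketch of how two runs could be compared. It assumes the two outputs have already been produced and saved (for example, the attached `.lxml.gz` files). The helper names `md5_of`, `read_lines` and `diff_runs` are hypothetical and not part of pdfalto or GROBID.

```python
import difflib
import gzip
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the MD5 hex digest of a file's raw bytes (e.g. a binary)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def read_lines(path):
    """Read a text file as a list of lines, decompressing .gz files."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return f.read().splitlines()


def diff_runs(path_a, path_b):
    """Return unified-diff lines between two outputs; empty if identical."""
    return list(difflib.unified_diff(
        read_lines(path_a),
        read_lines(path_b),
        fromfile=str(path_a),
        tofile=str(path_b),
        lineterm="",
    ))
```

Calling `diff_runs("262469v1_0.6.1_run_1_formatted.lxml.gz", "262469v1_0.6.1_run_2_formatted.lxml.gz")` should then list the lines where the two runs differ. The diff is done on the decompressed text rather than on the `.gz` files themselves, because gzip headers can differ even when the content is the same.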