Closed: grantbarrett closed this issue 2 years ago.
Some follow-up. To determine where in the series of commands things were going wrong, I modified my approach in several ways. First, I had it process just one file at a time, in case the problem was a memory error. Second, I disabled OCR and ran only the file optimization, since that seems to be where the error happens. Third, I remade the PDFs from the JP2 images available at Archive.org, in case the PDF itself was somehow at fault. I also enabled verbose logging to learn more about what was happening (and because looking at a terminal that appears to be doing nothing for several hours is frustrating).
The command now looks like this:
find . -name '*.pdf' | parallel --tag -j 1 ocrmypdf -v1 --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}'
For the first file that completed, this avoided the error; however, image optimization did not improve the file, which surprised me.
PS: the verbose logging showed what is probably a typo of "image" in lines like:
Skipping JPEG2000 iamge, xref 1927
Looks like the typo is in optimize.py at line 100.
The optimizer does not try to optimize JPEG2000 images, because JPEG2000 is widely used in medical imaging but, to my knowledge, has not been adopted in many other fields. Also, if you are using whole-page color images, there won't be anything to compress with JBIG2, since there will be no monochrome images. (We do not do page color segmentation at this time, i.e., finding regions of a page or image that can be represented with a reduced colorspace. It's not an easy feature to implement and will probably need a corporate sponsor so that I can work on it full time for a few weeks. You do get better compression if you're able to work with the original PDFs.)
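If you want to confirm which images the optimizer is passing over, something like this will list them (a minimal pikepdf sketch, not the optimizer's actual code; note that page.images only sees images referenced directly by each page, not images nested inside Form XObjects):

import pikepdf

with pikepdf.open("input.pdf") as pdf:
    for pageno, page in enumerate(pdf.pages, start=1):
        for name, image in page.images.items():
            flt = image.get("/Filter")
            if flt is None:
                continue
            # /Filter may be a single name or an array of filter names
            filters = list(flt) if isinstance(flt, pikepdf.Array) else [flt]
            if pikepdf.Name("/JPXDecode") in filters:
                print(f"page {pageno}: image {name} is JPEG2000 (skipped)")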
Your new method avoids the error entirely because the optimizer has nothing to do, assuming the error does have something to do with the optimizer.
If you're able to, please run the original file with the -k argument added (-k is short for --keep-temporary-files), and then send a zip of the temporary directory. That will allow me to examine the intermediate files that were present when the error occurred.
I should note that I made an error in my original post. The PDFs I linked to were not the issue; they were fine. After further testing, I confirmed that the error occurs only with PDFs made from JP2 files via img2pdf, and even then only with some of them.
For example, I ran my commands on PDF documents of 100, 200, 300, 400, and 500 pages made from JP2 files and they were fine.
However, making a PDF out of the JP2 files of the book at https://archive.org/details/gamalnorskordbok00haeguoft/, which is 648 pages, consistently fails.
Here is that PDF. https://www.dropbox.com/s/o7q5a8u4nujzsf6/GamalNorsk-JP2.pdf?dl=0
I ran the commands with the -k switch. However, the resulting temp directory is 86 GB. Is there a particular part of it you would most like to see? I'm enclosing a directory listing and the debug log, as those may be the most helpful.
I think you are right: something is happening at the end, when optimization is supposed to occur. The "metafix.pdf" file in the temp directory (10.3 GB, or else I would attach it) is a fully formed PDF that has OCRed text where it should be and seems correct in all respects. Created right after it are an empty "images" folder and a zero-byte "optimize.opt.pdf" file, after which the process failed and nothing further was done.
If I then run an optimization command on metafix.pdf, skipping OCR:
ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}-SM.pdf'
I get the same sort of error that I opened this thread with.
I ran into this error as well. Here is my stack trace in case that helps:
File "/Users/elliotwaite/code/pdf_tools/extract_text.py", line 36, in main
ocrmypdf.ocr(
File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/api.py", line 337, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 375, in run_pipeline
exec_concurrent(context, executor)
File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 299, in exec_concurrent
pdf = post_process(pdf, context, executor)
File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 232, in post_process
pdf_out = metadata_fixup(pdf_out, context)
File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 812, in metadata_fixup
pdf.save(
File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/pikepdf/_methods.py", line 804, in save
self._save(
ValueError: integer out of range converting 3744557654 from a 8-byte signed type to a 4-byte signed type
Looks like it might be occurring in pikepdf.
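For what it's worth, a quick sanity check (just my own arithmetic, nothing from pikepdf internals) confirms that the value in the message cannot fit in a signed 32-bit integer, which matches the "8-byte signed type to a 4-byte signed type" wording:

value = 3744557654     # the offending value from the traceback above
INT32_MAX = 2**31 - 1  # largest value a 4-byte signed type can hold
print(value > INT32_MAX)  # True: the value overflows a 32-bit signed integer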
@elliotwaite Can you try the following:
qpdf input.pdf output.pdf
import pikepdf

with pikepdf.open('input.pdf') as p:
    p.save('output.pdf')
Do you get a similar "integer out of range" error from either of these?
@grantbarrett I could not reproduce on Ubuntu using
# issue836.pdf is GamalNorsk-JP2.pdf
ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf issue836.pdf issue836_.pdf
Perhaps you ran out of disk space? What version of libqpdf is installed?
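If it's easier, pikepdf can report this directly; these are standard pikepdf module attributes:

import pikepdf

print(pikepdf.__version__)           # version of pikepdf itself
print(pikepdf.__libqpdf_version__)   # version of libqpdf it was built against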
I haven't tested out using qpdf or pikepdf directly yet. The PDF is big (140.3 MB, 19,038 pages), so it takes a long time to run while eating up my CPU, and I need to use my computer for other work at the moment, but maybe I can test those out in the future.
Here's my version info:
# Name Version Build Channel
pikepdf 4.0.0 pypi_0 pypi
qpdf 10.3.2 hefd3b78_0 conda-forge
And here's my OS info:
I have 128 GB free disk space, so I don't think it was a disk space issue.
Also having this issue with macOS + Homebrew. The 64-bit integer (4403520865) that causes the overflow error is suspiciously close to the PDF file size, which is 4403741686 bytes.
Stacktrace
$ ocrmypdf --tesseract-timeout=0 --optimize 0 00334_C00004-E001.pdf 00334_C00004-E001.ocr.pdf
Scanning contents: 100%|████████████████████████| 502/502 [00:02<00:00, 203.02page/s]
Start processing 16 pages concurrently
Image processing: 100%|████████████████████████| 502.0/502.0 [09:08<00:00, 1.09s/page]
Postprocessing...
PDF/A conversion: 100%|████████████████████████| 502/502 [02:58<00:00, 2.81page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
An exception occurred while executing the pipeline
Traceback (most recent call last):
File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 375, in run_pipeline
exec_concurrent(context, executor)
File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 299, in exec_concurrent
pdf = post_process(pdf, context, executor)
File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 233, in post_process
return optimize_pdf(pdf_out, context, executor)
File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 831, in optimize_pdf
optimize(input_file, output_file, context, save_settings, executor)
File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/optimize.py", line 574, in optimize
pike.save(target_file, **save_settings)
File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/pikepdf/_methods.py", line 772, in save
self._save(
ValueError: integer out of range converting 4403520865 from a 8-byte signed type to a 4-byte signed type
Software versions
$ ocrmypdf --version
12.7.2
$ img2pdf --version
img2pdf 0.4.3
$ qpdf --version
qpdf version 10.4.0
@jbarlow83 qpdf is at 10.4.0. I have more than 190 GB of space free. For problem files, for the time being I am converting the JP2 images to JPG, then running img2pdf, then my usual ocrmypdf commands.
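In case it helps anyone else, the conversion step looks roughly like this (a sketch only: it assumes Pillow was built with OpenJPEG so it can decode .jp2, and quality=90 is just my choice):

from pathlib import Path

from PIL import Image  # requires Pillow with JPEG2000 (OpenJPEG) support

# Convert every .jp2 in the current directory to .jpg; the .jpg files then
# go through img2pdf and the usual ocrmypdf command line.
for jp2 in sorted(Path(".").glob("*.jp2")):
    with Image.open(jp2) as im:
        im.convert("RGB").save(jp2.with_suffix(".jpg"), quality=90)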
Also couldn't reproduce with: Apple Silicon, macOS 12.0.1, GamalNorsk-JP2.pdf, ocrmypdf 13.1.1, Python 3.9.9 (Homebrew), qpdf 10.4.0, img2pdf 0.4.3, same command line as previously attempted.
I'll try my old Intel Mac....
Also not reproducible on Intel Mac, macOS 10.15 Catalina, same code and file.
Closing due to inactivity and older versions.
Trying OCRmyPDF for the first time today and ran into this issue.
If it is helpful for troubleshooting, this is what I am seeing on my WSL2 Ubuntu 20.04 LTS environment:
brad@DESKTOP-QAJA04E:~/ocrmypdf$ ocrmypdf --version
9.6.0+dfsg
brad@DESKTOP-QAJA04E:~/ocrmypdf$ ocrmypdf mypdf.pdf out.pdf
ERROR - An exception occurred while executing the pipeline
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/ocrmypdf/_sync.py", line 379, in run_pipeline
pdfinfo = get_pdfinfo(
File "/usr/lib/python3/dist-packages/ocrmypdf/_pipeline.py", line 149, in get_pdfinfo
return PdfInfo(
File "/usr/lib/python3/dist-packages/ocrmypdf/pdfinfo/info.py", line 753, in __init__
self._pages, pdf = _pdf_get_all_pageinfo(
File "/usr/lib/python3/dist-packages/ocrmypdf/pdfinfo/info.py", line 613, in _pdf_get_all_pageinfo
pdf = pikepdf.open(infile) # Do not close in this function
File "/usr/lib/python3/dist-packages/pikepdf/__init__.py", line 71, in open
return Pdf.open(*args, **kwargs)
ValueError: integer out of range converting 4294967295 from a 8-byte signed type to a 4-byte signed type
brad@DESKTOP-QAJA04E:~/ocrmypdf$ qpdf mypdf.pdf out.pdf
integer out of range converting 4294967295 from a 8-byte signed type to a 4-byte signed type
brad@DESKTOP-QAJA04E:~/ocrmypdf$ qpdf --version
qpdf version 9.1.1
Run qpdf --copyright to see copyright and license information.
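Side note, in case it means anything to those who know the internals: the value in my error, 4294967295, is exactly 2**32 - 1 (0xFFFFFFFF), the maximum unsigned 32-bit value, so it looks like an unsigned 32-bit quantity is involved somewhere:

print(4294967295 == 2**32 - 1)  # True: the all-ones 32-bit pattern
print(hex(4294967295))          # 0xffffffff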
Since earlier in the thread it seemed app versions were contributing to the problem, I spun up another WSL2 environment running Ubuntu 22.x. The error goes away (this is the same PDF file).
brad@DESKTOP-QAJA04E:~/ocrmypdf$ ocrmypdf --version
13.4.0+dfsg
brad@DESKTOP-QAJA04E:~/ocrmypdf$ ocrmypdf mypdf.pdf out.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 57.56page/s]
This PDF has a fillable form. Chances are it is a pure digital document that does not need OCR.
Use the option --force-ocr to produce an image of the form and all filled form fields. The output PDF will be 'flattened' and will no longer be fillable.
Start processing 16 pages concurrently
OCR: 0%| | 0.0/19.0 [00:00<?, ?page/s]
PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
brad@DESKTOP-QAJA04E:~/ocrmypdf$ qpdf mypdf.pdf out.pdf
brad@DESKTOP-QAJA04E:~/ocrmypdf$ qpdf --version
qpdf version 10.6.3
Run qpdf --copyright to see copyright and license information.
brad@DESKTOP-QAJA04E:~/ocrmypdf$
My novice conclusion is that old software versions may be the root cause of this issue. Happy to provide additional detail if needed.
Same issue here. Traceback:
An exception occurred while executing the pipeline
Traceback (most recent call last):
File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\_sync.py", line 393, in run_pipeline
optimize_messages = exec_concurrent(context, executor)
File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\_sync.py", line 309, in exec_concurrent
pdf, messages = post_process(pdf, context, executor)
File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\_sync.py", line 242, in post_process
return optimize_pdf(pdf_out, context, executor)
File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\_pipeline.py", line 839, in optimize_pdf
output_pdf, messages = context.plugin_manager.hook.optimize_pdf(
File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_hooks.py", line 265, in __call__
return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_manager.py", line 80, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_callers.py", line 60, in _multicall
return outcome.get_result()
File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_result.py", line 60, in get_result
raise ex[1].with_traceback(ex[2])
File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_callers.py", line 39, in _multicall
res = hook_impl.function(*args)
File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\builtin_plugins\optimize.py", line 135, in optimize_pdf
result_path = optimize(input_pdf, output_pdf, context, save_settings, executor)
File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\optimize.py", line 646, in optimize
pike.save(target_file, **save_settings)
File "C:\Users\Quinn\anaconda3\lib\site-packages\pikepdf\_methods.py", line 653, in save
self._save(
ValueError: integer out of range converting 2477217292 from a 8-byte signed type to a 4-byte signed type
Also getting the same issue. Was anyone able to determine how to fix it?
For anyone still seeing this issue, it is because you have a very old version of pikepdf and qpdf. I am locking the thread because that is the answer.
For Anaconda users, please note that Anaconda's version of OCRmyPDF is 2 major releases behind, and its version of pikepdf is 3 major releases behind. Anaconda packages often lag significantly behind pretty much everything. Please consider switching to the standard Python ecosystem, which is often much more stable.
If you are able to reproduce on ocrmypdf 16+ or pikepdf 8+, open a new issue.
When processing certain files, an "integer out of range" error occurs; it happened on two files I processed today.
The command I usually run is something like the one I ran today: I typically have a directory with a number of files that require OCR using the same language(s), and the command runs through the directory, OCRing everything. The high timeout is to accommodate dense pages of text and multiple languages; large multilingual dictionaries (which I do a lot of) are slow to process.
The files I was working with can be downloaded from Archive.org here (note that Archive.org is having uptime problems as of today, Sept. 19, 17:00 UTC):
https://archive.org/download/gamalnorskordbok00haeguoft/gamalnorskordbok00haeguoft.pdf
https://archive.org/download/nynorsketymologi00torp/nynorsketymologi00torp.pdf
Expected behavior
Usually these errors do not occur and post-processing completes fine. It is not clear to me why it happens on some files and not on others. It does not seem to be related to file length: for example, the files it happened on today were 908 and 652 pages, which are by no means the upper boundary of the files I work with; I process a lot of multilingual dictionaries with high page counts, and most of them do not run into this problem.
Here is the full text of the exception leading up to the error:
System