ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

ValueError: integer out of range converting 10585497845 from a 8-byte signed type to a 4-byte signed type #836

Closed grantbarrett closed 2 years ago

grantbarrett commented 3 years ago

When processing certain files, an "integer out of range" error occurs. The error on two files I processed today:

ValueError: integer out of range converting 10585497845 from a 8-byte signed type to a 4-byte signed type
ValueError: integer out of range converting 10301405285 from a 8-byte signed type to a 4-byte signed type

The command I usually run is something like the one I ran today. I typically have a directory with a number of files that require OCR using the same language(s) and this runs through the directory OCRing everything. The high timeout is to accommodate dense pages of text and multiple languages, as large, multilingual dictionaries (which I do a lot of) are slow to process.

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf --output-type pdf --tesseract-timeout 600 --force-ocr -l nor --jbig2-lossy --optimize 3 '{}' '{}-OCR.pdf'

The files I was working with can be downloaded from Archive.org here: (note that Archive.org is having uptime problems as of today Sept. 19 17h00 UTC).

https://archive.org/download/gamalnorskordbok00haeguoft/gamalnorskordbok00haeguoft.pdf https://archive.org/download/nynorsketymologi00torp/nynorsketymologi00torp.pdf

Expected behavior

Usually these errors do not occur, and the post-processing happens fine. It is not clear to me why it happens on some files and not on others. It does not seem to be related to file length. For example, the files it happened on today were 908 pages and 652 pages (which are not, by far, the upper boundaries of the files I work with; I process a lot of multilingual dictionaries with high page-counts and most of them do not run into this problem).

Here is the full text of the exception leading up to the error:

./SomeFile.pdf  Postprocessing...
./SomeFile.pdf  An exception occurred while executing the pipeline
./SomeFile.pdf  Traceback (most recent call last):
./SomeFile.pdf    File "/usr/local/Cellar/ocrmypdf/12.5.0/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
./SomeFile.pdf      exec_concurrent(context, executor)
./SomeFile.pdf    File "/usr/local/Cellar/ocrmypdf/12.5.0/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 298, in exec_concurrent
./SomeFile.pdf      pdf = post_process(pdf, context, executor)
./SomeFile.pdf    File "/usr/local/Cellar/ocrmypdf/12.5.0/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 233, in post_process
./SomeFile.pdf      return optimize_pdf(pdf_out, context, executor)
./SomeFile.pdf    File "/usr/local/Cellar/ocrmypdf/12.5.0/libexec/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 831, in optimize_pdf
./SomeFile.pdf      optimize(input_file, output_file, context, save_settings, executor)
./SomeFile.pdf    File "/usr/local/Cellar/ocrmypdf/12.5.0/libexec/lib/python3.9/site-packages/ocrmypdf/optimize.py", line 571, in optimize
./SomeFile.pdf      pike.save(target_file, **save_settings)
./SomeFile.pdf    File "/usr/local/Cellar/ocrmypdf/12.5.0/libexec/lib/python3.9/site-packages/pikepdf/_methods.py", line 759, in save
./SomeFile.pdf      self._save(
./SomeFile.pdf  ValueError: integer out of range converting 10585497845 from a 8-byte signed type to a 4-byte signed type

grantbarrett commented 3 years ago

Some followup. To try to determine where in the series of commands it was going wrong, I modified it in several ways. One, I had it process just one file at a time, as one idea is that it might be a memory error. Two, I disabled OCR and tried only to run the file optimization, since that seems to be where the error is happening. Three, I remade the PDFs using the JP2 images available at Archive.org, considering that perhaps the PDF itself was somehow at fault. I also enabled verbose logging to learn more about what was happening (and also because looking at a terminal that appears to be doing nothing for several hours is frustrating).

The command now looks like this:

find . -name '*.pdf' | parallel --tag -j 1 ocrmypdf -v1 --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}'

For the first file that has completed, this has avoided the error; however, image optimization did not improve the file, which is a surprise.

PS: the verbose logging showed what is probably a typo of "image" in lines like:

Skipping JPEG2000 iamge, xref 1927

Looks like the typo is in optimize.py at line 100.

jbarlow83 commented 3 years ago

The optimizer does not try to optimize JPEG2000 images, because JPEG2000 images are widely used in medical imaging and to my knowledge anyway, haven't been adopted in many other fields. Also, if you are using whole page color images, there won't be anything to compress with JBIG2 since there will not be monochrome images. (We do not do page color segmentation at this time, i.e., finding regions of a page or image that can be represented with a reduced colorspace. It's not an easy feature to implement and will probably need a corporate sponsor so that I can work on it full time for a few weeks. You do get better compression if you're able to work with the original PDFs.)

Your new method avoids the error entirely because the optimizer has nothing to do, assuming the error is in fact related to the optimizer.

If you're able to, please run the original file with the -k argument added, and then send a zip of the temporary folder. That will allow me to examine the intermediate files that were present when the error occurred.

grantbarrett commented 3 years ago

I should note that I made an error in my original post. The PDFs I linked to were not the issue. They were fine. In fact, after further testing, I confirmed the error occurs exclusively with some, but not all, PDFs made from JP2 files via img2pdf.

For example, I ran my commands on PDF documents of 100, 200, 300, 400, and 500 pages made from JP2 files and they were fine.

However, making a PDF out of the JP2 files of the book at https://archive.org/details/gamalnorskordbok00haeguoft/, which is 648 pages, consistently fails.

Here is that PDF. https://www.dropbox.com/s/o7q5a8u4nujzsf6/GamalNorsk-JP2.pdf?dl=0

I ran the commands with the -k switch. However, the resulting temp directory is 86GB. Is there a particular part of it you would most like to see? I'm enclosing a directory listing and the debug log as they may be most helpful.

I think you are right: there's something happening at the end when optimization is supposed to occur. The "metafix.pdf" file (10.3 GB, or else I would attach it) in the temp directory, is a fully formed PDF file that has OCRed text where it should be and seems to be correct in all respects. Created right after that is an empty "images" folder and a zero bytes "optimize.opt.pdf" file, after which the process failed and nothing further was done.

If I then run an optimization command on metafix.pdf, skipping OCR:

ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf '{}' '{}-SM.pdf'

I get the same sort of error that I opened this thread with.

ocrmypdf.io.bp1q4vwp-dir-listing.txt debug.log

elliotwaite commented 2 years ago

I ran into this error as well. Here is my stack trace in case that helps:

  File "/Users/elliotwaite/code/pdf_tools/extract_text.py", line 36, in main
    ocrmypdf.ocr(
  File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/api.py", line 337, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 375, in run_pipeline
    exec_concurrent(context, executor)
  File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 299, in exec_concurrent
    pdf = post_process(pdf, context, executor)
  File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 232, in post_process
    pdf_out = metadata_fixup(pdf_out, context)
  File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 812, in metadata_fixup
    pdf.save(
  File "/Users/elliotwaite/mambaforge/lib/python3.9/site-packages/pikepdf/_methods.py", line 804, in save
    self._save(
ValueError: integer out of range converting 3744557654 from a 8-byte signed type to a 4-byte signed type

Looks like it might be occurring in pikepdf.

jbarlow83 commented 2 years ago

@elliotwaite Can you try the following:

  1. Using qpdf to rewrite your file:
    qpdf input.pdf output.pdf
  2. Using pikepdf to open and save your file:
    import pikepdf
    with pikepdf.open('input.pdf') as p:
        p.save('output.pdf')

Do you get a similar "integer out of range" error from either of these?
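
The two checks above can be combined into one script. This is only a rough sketch: the helper names (`mentions_overflow`, `check_qpdf`, `check_pikepdf`) are made up for illustration, it assumes the `qpdf` command is on the PATH and that pikepdf is installed, and it only looks for the message text quoted in this thread.

```python
import subprocess

# Message text quoted in the tracebacks in this thread.
OVERFLOW_MARKER = "integer out of range"


def mentions_overflow(message: str) -> bool:
    """Return True if the text contains the overflow message."""
    return OVERFLOW_MARKER in message


def check_qpdf(src: str, dst: str) -> bool:
    """Rewrite src with the qpdf CLI; True if the overflow message is printed."""
    result = subprocess.run(["qpdf", src, dst], capture_output=True, text=True)
    return mentions_overflow(result.stdout + result.stderr)


def check_pikepdf(src: str, dst: str) -> bool:
    """Open and save src with pikepdf; True if the overflow ValueError is raised."""
    import pikepdf  # imported lazily so the helpers above work without it

    try:
        with pikepdf.open(src) as pdf:
            pdf.save(dst)
    except ValueError as exc:
        return mentions_overflow(str(exc))
    return False
```

Calling `check_qpdf('input.pdf', 'output-qpdf.pdf')` and `check_pikepdf('input.pdf', 'output-pikepdf.pdf')` gives one boolean per tool, which shows whether the error appears in qpdf alone or only when going through pikepdf.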

jbarlow83 commented 2 years ago

@grantbarrett I could not reproduce on Ubuntu using

# issue836.pdf is GamalNorsk-JP2.pdf
ocrmypdf --tesseract-timeout=0 --skip-text --jbig2-lossy --optimize 3 --output-type pdf issue836.pdf issue836_.pdf

Perhaps you ran out of disk space? What version of libqpdf is installed?

elliotwaite commented 2 years ago

I haven't tested out using qpdf or pikepdf directly yet. The PDF is big (140.3 MB, 19,038 pages), so it takes a long time to run while eating up my CPU, and I need to use my computer for other work at the moment, but maybe I can test those out in the future.

Here's my version info:

# Name                    Version                   Build    Channel
pikepdf                   4.0.0                    pypi_0    pypi
qpdf                      10.3.2               hefd3b78_0    conda-forge

And here's my OS info:

[Screenshot: Screen Shot 2021-11-20 at 4 20 48 PM]

I have 128 GB free disk space, so I don't think it was a disk space issue.

vramirez122000 commented 2 years ago

Also having this issue with macOS + Homebrew. The 64-bit integer (4403520865) that causes the overflow error is suspiciously similar to the PDF file size, which is 4403741686 bytes.
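
A quick arithmetic check of that observation, as a sketch only. The two numbers are the ones quoted above; the helper `fits_int32` is made up for illustration. Reading the value as a byte offset near the end of the file is a guess and is not confirmed anywhere in this thread.

```python
INT32_MAX = 2**31 - 1  # largest value a 4-byte signed integer can hold


def fits_int32(value: int) -> bool:
    """Return True if value is representable as a 4-byte signed integer."""
    return -(2**31) <= value <= INT32_MAX


reported = 4403520865   # value from the ValueError message
file_size = 4403741686  # size of the PDF in bytes

print(fits_int32(reported))   # False: the value exceeds 2147483647
print(file_size - reported)   # 220821: how far the value falls short of the file size
```

Any file larger than 2147483647 bytes (about 2 GiB) can contain positions that do not fit in a 4-byte signed integer, which fits the large files reported in this thread.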

Stacktrace

$ ocrmypdf --tesseract-timeout=0 --optimize 0 00334_C00004-E001.pdf 00334_C00004-E001.ocr.pdf
Scanning contents: 100%|████████████████████████| 502/502 [00:02<00:00, 203.02page/s]
Start processing 16 pages concurrently
Image processing: 100%|████████████████████████| 502.0/502.0 [09:08<00:00,  1.09s/page]
Postprocessing...
PDF/A conversion: 100%|████████████████████████| 502/502 [02:58<00:00,  2.81page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 375, in run_pipeline
    exec_concurrent(context, executor)
  File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 299, in exec_concurrent
    pdf = post_process(pdf, context, executor)
  File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 233, in post_process
    return optimize_pdf(pdf_out, context, executor)
  File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 831, in optimize_pdf
    optimize(input_file, output_file, context, save_settings, executor)
  File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/ocrmypdf/optimize.py", line 574, in optimize
    pike.save(target_file, **save_settings)
  File "/usr/local/Cellar/ocrmypdf/12.7.2/libexec/lib/python3.9/site-packages/pikepdf/_methods.py", line 772, in save
    self._save(
ValueError: integer out of range converting 4403520865 from a 8-byte signed type to a 4-byte signed type

Software versions

$ ocrmypdf --version
12.7.2
$ img2pdf --version
img2pdf 0.4.3
$ qpdf --version
qpdf version 10.4.0

grantbarrett commented 2 years ago

@jbarlow83 qpdf is at 10.4.0. I have more than 190GB of space free. For problem files for the time being, I am now converting JP2 images to JPG and then running img2pdf and then my usual ocrmypdf commands.

jbarlow83 commented 2 years ago

Also couldn't reproduce with: Apple Silicon, macOS 12.0.1, GamalNorsk-JP2.pdf, ocrmypdf 13.1.1, Python 3.9.9 Homebrew, qpdf 10.4.0, img2pdf 0.4.3, same command line as previously attempted.

I'll try my old Intel Mac....

jbarlow83 commented 2 years ago

Also not reproducible on Intel Mac, macOS 10.15 Catalina, same code and file.

jbarlow83 commented 2 years ago

Closing due to inactivity and older versions.

bradthurber commented 2 years ago

Trying OCRmyPDF for the first time today and ran into this issue.

If it is helpful for troubleshooting, this is what I am seeing on my WSL2 Ubuntu 20.04 LTS environment

brad@DESKTOP-QAJA04E:~/ocrmypdf$ ocrmypdf --version
9.6.0+dfsg
brad@DESKTOP-QAJA04E:~/ocrmypdf$ ocrmypdf mypdf.pdf out.pdf
  ERROR - An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/ocrmypdf/_sync.py", line 379, in run_pipeline
    pdfinfo = get_pdfinfo(
  File "/usr/lib/python3/dist-packages/ocrmypdf/_pipeline.py", line 149, in get_pdfinfo
    return PdfInfo(
  File "/usr/lib/python3/dist-packages/ocrmypdf/pdfinfo/info.py", line 753, in __init__
    self._pages, pdf = _pdf_get_all_pageinfo(
  File "/usr/lib/python3/dist-packages/ocrmypdf/pdfinfo/info.py", line 613, in _pdf_get_all_pageinfo
    pdf = pikepdf.open(infile)  # Do not close in this function
  File "/usr/lib/python3/dist-packages/pikepdf/__init__.py", line 71, in open
    return Pdf.open(*args, **kwargs)
ValueError: integer out of range converting 4294967295 from a 8-byte signed type to a 4-byte signed type
brad@DESKTOP-QAJA04E:~/ocrmypdf$ qpdf mypdf.pdf out.pdf
integer out of range converting 4294967295 from a 8-byte signed type to a 4-byte signed type
brad@DESKTOP-QAJA04E:~/ocrmypdf$ qpdf --version
qpdf version 9.1.1
Run qpdf --copyright to see copyright and license information.

Since it seemed earlier in the thread that app versions were contributing to the problem, I spun up another WSL2 environment running Ubuntu 22.x. The error goes away (this is the same PDF file).

brad@DESKTOP-QAJA04E:~/ocrmypdf$ ocrmypdf --version
13.4.0+dfsg
brad@DESKTOP-QAJA04E:~/ocrmypdf$ ocrmypdf mypdf.pdf out.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 57.56page/s]
This PDF has a fillable form. Chances are it is a pure digital document that does not need OCR.
Use the option --force-ocr to produce an image of the form and all filled form fields. The output PDF will be 'flattened' and will no longer be fillable.
Start processing 16 pages concurrently
OCR:   0%|                                                                                 | 0.0/19.0 [00:00<?, ?page/s]
PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr
brad@DESKTOP-QAJA04E:~/ocrmypdf$ qpdf mypdf.pdf out.pdf
brad@DESKTOP-QAJA04E:~/ocrmypdf$ qpdf --version
qpdf version 10.6.3
Run qpdf --copyright to see copyright and license information.
brad@DESKTOP-QAJA04E:~/ocrmypdf$

My novice conclusion is that old software versions may be the root cause of this issue. Happy to provide additional detail if needed.

quinn-p-mchugh commented 1 year ago

Same issue here. Traceback:

An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\_sync.py", line 393, in run_pipeline
    optimize_messages = exec_concurrent(context, executor)
  File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\_sync.py", line 309, in exec_concurrent
    pdf, messages = post_process(pdf, context, executor)
  File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\_sync.py", line 242, in post_process
    return optimize_pdf(pdf_out, context, executor)
  File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\_pipeline.py", line 839, in optimize_pdf
    output_pdf, messages = context.plugin_manager.hook.optimize_pdf(
  File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_callers.py", line 60, in _multicall
    return outcome.get_result()
  File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_result.py", line 60, in get_result
    raise ex[1].with_traceback(ex[2])
  File "C:\Users\Quinn\anaconda3\lib\site-packages\pluggy\_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\builtin_plugins\optimize.py", line 135, in optimize_pdf
    result_path = optimize(input_pdf, output_pdf, context, save_settings, executor)
  File "C:\Users\Quinn\anaconda3\lib\site-packages\ocrmypdf\optimize.py", line 646, in optimize
    pike.save(target_file, **save_settings)
  File "C:\Users\Quinn\anaconda3\lib\site-packages\pikepdf\_methods.py", line 653, in save
    self._save(
ValueError: integer out of range converting 2477217292 from a 8-byte signed type to a 4-byte signed type

themantalope commented 7 months ago

Also getting the same issue. Was anyone able to determine how to fix it?

jbarlow83 commented 7 months ago

For anyone still seeing this issue, it is because you have a very old version of pikepdf and qpdf. I am locking the thread because that is the answer.

For Anaconda users, please note that Anaconda's version of OCRmyPDF is 2 major releases behind and its version of pikepdf is 3 major releases behind. Anaconda packages often lag significantly behind pretty much everything. Please consider switching to the standard Python ecosystem, which is often much more stable.

If you are able to reproduce on ocrmypdf 16+ or pikepdf 8+, open a new issue.
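
Before opening a new issue, it may help to confirm which versions are actually installed in the Python environment being used. The following is a minimal sketch using only the standard library. The thresholds come from the comment above; the function names are made up for illustration, and the parsing assumes plain version strings such as `13.4.0+dfsg`.

```python
from importlib import metadata

# Thresholds quoted from the maintainer's final comment.
MINIMUM_MAJOR = {"ocrmypdf": 16, "pikepdf": 8}


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '13.4.0+dfsg'."""
    return int(version.split(".")[0])


def installed_major(package: str):
    """Major version of an installed package, or None if it is not installed."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def report() -> dict:
    """Map each package to (installed major or None, meets the threshold)."""
    result = {}
    for package, minimum in MINIMUM_MAJOR.items():
        major = installed_major(package)
        result[package] = (major, major is not None and major >= minimum)
    return result


print(report())
```

This only inspects the Python packages. The qpdf library that pikepdf is linked against is not checked here.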