ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: NotImplementedError: not sure how to get colorspace #1315

Open macdeport opened 1 month ago

macdeport commented 1 month ago

Describe the bug

Rare error on an Adobe InDesign 18.0 file (Macintosh)

Steps to reproduce

$ ocrmypdf -v1 --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt bid.pdf bid_.pdf

Files

bid.pdf

How did you download and install the software?

MacPorts

OCRmyPDF version

ocrmypdf 16.2.0

Relevant log output

ocrmypdf 16.2.0
Running: ['tesseract', '--version']
Found tesseract 5.3.3
Running: ['tesseract', '--version']
Running: ['pngquant', '--version']
Found pngquant 3.0.3
Running: ['jbig2', '--version']
Found jbig2 0.28
Running: ['gs', '--version']
Found gs 10.3.0
Running: ['gs', '--version']
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (4):
deu
eng
fra
osd

pikepdf mmap enabled
os.symlink(bid.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/origin)
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/origin, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/origin.pdf)
Gathering info with 1 thread workers
pikepdf mmap enabled

Using Tesseract OpenMP thread limit 1
Start processing 12 pages concurrently
pikepdf mmap enabled
pikepdf mmap enabled
pikepdf mmap enabled
    1 skipping all processing on this page
pikepdf mmap enabled
pikepdf mmap enabled
    2 skipping all processing on this page
pikepdf mmap enabled
pikepdf mmap enabled
    3 skipping all processing on this page
pikepdf mmap enabled
pikepdf mmap enabled
    4 skipping all processing on this page
pikepdf mmap enabled
pikepdf mmap enabled
    5 skipping all processing on this page
pikepdf mmap enabled
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    6 skipping all processing on this page
    7 skipping all processing on this page
    8 skipping all processing on this page
    9 skipping all processing on this page
   10 skipping all processing on this page
   11 skipping all processing on this page
   12 skipping all processing on this page
   13 skipping all processing on this page
   14 skipping all processing on this page
   15 skipping all processing on this page
   16 skipping all processing on this page
   17 skipping all processing on this page
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
   18 skipping all processing on this page
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0
    3 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    3 Page rotation: (content, auto) -> page = (0, 0) -> 0
    4 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    4 Page rotation: (content, auto) -> page = (0, 0) -> 0
    5 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    5 Page rotation: (content, auto) -> page = (0, 0) -> 0
    6 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    6 Page rotation: (content, auto) -> page = (0, 0) -> 0
    7 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    7 Page rotation: (content, auto) -> page = (0, 0) -> 0
    8 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    8 Page rotation: (content, auto) -> page = (0, 0) -> 0
    9 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    9 Page rotation: (content, auto) -> page = (0, 0) -> 0
   10 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   10 Page rotation: (content, auto) -> page = (0, 0) -> 0
   11 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   11 Page rotation: (content, auto) -> page = (0, 0) -> 0
   12 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   12 Page rotation: (content, auto) -> page = (0, 0) -> 0
   13 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   13 Page rotation: (content, auto) -> page = (0, 0) -> 0
   14 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   14 Page rotation: (content, auto) -> page = (0, 0) -> 0
   15 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   15 Page rotation: (content, auto) -> page = (0, 0) -> 0
   16 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   16 Page rotation: (content, auto) -> page = (0, 0) -> 0
   17 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   17 Page rotation: (content, auto) -> page = (0, 0) -> 0
   18 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   18 Page rotation: (content, auto) -> page = (0, 0) -> 0

/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/sidecar.txt -> bid.txt
Postprocessing...
Running: ['tesseract', '--version']
xref 200: treating as an optimization candidate
xref 199: treating as an optimization candidate
xref 197: treating as an optimization candidate
xref 198: treating as an optimization candidate
xref 204: treating as an optimization candidate
xref 214: treating as an optimization candidate
xref 218: treating as an optimization candidate
xref 211: treating as an optimization candidate
xref 213: treating as an optimization candidate
xref 215: treating as an optimization candidate
xref 221: treating as an optimization candidate
xref 207: treating as an optimization candidate
xref 206: treating as an optimization candidate
xref 209: treating as an optimization candidate
xref 210: treating as an optimization candidate
xref 217: treating as an optimization candidate
xref 208: treating as an optimization candidate
xref 219: treating as an optimization candidate
xref 220: treating as an optimization candidate
xref 223: treating as an optimization candidate
xref 222: treating as an optimization candidate
xref 212: treating as an optimization candidate
xref 216: treating as an optimization candidate
xref 197: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
XrefExt(xref=197, ext='.jpg')
xref 199: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
XrefExt(xref=199, ext='.jpg')
xref 200: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
XrefExt(xref=200, ext='.jpg')
xref 204: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 204: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 201, in extract_image_generic
    ext = pim.extract_to(stream=f)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 709, in extract_to
    return self._extract_to_stream(stream=stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 655, in _extract_to_stream
    im = self._extract_transcoded()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 610, in _extract_transcoded
    raise HifiPrintImageNotTranscodableError()
pikepdf.models.image.HifiPrintImageNotTranscodableError
xref 213: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 215, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 210, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceCMYK', <pikepdf.Stream(owner=<...>, data=b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'..., {
  "/BitsPerSample": 8,
  "/Decode": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Domain": [ 0, 1 ],
  "/Encode": [ 0, 254 ],
  "/Filter": "/FlateDecode",
  "/FunctionType": 0,
  "/Length": 395,
  "/Order": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Size": [ 255 ]
})>]
xref 216: skipping image with small stream size
xref 217: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 215, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 210, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceCMYK', <pikepdf.Stream(owner=<...>, data=b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'..., {
  "/BitsPerSample": 8,
  "/Decode": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Domain": [ 0, 1 ],
  "/Encode": [ 0, 254 ],
  "/Filter": "/FlateDecode",
  "/FunctionType": 0,
  "/Length": 395,
  "/Order": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Size": [ 255 ]
})>]
xref 219: skipping image with small stream size
xref 220: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 220: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 201, in extract_image_generic
    ext = pim.extract_to(stream=f)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 709, in extract_to
    return self._extract_to_stream(stream=stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 655, in _extract_to_stream
    im = self._extract_transcoded()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 610, in _extract_transcoded
    raise HifiPrintImageNotTranscodableError()
pikepdf.models.image.HifiPrintImageNotTranscodableError
xref 221: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 221: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 201, in extract_image_generic
    ext = pim.extract_to(stream=f)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 709, in extract_to
    return self._extract_to_stream(stream=stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 655, in _extract_to_stream
    im = self._extract_transcoded()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 610, in _extract_transcoded
    raise HifiPrintImageNotTranscodableError()
pikepdf.models.image.HifiPrintImageNotTranscodableError
xref 222: skipping image with small stream size
xref 223: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/optimize.py", line 215, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pikepdf/models/image.py", line 210, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceCMYK', <pikepdf.Stream(owner=<...>, data=b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'..., {
  "/BitsPerSample": 8,
  "/Decode": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Domain": [ 0, 1 ],
  "/Encode": [ 0, 254 ],
  "/Filter": "/FlateDecode",
  "/FunctionType": 0,
  "/Length": 395,
  "/Order": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1, 0, 1 ],
  "/Size": [ 255 ]
})>]
Optimizable images: JPEGs: 3 PNGs: 0

xref 200: treating as an optimization candidate
xref 199: treating as an optimization candidate
xref 197: treating as an optimization candidate
xref 198: treating as an optimization candidate
xref 204: treating as an optimization candidate
xref 214: treating as an optimization candidate
xref 218: treating as an optimization candidate
xref 211: treating as an optimization candidate
xref 213: treating as an optimization candidate
xref 215: treating as an optimization candidate
xref 221: treating as an optimization candidate
xref 207: treating as an optimization candidate
xref 206: treating as an optimization candidate
xref 209: treating as an optimization candidate
xref 210: treating as an optimization candidate
xref 217: treating as an optimization candidate
xref 208: treating as an optimization candidate
xref 219: treating as an optimization candidate
xref 220: treating as an optimization candidate
xref 223: treating as an optimization candidate
xref 222: treating as an optimization candidate
xref 212: treating as an optimization candidate
xref 216: treating as an optimization candidate
xref 197: marking this JPEG as deflatable
xref 199: marking this JPEG as deflatable
xref 200: marking this JPEG as deflatable
xref 204: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 204: marking this JPEG as deflatable
xref 216: skipping image with small stream size
xref 219: skipping image with small stream size
xref 220: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 220: marking this JPEG as deflatable
xref 221: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 221: marking this JPEG as deflatable
xref 222: skipping image with small stream size

xref 200: treating as an optimization candidate
xref 199: treating as an optimization candidate
xref 197: treating as an optimization candidate
xref 198: treating as an optimization candidate
xref 204: treating as an optimization candidate
xref 214: treating as an optimization candidate
xref 218: treating as an optimization candidate
xref 211: treating as an optimization candidate
xref 213: treating as an optimization candidate
xref 215: treating as an optimization candidate
xref 221: treating as an optimization candidate
xref 207: treating as an optimization candidate
xref 206: treating as an optimization candidate
xref 209: treating as an optimization candidate
xref 210: treating as an optimization candidate
xref 217: treating as an optimization candidate
xref 208: treating as an optimization candidate
xref 219: treating as an optimization candidate
xref 220: treating as an optimization candidate
xref 223: treating as an optimization candidate
xref 222: treating as an optimization candidate
xref 212: treating as an optimization candidate
xref 216: treating as an optimization candidate
xref 197: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 199: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 200: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 204: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 216: skipping image with small stream size
xref 219: skipping image with small stream size
xref 220: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 221: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
xref 222: skipping image with small stream size
Optimizable images: JBIG2 groups: 0

Image optimization did not improve the file - optimizations will not be used
Running: ['jbig2', '--version']
Running: ['pngquant', '--version']
Image optimization ratio: 1.00 savings: 0.0%
Total file size ratio: 1.05 savings: 4.9%
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.w6jubuga/optimize.pdf -> bid_.pdf
Corrupt JPEG data: 1 extraneous bytes before marker 0xd9

jbarlow83 commented 1 week ago

Most of these errors are harmless. They mainly say that a particular image cannot be optimized because it is defined in terms of production printing (e.g. CMYK+) rather than RGB. Of course, it would be cleaner to log this fact instead of logging an exception. I will have to make that change.
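
To illustrate the kind of change described above, here is a minimal sketch of logging a one-line note instead of a full traceback. It is not OCRmyPDF's actual code: `extract_quietly` is a hypothetical helper, and the exception class below is a local stand-in for `pikepdf.models.image.HifiPrintImageNotTranscodableError` so that the example runs without pikepdf installed.

```python
import logging

log = logging.getLogger("optimize_sketch")


class HifiPrintImageNotTranscodableError(Exception):
    """Local stand-in for the pikepdf exception of the same name (assumption)."""


# The two error types seen in the log above: both mean "this image is
# print-oriented and cannot be extracted for optimization".
EXPECTED_ERRORS = (NotImplementedError, HifiPrintImageNotTranscodableError)


def extract_quietly(xref, extract_fn):
    """Call extract_fn() and return its result.

    If it raises one of the expected "cannot optimize" errors, log a single
    informational line and return None. Any other exception propagates.
    """
    try:
        return extract_fn()
    except EXPECTED_ERRORS as e:
        log.info(
            "xref %s: skipping image that cannot be optimized (%s)",
            xref,
            type(e).__name__,
        )
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def fails():
        raise NotImplementedError("not sure how to get colorspace")

    extract_quietly(213, fails)          # logs one line, no traceback
    extract_quietly(197, lambda: ".jpg")  # returns the extractor's result
```

Catching only the two expected error types keeps genuinely unexpected failures visible, which is presumably why the current code logs the full exception.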

The error message at the end, "Corrupt JPEG data: 1 extraneous bytes before marker 0xd9", suggests that there is some corruption in the PDF. I'd check it with a viewer to ensure all images look fine visually.

user1823 commented 1 week ago

I also got a similar error (actually, the same error thousands of times in the same PDF):

xref 12157: While extracting this image, an error occurred                                               optimize.py:327
Traceback (most recent call last):
  File "C:\Program Files\Python312\Lib\site-packages\ocrmypdf\optimize.py", line 323, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\ocrmypdf\optimize.py", line 215, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pikepdf\models\image.py", line 210, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceCMYK', pikepdf.Dictionary({
  "/C0": [ 0, 0, 0, 0 ],
  "/C1": [ 0, 0, 0, 1 ],
  "/Domain": [ 0, 1 ],
  "/FunctionType": 2,
  "/N": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1, 0, 1 ]
})]

Glad to hear that it is harmless. Hoping for a change to make this less scary.
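
For anyone puzzling over the array in the message: in PDF, a Separation colorspace is written as a four-element array of family, colorant (ink) name, alternate colorspace, and tint transform function. The sketch below is a hypothetical helper, not part of pikepdf or OCRmyPDF. It uses plain Python strings and lists in place of pikepdf objects and turns such an array into a readable description.

```python
def describe_colorspace(cs):
    """Return a short description of a PDF colorspace value.

    `cs` is either a name string such as '/DeviceRGB', or a list such as
    ['/Separation', '/Black', '/DeviceCMYK', tint_transform].
    Plain Python values are assumed here, not pikepdf objects.
    """
    if isinstance(cs, str):
        return cs.lstrip("/")
    if isinstance(cs, (list, tuple)) and cs:
        family = cs[0]
        if family == "/Separation" and len(cs) == 4:
            # [/Separation, colorant name, alternate space, tint transform]
            _, colorant, alternate, _tint_transform = cs
            return (
                f"Separation ink {colorant.lstrip('/')!r}, "
                f"alternate {describe_colorspace(alternate)}"
            )
        # Other array-based families (ICCBased, Indexed, DeviceN, ...) are
        # only named here, not decoded.
        return str(family).lstrip("/")
    return "unknown"


if __name__ == "__main__":
    # The array from the error message, with the tint transform replaced
    # by a plain dict placeholder.
    print(describe_colorspace(["/Separation", "/Black", "/DeviceCMYK", {}]))
```

Read this way, the message describes an image painted with a single "Black" ink whose fallback is DeviceCMYK, which matches the explanation that it is a print-oriented image rather than an RGB one.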