madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

'JPEG2000' images are supported by PIL and Tesseract-OCR, but not pytesseract #409

Closed: caerulescens closed this issue 2 years ago

caerulescens commented 2 years ago

Problem

Loading a JPEG2000 image with Pillow and passing it to pytesseract fails:

import io
import pytesseract
from PIL import Image

with open('example.jp2', 'rb') as f:
    image_bytes = f.read()
buffer = io.BytesIO(image_bytes)
image = Image.open(buffer)
result = pytesseract.image_to_pdf_or_hocr(image=image, extension="hocr")

The call raises TypeError: Unsupported image format/type, because JPEG2000 is not in the SUPPORTED_FORMATS set in pytesseract:

SUPPORTED_FORMATS = {
    'JPEG',
    'PNG',
    'PBM',
    'PGM',
    'PPM',
    'TIFF',
    'BMP',
    'GIF',
    'WEBP',
}

Tesseract and PIL both support JPEG2000 images, so pytesseract should accept every format that both of them can handle.
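The check that rejects the image can be sketched as follows. This is a simplified, standalone illustration and not pytesseract's actual code: the helper name check_format is hypothetical, its image_format argument stands in for PIL's image.format, and the PNG fallback for images without a format is an assumption about how in-memory images are treated.

```python
from typing import Optional

# Copy of the set quoted above.
SUPPORTED_FORMATS = {
    'JPEG', 'PNG', 'PBM', 'PGM', 'PPM', 'TIFF', 'BMP', 'GIF', 'WEBP',
}


def check_format(image_format: Optional[str], supported=SUPPORTED_FORMATS) -> str:
    """Return the format name to use, or raise TypeError if it is unsupported.

    Hypothetical helper: image_format stands in for PIL's image.format.
    """
    # Assumption: images with no format (created in memory) fall back to PNG.
    extension = image_format or 'PNG'
    if extension not in supported:
        raise TypeError('Unsupported image format/type')
    return extension
```

With the original set, check_format('JPEG2000') raises the TypeError from this report. Passing SUPPORTED_FORMATS | {'JPEG2000'} as the second argument lets the same call through.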

Creating a JPEG2000 image

I tried to attach a JPEG2000 image, but GitHub does not accept that file type, so I've attached a PNG instead, along with the code to convert it to JPEG2000.

import io
from PIL import Image

with open('example.png', 'rb') as f:
    image_data = f.read()
buffer = io.BytesIO(image_data)
image = Image.open(buffer)
image.save("example.jp2", "JPEG2000")

[Image attachment: example]

Solution

Adding JPEG2000 to SUPPORTED_FORMATS fixes the issue, and the call returns the expected OCR results. This works because Pillow reports JPEG2000 as image.format for these files, so the image passes the format check when pytesseract prepares it. The resulting hOCR output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 4.0.0' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf ocrp_lang ocrp_dir ocrp_font ocrp_fsize'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "/tmp/tess_gjz2_4dc.JPEG2000"; bbox 0 0 2000 153; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 26 30 1954 121">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 26 30 1954 121">
     <span class='ocr_line' id='line_1_1' title="bbox 26 30 1954 121; baseline 0.001 -21; x_size 90; x_descenders 20; x_ascenders 22">
      <span class='ocrx_word' id='word_1_1' title='bbox 26 31 190 101; x_wconf 96; x_fsize 90'>The</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 221 31 456 121; x_wconf 96; x_fsize 90'>quick</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 482 31 755 101; x_wconf 96; x_fsize 90'>brown</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 783 30 915 101; x_wconf 96; x_fsize 90'>fox</span>
      <span class='ocrx_word' id='word_1_5' title='bbox 939 31 1164 121; x_wconf 96; x_fsize 90'>jumps</span>
      <span class='ocrx_word' id='word_1_6' title='bbox 1171 53 1418 101; x_wconf 96; x_fsize 90'>over</span>
      <span class='ocrx_word' id='word_1_7' title='bbox 1446 31 1575 101; x_wconf 96; x_fsize 90'>the</span>
      <span class='ocrx_word' id='word_1_8' title='bbox 1605 31 1773 121; x_wconf 94; x_fsize 90'>lazy</span>
      <span class='ocrx_word' id='word_1_9' title='bbox 1804 31 1954 121; x_wconf 96; x_fsize 90'>dog</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
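The fix relies on the value Pillow stores in image.format. A minimal sketch of that behavior, assuming only that Pillow is installed: the helper name format_after_roundtrip is hypothetical, and PNG is used as the default so the example runs on Pillow builds without OpenJPEG support.

```python
import io

from PIL import Image


def format_after_roundtrip(fmt: str = 'PNG'):
    """Save a tiny image in memory, reopen it, and report its format attributes.

    Returns (format of the reopened image, format of a copy of it).
    """
    buffer = io.BytesIO()
    Image.new('RGB', (4, 4), 'white').save(buffer, fmt)
    buffer.seek(0)
    reopened = Image.open(buffer)
    reopened.load()  # force decoding while the buffer is still available
    # Images opened from a file carry the file's format name;
    # images produced by Pillow methods such as copy() have format None.
    return reopened.format, reopened.copy().format
```

On a Pillow build with OpenJPEG support, format_after_roundtrip('JPEG2000') should report 'JPEG2000' for the reopened image, which is the value compared against SUPPORTED_FORMATS.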
caerulescens commented 2 years ago

@madmaze I went ahead and added the fix and a test to a pull request. Seems to work as expected.

caerulescens commented 2 years ago

@int3l See above; I'd appreciate your feedback.