mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

image file is truncated (21 bytes not processed) #655

Closed gabays closed 2 weeks ago

gabays commented 3 weeks ago

Hello,

Some images caused problems while compiling a dataset, and the script crashed:

Extracting lines ━━━━━━━━━━━━━━━━              58% 374794/647835 0:31:07 0:20:44
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/ebsofts/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/kraken/lib/arrow_dataset.py", line 111, in _extract_line
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/kraken/lib/arrow_dataset.py", line 85, in _extract_line
    if is_bitonal(im):
       ^^^^^^^^^^^^^^
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/kraken/lib/util.py", line 57, in is_bitonal
    return im.getcolors(2) is not None and len(im.getcolors(2)) == 2
           ^^^^^^^^^^^^^^^
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/PIL/Image.py", line 1438, in getcolors
    self.load()
  File "/home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages/PIL/ImageFile.py", line 297, in load
    raise OSError(msg)
OSError: image file is truncated (16 bytes not processed)
"""

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/users/g/gabays/build_modern_v2/kraken-env/bin/ketos:8 in <module>      │
│                                                                              │
│   5 from kraken.ketos import cli                                             │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│ ❱ 8 │   sys.exit(cli())                                                      │
│   9                                                                          │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:1157 in __call__                                              │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:1078 in main                                                  │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:1688 in invoke                                                │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:1434 in invoke                                                │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/core.py:783 in invoke                                                 │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /click/decorators.py:33 in new_func                                          │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /kraken/ketos/dataset.py:92 in compile                                       │
│                                                                              │
│    89 │   │   │   │   progress.start_task(extract_task)                      │
│    90 │   │   │   progress.update(extract_task, total=total, advance=advance │
│    91 │   │                                                                  │
│ ❱  92 │   │   arrow_dataset.build_binary_dataset(ground_truth,               │
│    93 │   │   │   │   │   │   │   │   │   │      output,                     │
│    94 │   │   │   │   │   │   │   │   │   │      format_type,                │
│    95 │   │   │   │   │   │   │   │   │   │      workers,                    │
│                                                                              │
│ /home/users/g/gabays/build_modern_v2/kraken-env/lib/python3.11/site-packages │
│ /kraken/lib/arrow_dataset.py:299 in build_binary_dataset                     │
│                                                                              │
│   296 │   │   │   │   if num_workers and num_workers > 1:                    │
│   297 │   │   │   │   │   logger.info(f'Spinning up processing pool with {nu │
│   298 │   │   │   │   │   with Pool(num_workers) as pool:                    │
│ ❱ 299 │   │   │   │   │   │   for page_lines, im_mode in pool.imap_unordered │
│   300 │   │   │   │   │   │   │   if page_lines:                             │
│   301 │   │   │   │   │   │   │   │   line_cache.extend(page_lines)          │
│   302 │   │   │   │   │   │   │   │   # comparison RGB(A) > L > 1            │
│                                                                              │
│ /opt/ebsofts/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/multiprocessing/poo │
│ l.py:873 in next                                                             │
│                                                                              │
│   870 │   │   success, value = item                                          │
│   871 │   │   if success:                                                    │
│   872 │   │   │   return value                                               │
│ ❱ 873 │   │   raise value                                                    │
│   874 │                                                                      │
│   875 │   __next__ = next                    # XXX                           │
│   876                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
OSError: image file is truncated (16 bytes not processed)
srun: error: cpu132: task 0: Exited with exit code 1

I had to add a try/except to _extract_line() in arrow_dataset.py:

def _extract_line(xml_record, skip_empty_lines: bool = True, legacy_polygons: bool = False):
    lines = []
    try:
        im = Image.open(xml_record.imagename)
    except (FileNotFoundError, UnidentifiedImageError):
        # Return two values: the caller unpacks (page_lines, im_mode).
        return lines, None
    try:
        if is_bitonal(im):
            im = im.convert('1')
        for idx, rec in enumerate(xml_record.lines):
            seg = Segmentation(text_direction='horizontal-lr',
                               imagename=xml_record.imagename,
                               type=xml_record.type,
                               lines=[rec],
                               regions=None,
                               script_detection=False,
                               line_orders=[])
            try:
                line_im, line = next(extract_polygons(im, seg, legacy=legacy_polygons))
            except KrakenInputException:
                logger.warning(f'Invalid line {idx} in {xml_record.imagename}')
                continue
            except Exception as e:
                logger.warning(f'Unexpected exception {e} from line {idx} in {xml_record.imagename}')
                continue
            if not line.text and skip_empty_lines:
                continue
            fp = io.BytesIO()
            line_im.save(fp, format='png')
            lines.append({'text': line.text, 'im': fp.getvalue()})
    except Exception as e:
        with open('debug_error.txt', 'a') as debug_error_f:
            debug_error_f.write(str(e)+'\n')
            debug_error_f.write(str(xml_record.imagename)+'\n')
        logger.error(f'Unexpected exception {e} in {xml_record.imagename}')
        #raise e
    return lines, im.mode

My hack is far from perfect, but maybe it can point toward a better way to handle the problem in the future.

Best,

Simon

mittagessen commented 2 weeks ago

That's a bug: the code is already supposed to skip unloadable files, but it does so fairly conservatively, ignoring only certain exceptions raised when the image is opened. It didn't account for Pillow loading the actual image data lazily, on first access, and raising an OSError if that fails.
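To illustrate the lazy loading described above, here is a minimal sketch that builds a PNG in memory, cuts it in half, and checks where the failure shows up. The helper names `make_png` and `probe` are made up for this example and are not part of kraken or Pillow; the sketch also assumes Pillow's default setting, where `ImageFile.LOAD_TRUNCATED_IMAGES` is off.

```python
import io

from PIL import Image


def make_png() -> bytes:
    """Encode a small patterned greyscale image as PNG (hypothetical helper)."""
    pixels = bytes((x * x + 31 * y * y) % 251 for y in range(64) for x in range(64))
    im = Image.frombytes('L', (64, 64), pixels)
    buf = io.BytesIO()
    im.save(buf, format='PNG')
    return buf.getvalue()


def probe(data: bytes) -> tuple:
    """Return (open_succeeded, load_succeeded) for raw image bytes."""
    try:
        # Image.open only parses the header; pixel data is not decoded yet.
        im = Image.open(io.BytesIO(data))
    except OSError:
        return False, False
    try:
        # load() decodes the pixel data; truncation is only detected here.
        im.load()
    except OSError:
        return True, False
    return True, True


full = make_png()
truncated = full[:len(full) // 2]  # simulate an interrupted copy or download

print('full file:     ', probe(full))
print('truncated file:', probe(truncated))
```

With the truncated bytes, opening should still succeed and only the `load()` step should fail, which matches the traceback in the report: `Image.open` passed, and the error surfaced later inside `getcolors()`.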

I've rearranged the exception handling at the top and added OSError to the 'approved' list there, so it should no longer crash on truncated files.
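The general idea of such a fix can be sketched as follows. This is not the actual kraken patch, only an illustration under assumptions: `load_image_or_none` is a hypothetical helper that forces decoding right after opening, so a truncated file is caught in the same place as a missing or unidentifiable one.

```python
import io
import logging
from typing import Optional

from PIL import Image

logger = logging.getLogger(__name__)


def load_image_or_none(source) -> Optional[Image.Image]:
    """Open an image and decode it immediately; return None if unreadable.

    Hypothetical helper for illustration, not part of kraken.
    `source` may be a file path or a binary file-like object.
    """
    try:
        im = Image.open(source)
        # Force the pixel data to be decoded now, so that a truncated
        # file fails here instead of at a later getcolors()/convert() call.
        im.load()
    except OSError as e:
        # FileNotFoundError and PIL.UnidentifiedImageError are both
        # subclasses of OSError, so one clause covers all three cases.
        logger.warning(f'Skipping unreadable image {source}: {e}')
        return None
    return im
```

A caller would then skip the page when the helper returns None, for example `im = load_image_or_none(path)` followed by `if im is None: return [], None`. Catching the broad OSError is a design choice that trades precision for robustness during long batch jobs; a stricter variant could re-raise errors it does not recognise.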