stefan6419846 / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.
https://hocr-tools-lib.readthedocs.io/

wrong text position/alignment with hocr-pdf #34

Open pprw opened 1 day ago

pprw commented 1 day ago

As you can see, I have a problem with text alignment in some PDFs produced with hocr-pdf from an hOCR file created by kraken.

image

I am not sure it is related to hocr-pdf.

test files:

bug_alignement.zip

stefan6419846 commented 1 day ago

Thanks for the report. I have not analyzed this further yet, but the page dimensions do not match for the provided test files: the image is 923 x 1181 pixels, but the ocr_page bounding box is 2458 x 3150 and thus much larger.
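A minimal sketch of how such a mismatch could be checked, assuming the hOCR file carries the page size in the `bbox` property of the `ocr_page` element's `title` attribute. The names `PageBoxParser`, `page_size_from_hocr` and `scale_factors` are made up for this illustration and are not part of hocr-tools; the image size is passed in by hand to keep the example free of third-party dependencies.

```python
import re
from html.parser import HTMLParser

# Matches "bbox x0 y0 x1 y1" inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class PageBoxParser(HTMLParser):
    """Hypothetical helper: remembers the size of the first ocr_page bbox."""

    def __init__(self):
        super().__init__()
        self.page_size = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.page_size is None and "ocr_page" in classes:
            match = BBOX_RE.search(attributes.get("title") or "")
            if match:
                x0, y0, x1, y1 = map(int, match.groups())
                self.page_size = (x1 - x0, y1 - y0)


def page_size_from_hocr(hocr_text):
    """Return (width, height) of the first ocr_page, or None if not found."""
    parser = PageBoxParser()
    parser.feed(hocr_text)
    return parser.page_size


def scale_factors(image_size, page_size):
    """Return the (x, y) factors that map hOCR coordinates onto the image."""
    return image_size[0] / page_size[0], image_size[1] / page_size[1]


if __name__ == "__main__":
    sample = '<div class="ocr_page" title="bbox 0 0 2458 3150"></div>'
    page = page_size_from_hocr(sample)
    print(page, scale_factors((923, 1181), page))
```

Factors far from 1.0 mean that the image and the hOCR coordinates come from renderings at different resolutions.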

pprw commented 22 hours ago

Thanks for the quick answer and for pointing me in the right direction.

The hOCR file is created by the following kraken command:

kraken -h -I r240_iecl_bki08-2.pdf -o .hocr -f pdf segment -bl ocr -m /home/pierre/bin/kraken/10592716/catmus-print-fondue-large.mlmodel

This command extracts all pages of the PDF to PNG files (in /tmp), uses these files for segmentation and recognition, and then produces one hOCR file per page.

The test page uploaded (as JPEG) in the first post corresponds to the following PNG:

tmpxkpzga40

$ pnginfo tmpxkpzga40.png 
tmpxkpzga40.png...
  Image Width: 2458 Image Length: 3150
  Bitdepth (Bits/Sample): 8
  Channels (Samples/Pixel): 4
  Pixel depth (Pixel Depth): 32
  Colour Type (Photometric Interpretation): RGB with alpha channel 
  Image filter: Single row per byte filter 
  Interlacing: No interlacing 
  Compression Scheme: Deflate method 8, 32k window
  Resolution: 11811, 11811 (pixels per meter)
  FillOrder: msb-to-lsb
  Byte Order: Network (Big Endian)
  Number of text strings: 0

So 2458x3150.
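The resolution reported by pnginfo is in pixels per meter. As a small worked example (the function name `ppm_to_dpi` is only illustrative), converting it to dots per inch shows which rendering resolution the PNG corresponds to:

```python
METERS_PER_INCH = 0.0254  # exact, by definition of the inch


def ppm_to_dpi(pixels_per_meter):
    """Convert a PNG pHYs resolution (pixels per meter) to dots per inch."""
    return pixels_per_meter * METERS_PER_INCH


if __name__ == "__main__":
    dpi = ppm_to_dpi(11811)
    # Physical page size implied by the pixel dimensions at that resolution.
    width_in = 2458 / dpi
    height_in = 3150 / dpi
    print(f"{dpi:.2f} dpi, {width_in:.2f} x {height_in:.2f} inches")
```

The result is very close to 300 dpi, so the PNG appears to be a 300 dpi rendering of the page.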

When I rebuild the PDF with hocr-pdf, I usually extract the images from the PDF with pdfimages:

$ pdfimages -j r240_iecl_bki08-2.pdf r240_iecl_bki08-2

which produces images of 1230x1575:

$ exiftool r240_iecl_bki08-2-001.jpg 
ExifTool Version Number         : 12.57
File Name                       : r240_iecl_bki08-2-001.jpg
Directory                       : .
File Size                       : 142 kB
File Modification Date/Time     : 2024:11:05 10:05:07+01:00
File Access Date/Time           : 2024:11:05 10:05:20+01:00
File Inode Change Date/Time     : 2024:11:05 10:05:07+01:00
File Permissions                : -rw-r--r--
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
DCT Encode Version              : 100
APP14 Flags 0                   : (none)
APP14 Flags 1                   : (none)
Color Transform                 : YCbCr
Image Width                     : 1230
Image Height                    : 1575
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:0 (2 2)
Image Size                      : 1230x1575
Megapixels                      : 1.9

(For the test page in the first post, I extracted the image with pdfarranger, which seems to be why its size is 923 x 1181 pixels and not 1230x1575.)

So I guess I need a way to extract images with exactly the same dimensions as kraken's (but in JPEG).

I will look for tools to do this (if you have an idea, do not hesitate).
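To put the three sizes side by side, here is a rough sketch that estimates the effective resolution of each extracted image. It assumes the kraken PNG (2458 pixels wide) is a 300 dpi rendering, as its reported 11811 pixels per meter suggests; `effective_dpi` is a name chosen for this example only.

```python
def effective_dpi(width_px, reference_width_px, reference_dpi=300.0):
    """Estimate the resolution of an image of the same page from its width,
    relative to a reference rendering with a known resolution."""
    return width_px * reference_dpi / reference_width_px


if __name__ == "__main__":
    kraken_width = 2458  # width of the PNG kraken rendered
    for tool, width in (("pdfimages", 1230), ("pdfarranger", 923)):
        print(f"{tool}: about {effective_dpi(width, kraken_width):.1f} dpi")
```

Under that assumption, the pdfimages output is roughly half the resolution of the kraken render, and the pdfarranger export is lower still, so neither matches the coordinates in the hOCR file.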

stefan6419846 commented 20 hours ago

kraken seems to use libvips for PDF handling: https://github.com/mittagessen/kraken/blob/659d249297cf6b74bcb625c9eef1b52a28a54a61/kraken/kraken.py#L373-L409, with lines 389-403 being especially relevant. As far as I understand, this first loads the full document to retrieve the number of pages before rendering each page as a PNG file (with 300 dpi resolution) to a temporary file. In your case, you should be able to do something similar (untested):

import sys
from pathlib import Path

from pyvips import Image

path = Path(sys.argv[1]).resolve()
# Load all pages once (n=-1) only to read the page count.
full_document = Image.new_from_file(str(path), dpi=300, n=-1, access="sequential")
n_pages = full_document.get('n-pages')
assert n_pages

# Render each page at 300 dpi, the resolution kraken uses.
for i in range(n_pages):
    page = Image.new_from_file(str(path), dpi=300, page=i, access="sequential")
    # Path.with_stem() requires Python 3.9 or newer.
    target_path = path.with_stem(f'{path.stem}-{i:03d}').with_suffix('.jpg')
    page.write_to_file(str(target_path))