Open pprw opened 1 day ago
Thanks for the report. I have not analyzed this further yet, but the page dimensions do not match for the provided test files: the image is 923 x 1181 pixels, but the ocr_page
bounding box is 2458 x 3150 and thus much larger.
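A quick way to check for this kind of mismatch is to read the bbox property from the ocr_page title attribute and compare it with the image size. The following is only a minimal sketch: the helper names and the title string are made up for illustration, and only the two sizes (2458 x 3150 and 923 x 1181) come from the test files.

```python
import re


def page_bbox_size(title):
    """Return (width, height) from the bbox property of an hOCR title attribute."""
    match = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
    if match is None:
        raise ValueError("no bbox property found in title")
    x0, y0, x1, y1 = map(int, match.groups())
    return x1 - x0, y1 - y0


def scale_factors(bbox_size, image_size):
    """Return the (x, y) factors that map hOCR coordinates onto the image."""
    return image_size[0] / bbox_size[0], image_size[1] / bbox_size[1]


# Hypothetical title string; only the 2458 x 3150 size comes from this thread.
title = 'image "page.png"; bbox 0 0 2458 3150'
bbox = page_bbox_size(title)
sx, sy = scale_factors(bbox, (923, 1181))
print(bbox, round(sx, 3), round(sy, 3))
```

If both factors are close to 1, the image and the hOCR file agree. Here they are far from 1, which would explain text being placed at the wrong positions.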
Thanks for the quick answer and for pointing me in the right direction.
The hOCR file is created by the following kraken command:
kraken -h -I r240_iecl_bki08-2.pdf -o .hocr -f pdf segment -bl ocr -m /home/pierre/bin/kraken/10592716/catmus-print-fondue-large.mlmodel
This command extracts all pages of the PDF to PNG files (in /tmp), uses these files for segmentation and recognition, and then produces one hOCR file per page.
The test page uploaded (as JPEG) in the first post corresponds to the following PNG:
$ pnginfo tmpxkpzga40.png
tmpxkpzga40.png...
Image Width: 2458 Image Length: 3150
Bitdepth (Bits/Sample): 8
Channels (Samples/Pixel): 4
Pixel depth (Pixel Depth): 32
Colour Type (Photometric Interpretation): RGB with alpha channel
Image filter: Single row per byte filter
Interlacing: No interlacing
Compression Scheme: Deflate method 8, 32k window
Resolution: 11811, 11811 (pixels per meter)
FillOrder: msb-to-lsb
Byte Order: Network (Big Endian)
Number of text strings: 0
So 2458x3150.
When I rebuild the PDF with hocr-pdf, I usually extract the images from the PDF with pdfimages:
$ pdfimages -j r240_iecl_bki08-2.pdf r240_iecl_bki08-2
which produces an image of 1230x1575:
$ exiftool r240_iecl_bki08-2-001.jpg
ExifTool Version Number : 12.57
File Name : r240_iecl_bki08-2-001.jpg
Directory : .
File Size : 142 kB
File Modification Date/Time : 2024:11:05 10:05:07+01:00
File Access Date/Time : 2024:11:05 10:05:20+01:00
File Inode Change Date/Time : 2024:11:05 10:05:07+01:00
File Permissions : -rw-r--r--
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
DCT Encode Version : 100
APP14 Flags 0 : (none)
APP14 Flags 1 : (none)
Color Transform : YCbCr
Image Width : 1230
Image Height : 1575
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Image Size : 1230x1575
Megapixels : 1.9
(For the test page in the first post, I extracted the image with pdfarranger, which seems to be why its size is 923 x 1181 pixels and not 1230x1575.)
So I guess I need a way to extract images with exactly the same dimensions as the ones kraken uses (but as JPEG).
I will look for tools to do this (if you have an idea, do not hesitate to share it).
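The three image sizes can be related through the resolution each one implies. This is a rough sketch, assuming that every image covers the same full page and that the PNG resolution reported by pnginfo (11811 pixels per meter) is the one kraken rendered at; the function names are made up for illustration.

```python
INCH_IN_METERS = 0.0254


def ppm_to_dpi(pixels_per_meter):
    """Convert a resolution in pixels per meter to dots per inch."""
    return pixels_per_meter * INCH_IN_METERS


def implied_dpi(width_px, reference_width_px, reference_dpi):
    """Resolution implied by width_px if both images show the same page."""
    return reference_dpi * width_px / reference_width_px


# Assumption: the PNG written by kraken is the reference rendering.
kraken_dpi = ppm_to_dpi(11811)
sizes = [("kraken PNG", 2458), ("pdfimages JPEG", 1230), ("pdfarranger JPEG", 923)]
for label, width in sizes:
    print(label, round(implied_dpi(width, 2458, kraken_dpi)))
```

Under these assumptions the PNG is at about 300 dpi and the image extracted by pdfimages at about half of that, so the hOCR coordinates would need to be scaled by roughly 0.5 to fit it.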
kraken seems to use libvips for PDF handling: https://github.com/mittagessen/kraken/blob/659d249297cf6b74bcb625c9eef1b52a28a54a61/kraken/kraken.py#L373-L409, with lines 389-403 being especially relevant. As far as I understand, this first loads the full document to retrieve the number of pages before rendering each page as a PNG file (with 300 dpi resolution) to a temporary file. In your case, you should be able to do something similar (untested):
import sys
from pathlib import Path

from pyvips import Image

path = Path(sys.argv[1]).resolve()
# Load all pages once (n=-1) only to read the page count.
full_document = Image.new_from_file(str(path), dpi=300, n=-1, access="sequential")
n_pages = full_document.get('n-pages')
assert n_pages
for i in range(n_pages):
    # Render each page at 300 dpi, the resolution kraken appears to use.
    page = Image.new_from_file(str(path), dpi=300, page=i, access="sequential")
    # Path.with_stem requires Python 3.9 or newer.
    target_path = path.with_stem(f'{path.stem}-{i:03d}').with_suffix('.jpg')
    page.write_to_file(str(target_path))
As you can see, I have a problem with text alignment in some PDFs produced with hocr-pdf from an hOCR file created by kraken.
I am not sure it is related to hocr-pdf.
test files:
bug_alignement.zip