ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-extract-images doesn't see image #185

Closed bicolino34 closed 5 months ago

bicolino34 commented 5 months ago

I had a single .jpg image and created an hOCR file for it with gImageReader. The two files have identical names and are located in the same directory. To see how the script works, I ran the following in a terminal: hocr-extract-images ./20240128_105503.html It produces the error: not found: './20240128_105503.jpg' even though an image file with exactly this name is in the same directory. Specifying the image directory with -b doesn't help. I also tried converting the image to .png.

FriedrichFroebel commented 5 months ago

What does your tag defining the page with its image look like? Here is an example from the tests of how it should look:

<div class='ocr_page' id='page_1' title='image "alice_1.png"; bbox 0 0 2488 3507; ppageno 0'>

bicolino34 commented 5 months ago

@FriedrichFroebel It looks like this:

<div title="bbox 0 0 3468 4624; image './20240128_105503.jpg'; ppageno 1; res 100; rot 90; scan_res 100 100" class="ocr_page" id="page_1">

FriedrichFroebel commented 5 months ago

This is related to https://github.com/ocropus/hocr-tools/blob/2867727ae986dd1e1727d98300da053caaffdb9b/hocr-extract-images#L28 where single quotation marks are expected on the outer level and only double quotation marks around nested values, whereas your file uses them the other way round. Using

            args = args.strip('"\'')

there instead seems to fix it.
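
To illustrate why stripping both quote styles helps, here is a minimal Python sketch that parses the two title strings quoted in this thread. Note that get_prop below is a simplified, hypothetical stand-in written for this example; it is not copied from hocr-extract-images, and the real script may differ in its details.

```python
def get_prop(title, name):
    """Return the value of the hOCR property `name` from a title string, or None.

    Hypothetical helper for illustration only, not the actual hocr-tools code.
    """
    for prop in title.split(';'):
        parts = prop.split(None, 1)
        if len(parts) != 2:
            continue  # skip empty entries or keys without a value
        key, args = parts
        if key == name:
            # Strip surrounding whitespace, then both double and single quotes.
            return args.strip().strip('"\'')
    return None


# Title from the test example: double quotes around the image name.
tests_title = 'image "alice_1.png"; bbox 0 0 2488 3507; ppageno 0'
# Title from the reported file: single quotes around the image name.
reported_title = ("bbox 0 0 3468 4624; image './20240128_105503.jpg'; "
                  "ppageno 1; res 100; rot 90; scan_res 100 100")

print(get_prop(tests_title, 'image'))
print(get_prop(reported_title, 'image'))
```

If only '"' were stripped, the second value would keep its single quotes, which matches the quoted file name shown in the "not found" error above.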

bicolino34 commented 5 months ago

Thank you! This solved the issue.