ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-extract-images doesn't see image #185

Closed bicolino34 closed 10 months ago

bicolino34 commented 10 months ago

I had a single .jpg image and created an hOCR file for it with gImageReader. The two files have identical base names and are located in the same directory. To see how the script works, I ran this in a terminal: hocr-extract-images ./20240128_105503.html It produces the error: not found: './20240128_105503.jpg' even though an image file with exactly that name is in the same directory. Specifying the image directory with -b doesn't help. I also tried converting the image to .png.

FriedrichFroebel commented 10 months ago

What does the tag that defines the page and its image look like? Here is an example from the tests of how it should look:

<div class='ocr_page' id='page_1' title='image "alice_1.png"; bbox 0 0 2488 3507; ppageno 0'>

bicolino34 commented 10 months ago

@FriedrichFroebel It looks like this: <div title="bbox 0 0 3468 4624; image './20240128_105503.jpg'; ppageno 1; res 100; rot 90; scan_res 100 100" class="ocr_page" id="page_1">

FriedrichFroebel commented 10 months ago

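This is related to https://github.com/ocropus/hocr-tools/blob/2867727ae986dd1e1727d98300da053caaffdb9b/hocr-extract-images#L28 where single quotation marks are expected around the whole title attribute and only double quotation marks around nested values such as the image path. Your file has it the other way round, with single quotation marks around the image path. Using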

            args = args.strip('"\'')

there instead seems to fix it.
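
To illustrate why the quote style matters, here is a minimal sketch of parsing the properties in an hOCR title attribute. The get_prop helper below is hypothetical and written only for this example; it is not copied from hocr-extract-images and only assumes that properties are separated by semicolons and that the value follows the first run of whitespace.

```python
def get_prop(title, name):
    """Return the value of property `name` from an hOCR title string, or None.

    Hypothetical helper for illustration; not the actual hocr-tools code.
    """
    for prop in title.split(';'):
        parts = prop.split(None, 1)  # key, then the rest of the property
        if len(parts) != 2:
            continue  # skip empty or value-less entries
        key, args = parts
        if key == name:
            # Strip both quote styles, as in the suggested fix.
            return args.strip('"\'')
    return None


# Title from the hocr-tools test file (double quotes around the image name).
double_quoted = 'image "alice_1.png"; bbox 0 0 2488 3507; ppageno 0'
# Title as shown in this issue (single quotes around the image name).
single_quoted = "bbox 0 0 3468 4624; image './20240128_105503.jpg'; ppageno 1"

print(get_prop(double_quoted, 'image'))
print(get_prop(single_quoted, 'image'))

# Stripping only double quotes leaves single quotes in place:
print("'./20240128_105503.jpg'".strip('"'))
```

With both quote characters stripped, the two title styles yield a bare file name. When only double quotes are stripped, the single quotes stay part of the value, which would explain a lookup for a file whose name literally includes them.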

bicolino34 commented 10 months ago

Thank you! This solved the issue.