Closed MerlijnWajer closed 3 years ago
I didn't have a debug build of leptonica handy, but peeked at the source and searched for fopenReadStream, since that occurred in the failure message.
It looks like genPathname messes up the filename, and changes it from "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2" to "/tmp/example/427527-/nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2".
This then fails to open, which is the "file not found" message that we see.
Breakpoint 1, 0x00007ffff7aa53b0 in fopenReadStream () from /usr/lib64/liblept.so.5
(gdb) print (char*)$rdi
$9 = 0x7fffffffd094 "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
(gdb) c
Continuing.
Breakpoint 4, 0x00007ffff7aa50b0 in genPathname () from /usr/lib64/liblept.so.5
(gdb) print (char*)$rdi
$10 = 0x7fffffffd094 "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
(gdb) step
Single stepping until exit from function genPathname,
which has no line number information.
(gdb) print (char*)$rax
$19 = 0x5555555b0570 "/tmp/example/427527-/nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
So I guess this is a problem in leptonica.
That's correct. Leptonica "normalizes" path names to use only / (on Linux) or \ (on Windows). We can only fix that with a new Leptonica version.
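To illustrate why this breaks the file name from the gdb session, here is a minimal Python sketch of that kind of normalization. It is not Leptonica's code; `normalize_separators` is a hypothetical stand-in that simply treats every backslash as a path separator, which is the effect visible in the debugger output.

```python
def normalize_separators(path: str) -> str:
    # Illustrative stand-in, NOT Leptonica's implementation:
    # every backslash is assumed to be a separator and becomes "/".
    return path.replace("\\", "/")


# The Python literal "\\" is a single backslash, as in the gdb output.
name = "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
normalized = normalize_separators(name)
print(normalized)
```

On Linux the backslash is a legal character inside a file name, so the rewritten path points at a directory "427527-" that does not exist, and the open fails.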
@MerlijnWajer, I am afraid that you'll have to work around that problem until a fixed Leptonica is available. Either link the image file to a name without \ and run Tesseract on that file, or let Tesseract read the image from standard input by using a pipe. The drawback is that you won't get the original image name in hOCR output. You might also consider using the Python API tesserocr instead of the command line Tesseract.
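A rough sketch of the first workaround (linking to a name without a backslash), in Python. The helper names `link_without_backslash` and `run_tesseract_hocr` are made up for this example, replacing the backslash with `_` is an arbitrary choice, and the sketch assumes a `tesseract` binary on `PATH` and a platform that supports symlinks.

```python
import os
import subprocess
import tempfile


def link_without_backslash(image_path, link_dir):
    """Create a symlink in link_dir whose name contains no backslash."""
    # Arbitrary choice for this sketch: replace each backslash with "_".
    safe_name = os.path.basename(image_path).replace("\\", "_")
    link_path = os.path.join(link_dir, safe_name)
    os.symlink(os.path.abspath(image_path), link_path)
    return link_path


def run_tesseract_hocr(image_path, output_base):
    """Run the tesseract CLI on the symlink instead of the original file.

    Assumes `tesseract` is on PATH; writes output_base + ".hocr".
    """
    with tempfile.TemporaryDirectory() as link_dir:
        link_path = link_without_backslash(image_path, link_dir)
        subprocess.run(["tesseract", link_path, output_base, "hocr"], check=True)
        return link_path
```

The hOCR output will then carry the symlink's path rather than the original file name, which is the drawback mentioned above.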
Yeah, that's fair enough. I'll see what makes the most sense. I am not using the title= in the hOCR output right now, so I might just rename the files for Tesseract for now, if they contain a backslash, and deal with the fact that the hOCR will contain invalid image names in the title= attribute in the page node.
Of course you can also post-process the hOCR file and fix the name again.
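Such post-processing could look roughly like the Python sketch below. `restore_image_name` is a hypothetical helper, and the sketch assumes the page node's title= attribute contains the name Tesseract was given verbatim in the form `image "<name>"`. Any escaping that Tesseract applies to special characters in the name is not handled here.

```python
def restore_image_name(hocr_text, used_name, original_name):
    """Put the original image name back into the hOCR page title.

    Assumption for this sketch: the name passed to Tesseract appears
    verbatim as image "<used_name>" in the title= attribute.
    """
    return hocr_text.replace(f'image "{used_name}"', f'image "{original_name}"')


# Made-up hOCR fragment, for illustration only.
sample = "<div class='ocr_page' title='image \"/tmp/links/a_b.jp2\"; bbox 0 0 10 10'>"
fixed = restore_image_name(sample, "/tmp/links/a_b.jp2", "/tmp/example/a\\b.jp2")
```

A plain string replacement keeps the rest of the hOCR file byte-for-byte unchanged. A real implementation might prefer to parse the file and rewrite only the page node.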
This is now fixed in leptonica at head. See leptonica#558.
I've built a version of leptonica based on Ubuntu 20.04 liblept5 locally with just the commit with the fix added, and it works for me. Not sure when it makes sense to close this (Tesseract) bug report.
Environment
Current Behavior:
Tesseract cannot read files with a backslash in their name.
Expected Behavior:
Tesseract should be able to read files with a backslash in their name.
Suggested Fix:
I think the problem might be in leptonica - will follow up shortly.