Tesseract cannot read files with backslashes in the name

MerlijnWajer commented 3 years ago

Environment

Tesseract Version: 4.1.1 (but same bug is present on master)
Platform: Linux gentoo-x230 5.6.18-grsec #2 SMP Tue Jul 7 18:17:17 CEST 2020 x86_64 Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz GenuineIntel GNU/Linux

Current Behavior:

Tesseract cannot read files with a backslash in their name.

$ wc -c /tmp/test\\.jp2
455359 /tmp/test\.jp2
merlijn@gentoo-x230 /tmp $ tesseract /tmp/test\\.jp2 -
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.

Expected Behavior:

Tesseract should be able to read files with a backslash in their name.

Suggested Fix:

I think the problem might be in leptonica - will follow up shortly.

MerlijnWajer commented 3 years ago

I didn't have a debug build of leptonica handy, but peeked at the source and searched for fopenReadStream, since that occurred in the failure message.

It looks like genPathname messes up the filename, and changes it from "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2" to "/tmp/example/427527-/nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2".

This then fails to open, which is the file not found message that we see.

Breakpoint 1, 0x00007ffff7aa53b0 in fopenReadStream () from /usr/lib64/liblept.so.5
(gdb) print (char*)$rdi
$9 = 0x7fffffffd094 "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
(gdb) c
Continuing.

Breakpoint 4, 0x00007ffff7aa50b0 in genPathname () from /usr/lib64/liblept.so.5
(gdb) print (char*)$rdi
$10 = 0x7fffffffd094 "/tmp/example/427527-\\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
(gdb) step
Single stepping until exit from function genPathname,
which has no line number information.
(gdb) print (char*)$rax
$19 = 0x5555555b0570 "/tmp/example/427527-/nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"

So I guess this is a problem in leptonica.

stweil commented 3 years ago

That's correct. Leptonica "normalizes" path names to use only / (on Linux) or \ (on Windows). We can only fix that with a new Leptonica version.
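To make the effect concrete, here is a minimal Python sketch that mimics the rewrite seen in the gdb session above. It is not Leptonica's code: `normalize_like_observed` is a hypothetical helper, and it assumes the normalization simply replaces every backslash with a forward slash, which is all the trace shows.

```python
def normalize_like_observed(path):
    """Mimic the rewrite observed in the trace: every backslash becomes '/'.

    Assumption: this only reproduces the observed effect on Linux and is
    not Leptonica's actual implementation.
    """
    return path.replace("\\", "/")


# gdb prints the single backslash in the file name as "\\"; a raw string
# holds the real name as it exists on disk.
original = r"/tmp/example/427527-\nagripracharni Patrika Year 60 Vol 2 Ac 2610_0000.jp2"
rewritten = normalize_like_observed(original)

# The rewritten name points into a directory "427527-" that does not exist,
# which is why the open fails with "file not found".
print(rewritten)
```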

stweil commented 3 years ago

@MerlijnWajer, I am afraid that you'll have to work around that problem until a fixed Leptonica is available. Either link the image file to a name without \ and run Tesseract on that file, or let Tesseract read the image from standard input by using a pipe. The drawback is that you won't get the original image name in hOCR output. You might also consider using the Python API tesserocr instead of the command line Tesseract.
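A rough Python sketch of the first two workarounds, in case it helps. The helper names are made up for illustration, replacing the backslash with an underscore is an arbitrary choice, and the code assumes a `tesseract` binary on `PATH` that accepts `stdin` and `stdout` (or `-`) in place of the input and output names.

```python
import os
import subprocess
import tempfile


def backslash_free_link(image_path, link_dir):
    """Symlink image_path into link_dir under a name without backslashes.

    Hypothetical helper: replacing '\\' with '_' is an arbitrary choice.
    """
    safe_name = os.path.basename(image_path).replace("\\", "_")
    link_path = os.path.join(link_dir, safe_name)
    os.symlink(os.path.abspath(image_path), link_path)
    return link_path


def ocr_via_symlink(image_path):
    """Workaround 1: run Tesseract on a backslash-free symlink."""
    with tempfile.TemporaryDirectory() as link_dir:
        link_path = backslash_free_link(image_path, link_dir)
        result = subprocess.run(
            ["tesseract", link_path, "-"],
            capture_output=True,
            check=True,
        )
    return result.stdout.decode("utf-8")


def ocr_via_stdin(image_path):
    """Workaround 2: pipe the image in, so no file name reaches Leptonica."""
    with open(image_path, "rb") as image_file:
        result = subprocess.run(
            ["tesseract", "stdin", "stdout"],
            stdin=image_file,
            capture_output=True,
            check=True,
        )
    return result.stdout.decode("utf-8")
```

In both cases the output no longer carries the original image name: the symlink variant records the link path, and the pipe variant has no file name to record.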

MerlijnWajer commented 3 years ago

Yeah, that's fair enough. I'll see what makes the most sense. I am not using the title= in the hOCR output right now, so I might just rename the files for Tesseract for now, if they contain a backslash, and deal with the fact that the hOCR will contain invalid image names in the title= attribute in the page node.

stweil commented 3 years ago

Of course you can also post-process the hOCR file and fix the name again.
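Such a post-processing step could look roughly like the sketch below. `restore_image_name` is a hypothetical helper, and it assumes the page node's `title` attribute contains the path in the form `image "..."` and that the temporary name contains no characters that need HTML escaping.

```python
import html


def restore_image_name(hocr_text, temp_name, original_name):
    """Swap the temporary image name in the hOCR for the original one.

    Assumptions: the name appears as `image "<name>"` inside a title
    attribute, and temp_name needs no HTML escaping.
    """
    old = 'image "' + temp_name + '"'
    # Escape the original name so quotes or ampersands cannot break the attribute.
    new = 'image "' + html.escape(original_name, quote=True) + '"'
    return hocr_text.replace(old, new)


sample = (
    "<div class='ocr_page' id='page_1' "
    "title='image \"/tmp/links/test_.jp2\"; bbox 0 0 10 10; ppageno 0'>"
)
fixed = restore_image_name(sample, "/tmp/links/test_.jp2", "/tmp/test\\.jp2")
print(fixed)
```

A plain string replacement is enough here because the temporary name is chosen by the caller; parsing the hOCR as XML would be more robust if the names are not under your control.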

DanBloomberg commented 3 years ago

This is now fixed in leptonica at head. See leptonica#558.

MerlijnWajer commented 3 years ago

I've built a version of leptonica based on Ubuntu 20.04 liblept5 locally with just the commit with the fix added, and it works for me. Not sure when it makes sense to close this (Tesseract) bug report.
