tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Add support for Unicode filenames on MS Windows #3709

Open stweil opened 2 years ago

stweil commented 2 years ago

Tesseract currently has problems when the path of the executable contains Unicode characters which are not supported by the current code page.

I also expect problems for any filenames given to Tesseract (for example image names) which include such characters.

See pull request #3708 which triggered this issue.

stweil commented 2 years ago

I just tested tesseract on Windows 10 with a path which contains Chinese characters.

Normally argv[0], which is passed to main, contains the full path of tesseract.exe. This works as long as all characters of the path are included in the current code page. Chinese characters are not in that code page, and evidently they are replaced by ? characters.

So tesseract could simply check for any ? in argv[0] and abort with an error message if one is found.

That would not add support for Unicode paths but at least avoid some problems.
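
The check proposed above could look roughly like the following. This is only a minimal sketch: PathLooksMangled is a hypothetical name, not an existing Tesseract function, and a ? in the path is treated as a hint that the code page conversion lost a character, nothing more. The sketch skips a leading \\?\ prefix, which Windows uses for extended-length paths and which legitimately contains a ?.

```cpp
#include <cstring>

// Hypothetical helper, not part of Tesseract.
// Returns true if the path contains a '?' outside a leading "\\?\" prefix.
// '?' is not a valid character in Windows file names, so finding one in
// argv[0] suggests that a character was replaced during conversion to the
// current code page. This is a heuristic, not a proof.
bool PathLooksMangled(const char *path) {
  if (path == nullptr) {
    return false;
  }
  // Skip the extended-length path prefix "\\?\" if present.
  static const char kPrefix[] = "\\\\?\\";
  if (std::strncmp(path, kPrefix, 4) == 0) {
    path += 4;
  }
  return std::strchr(path, '?') != nullptr;
}
```

A caller would test argv[0] at the start of main and exit with an error message if the function returns true.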

stweil commented 2 years ago

Generally I am not sure whether it is worth supporting that feature. It is easy to avoid filenames and paths with problematic Unicode characters.

amitdo commented 2 years ago

https://github.com/DanBloomberg/leptonica/issues/537#issuecomment-691714238

danpla commented 2 years ago

Regarding Leptonica: maybe Tesseract can use pixReadStream, which accepts FILE*.
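
One way to obtain such a FILE* for a Unicode path is sketched below. It assumes the path is already available as UTF-8, which on Windows means the arguments would have to be obtained in wide form first. OpenUtf8Path is a hypothetical helper name. On Windows it converts the path to UTF-16 and calls _wfopen; elsewhere it falls back to fopen.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: opens a file whose path is given as UTF-8.
// The mode string is expected to be plain ASCII, for example "rb".
// Returns nullptr if the file cannot be opened.
FILE *OpenUtf8Path(const std::string &utf8_path, const char *mode) {
#ifdef _WIN32
  // Convert UTF-8 to UTF-16 so that the wide-character CRT call can be used.
  auto widen = [](const std::string &s) {
    std::wstring w;
    if (s.empty()) {
      return w;
    }
    const int len = static_cast<int>(s.size());
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), len, nullptr, 0);
    if (n <= 0) {
      return w;  // conversion failed, the open below will fail as well
    }
    w.resize(n);
    MultiByteToWideChar(CP_UTF8, 0, s.data(), len, &w[0], n);
    return w;
  };
  return _wfopen(widen(utf8_path).c_str(), widen(mode).c_str());
#else
  // On other platforms the byte string is passed to the OS unchanged.
  return std::fopen(utf8_path.c_str(), mode);
#endif
}
```

The returned stream could then be handed to pixReadStream and closed by the caller afterwards. This covers only image input; other files that Tesseract or its libraries open by name would need the same treatment.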

wollmers commented 2 years ago

@stweil

> Normally argv[0], which is passed to main, contains the full path of tesseract.exe. This works as long as all characters of the path are included in the current code page. Chinese characters are not in that code page, and evidently they are replaced by ? characters.
>
> So tesseract could simply check for any ? in argv[0] and abort with an error message if one is found.
>
> That would not add support for Unicode paths but at least avoid some problems.

Forget it. This is not reliable. ? is the usual default substitution character when decoding, encoding, or recoding Latin-based 8-bit encodings, and it can mean any of: invalid encoding, broken encoding, character not assigned, glyph not in font. Depending on the combination of encoding and decoding steps, the string manipulation done by the software involved, the font (.notdef glyph), and the renderer (which sometimes ignores the font), you can see other symbols instead.

If you want to waste your time, or must analyse the problem, only a hexdump of the string shows what it really is. Just bail out or die with a message if the file could not be found or opened.

Users with this problem should asciify or slugify their filenames to [a-z0-9._+-]: yes, no spaces, no uppercase, no "special" characters, no punctuation except ., no escaping.
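
As an illustration of the renaming suggested above, here is a minimal sketch that maps a name to the character set [a-z0-9._+-]. SlugifyFilename is a hypothetical helper, and the replacement policy (lowercase ASCII letters, turn everything else outside the set into a single _) is an assumption chosen for the example, not something prescribed in this thread.

```cpp
#include <string>

// Hypothetical helper: reduces a filename to the characters [a-z0-9._+-].
// ASCII uppercase letters are lowercased. Any other byte outside the set
// (spaces, punctuation, every byte of a multi-byte UTF-8 sequence) becomes
// '_', and consecutive '_' characters are collapsed into one.
std::string SlugifyFilename(const std::string &name) {
  std::string out;
  for (unsigned char c : name) {
    char mapped;
    if (c >= 'A' && c <= 'Z') {
      mapped = static_cast<char>(c - 'A' + 'a');
    } else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
               c == '.' || c == '_' || c == '+' || c == '-') {
      mapped = static_cast<char>(c);
    } else {
      mapped = '_';
    }
    // Collapse runs of '_' so that a non-ASCII word does not turn into
    // a long string of underscores.
    if (mapped == '_' && !out.empty() && out.back() == '_') {
      continue;
    }
    out.push_back(mapped);
  }
  return out;
}
```

For example, SlugifyFilename("My Scan (1).PNG") yields my_scan_1_.png. Different input names can map to the same result, so a real renaming tool would also have to handle collisions.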

stweil commented 2 years ago

I agree. It would not be sufficient to upgrade the Tesseract code for full Unicode support on Windows, because all libraries (Leptonica, graphic libraries, libarchive, ...) have the same problem.

Windows is simply a nightmare regarding standards support. And as you said, Tesseract will report if it cannot find or open a file, and it is easy for users to avoid the problem.

amitdo commented 2 years ago

See the Microsoft documentation article "Use UTF-8 code pages in Windows apps".

Regarding older Windows versions: Windows 7 has been EOL since January 2020. Windows 8.1 has a small market share (compared to other Windows versions) and will reach EOL in January 2023.

amitdo commented 2 years ago

https://github.com/tesseract-ocr/tesseract/pull/3708#issuecomment-1162338217

amitdo commented 10 months ago

Should we add a manifest file for Unicode support on Windows 10/11?

stweil commented 10 months ago

Yes, I think so. But then we also need build rules which add the manifest to tesseract.exe (and all other executables?). And that build rule should not depend on Microsoft's mt.exe, but use some alternative which also works in cross builds running on Linux.