Open stweil opened 2 years ago
I just tested tesseract on Windows 10 with a path which contains Chinese characters.
Normally argv[0], which is passed as an argument to function main, contains tesseract.exe with the full path. This works as long as all characters from the path are included in the code page. Chinese characters are not in that code page, and obviously they are replaced by ? characters.
So tesseract could simply check for any ? in argv[0] and abort with an error message if one is found.
That would not add support for Unicode paths but at least avoid some problems.
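A minimal sketch of such a check, assuming a hypothetical helper name (path_looks_mangled is not existing Tesseract code). It only looks for the ? substitution character, so it is a heuristic rather than real validation:

```cpp
#include <cstring>

// Hypothetical helper, not part of Tesseract.
// Returns true if the path is missing or contains '?', the character that
// typically stands in for anything the active code page could not represent.
bool path_looks_mangled(const char *path) {
  return path == nullptr || std::strchr(path, '?') != nullptr;
}
```

In main this would be called as path_looks_mangled(argv[0]), printing an error message and returning a failure code when it yields true.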
Generally I am not sure whether it is worth supporting that feature. It is easy to avoid filenames and paths with problematic Unicode characters.
Regarding Leptonica: maybe Tesseract can use pixReadStream, which accepts FILE*.
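One way to obtain such a FILE* is sketched below, under the assumption that the caller has the path as UTF-8. The helper name open_utf8 is hypothetical; on Windows it converts the path to UTF-16 and opens it with _wfopen, elsewhere it falls back to plain fopen. The resulting stream could then be handed to a FILE*-based reader such as pixReadStream.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>

// Convert UTF-8 to UTF-16; returns false on invalid input.
static bool widen(const std::string &s, std::wstring &w) {
  int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s.c_str(), -1,
                              nullptr, 0);
  if (n <= 0) return false;
  w.resize(n);  // n includes the terminating null
  return MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s.c_str(), -1,
                             &w[0], n) > 0;
}
#endif

// Hypothetical helper, not part of Tesseract or Leptonica.
// Opens a file whose name is given as UTF-8; returns nullptr on failure.
FILE *open_utf8(const std::string &utf8_path, const char *mode) {
  if (utf8_path.empty() || mode == nullptr || *mode == '\0') return nullptr;
#ifdef _WIN32
  std::wstring wpath, wmode;
  if (!widen(utf8_path, wpath) || !widen(mode, wmode)) return nullptr;
  return _wfopen(wpath.c_str(), wmode.c_str());
#else
  return std::fopen(utf8_path.c_str(), mode);
#endif
}
```

This only helps if the path reaches the program undamaged in the first place, which on Windows means reading the command line through a wide-character API rather than the narrow argv.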
@stweil
> Normally argv[0], which is passed as an argument to function main, contains tesseract.exe with the full path. This works as long as all characters from the path are included in the code page. Chinese characters are not in that code page, and obviously they are replaced by ? characters.
> So tesseract could simply check for any ? in argv[0] and abort with an error message if one is found.
> That would not add support for Unicode paths but at least avoid some problems.
Forget it. This is not reliable. ? is normally used in decode/encode/recode as the default replacement in Latin-based 8-bit encodings and can mean: invalid encoding, broken encoding, character not assigned, glyph not in font. With other combinations of encoding and decoding, string manipulation by the software involved, font (.undef glyph) and renderer (they sometimes ignore the font), you can see other symbols. If you want to waste your time or must analyse the problem, only a hexdump of the string helps to see what it really is.
Just bail out or die with a message if the file could not be found or opened. Users with this problem should asciify or slugify their filenames to [a-z0-9._+-]: yes, no spaces, no uppercase, no "special" characters, no punctuation except ., no escaping.
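A rough sketch of what slugifying to [a-z0-9._+-] could look like. The function name slugify and the choice of - as the replacement character are assumptions for illustration; this is not code from Tesseract:

```cpp
#include <string>

// Hypothetical helper, not part of Tesseract.
// Reduces a file name to the character set [a-z0-9._+-].
std::string slugify(const std::string &name) {
  std::string out;
  for (unsigned char c : name) {
    char mapped;
    if (c >= 'A' && c <= 'Z') {
      mapped = static_cast<char>(c - 'A' + 'a');  // fold ASCII uppercase
    } else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
               c == '.' || c == '_' || c == '+' || c == '-') {
      mapped = static_cast<char>(c);  // already allowed
    } else {
      mapped = '-';  // spaces, other punctuation, every non-ASCII byte
    }
    // Collapse runs of '-' so multi-byte characters leave a single dash.
    if (mapped == '-' && !out.empty() && out.back() == '-') continue;
    out.push_back(mapped);
  }
  return out;
}
```

For example, slugify("My Scan (1).PNG") yields "my-scan-1-.png". Distinct names can collapse to the same slug, so a real renaming tool would also need to handle collisions.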
I agree. It would not be sufficient to upgrade the Tesseract code for full Unicode support on Windows, because all libraries (Leptonica, graphic libraries, libarchive, ...) have the same problem.
Windows is simply a nightmare regarding standards support. And as you said, Tesseract will report if it cannot find or open a file, and it is easy for users to avoid the problem.
Use UTF-8 code pages in Windows apps
Regarding older Windows versions: Windows 7 has been EOL since January 2020. Windows 8.1 has a small market share (compared to other Windows versions) and will reach EOL in January 2023.
Should we add a manifest file for Unicode support on Windows 10/11?
Yes, I think so. But then we also need build rules which add the manifest to tesseract.exe (and all other executables?). And that build rule should not depend on Microsoft's mt.exe, but use some alternative which also works in cross builds running on Linux.
Tesseract currently has problems when the path of the executable contains Unicode characters which are not supported by the current code page.
I also expect problems for any filenames given to Tesseract (for example image names) which include such characters.
See pull request #3708 which triggered this issue.