Closed SunshineSpring666 closed 8 months ago
Hello,
Yeah, it sounds like an encoding issue. I also had one on Windows. The library gets its recognized text from this native Tesseract library method:
TESS_API char *TessBaseAPIGetUTF8Text(TessBaseAPI *handle);
I think DllImport might convert the string to a different encoding. On Windows I converted the string to a byte array and decoded it as UTF-8 with the code below in TessPage. Here, result is the string returned from recognition, and the Encoding class is from the System.Text namespace.
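// Guard the input first; the byte array itself can never be null.
if (result is null)
{
    return string.Empty;
}
// Reinterpret each char as one raw byte (only the low 8 bits are kept).
var bytes = new byte[result.Length];
for (int i = 0; i < result.Length; i++)
{
    bytes[i] = (byte)result[i];
}
try
{
    // Decode the recovered bytes as UTF-8.
    return Encoding.UTF8.GetString(bytes);
}
catch (Exception ex)
{
    throw new InvalidBytesException("Cannot encode current byte array, because it contains invalid bytes.", ex);
}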
It shouldn't matter at which point you change the string encoding. The library does not change it after the DllImport call.
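To see when this char-to-byte round trip can work at all, here is a minimal sketch. It does not call Tesseract; it only simulates a marshaller by decoding UTF-8 bytes with two different single-byte encodings. The class name, the RoundTrip helper, and the sample text are illustrative assumptions, and Encoding.Latin1 needs .NET 5 or later.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper mirroring the workaround above: cast each char
    // back to a byte, then decode the bytes as UTF-8.
    public static string RoundTrip(string marshalled)
    {
        var bytes = new byte[marshalled.Length];
        for (int i = 0; i < marshalled.Length; i++)
        {
            bytes[i] = (byte)marshalled[i];
        }
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        const string original = "床前明月光";
        byte[] utf8 = Encoding.UTF8.GetBytes(original);

        // Latin-1 maps every byte 0x00-0xFF to the char with the same value,
        // so the original bytes survive and the round trip succeeds.
        string viaLatin1 = Encoding.Latin1.GetString(utf8);
        Console.WriteLine(RoundTrip(viaLatin1) == original); // True

        // ASCII replaces every byte above 0x7F with '?', so the original
        // bytes are already lost before the workaround runs.
        string viaAscii = Encoding.ASCII.GetString(utf8);
        Console.WriteLine(RoundTrip(viaAscii) == original); // False
    }
}
```

If the marshaller behaves like the second case, or maps some bytes to chars above 0xFF, the cast cannot recover the text. That would be consistent with output that stays scrambled even after the extra UTF-8 step.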
Let me know if this helps.
Regards, Henri
Hi Henri, I added the encoding step, which takes the Tesseract output string as a byte array and decodes it as UTF-8. However, I didn't get the expected result; I still see scrambled characters on my side.
So I attached the demo project and test data here (MaApp.zip). Please help check if something is wrong.
Thanks for your reply. Best Regards, Joe
Is the image from the app or from the dev terminal?
Can you paste the string here as text?
The image as input is here:
And the output string is as follows: % x K H } N bO
Strangely enough, the string that the clipboard grabbed skipped all the question-mark-in-diamond characters, as shown in the image:
Below is the string before the additional UTF8 encoding process, it's almost the same as the Encoded one without line break: % x ¹ K H } N bO
It seems like the re-encoding doesn't map the output to the right characters. Could you please try to reproduce the output on your side? Thanks.
I'll test tomorrow if I have time. It looks very likely to be an encoding problem; the text might still need to be in a different encoding than UTF-8.
Is the text Chinese, or which language is it? I might need this information later.
Yes, both the page and trained data are in Simplified Chinese.
Below are some Simplified Chinese characters for testing purposes. You may just snip an image:
《静夜思》 床前明月光,疑是地上霜。 举头望明月,低头思故乡。
Best Regards, Joe
Hi,
I think I have figured out how to fix the problem.
DllImport with a string return type seems to break the decoding. It probably replaces bytes that cannot be decoded with the specified CharSet with some other character, which changes the byte values.
TesseractApi.cs
// This is old call
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessBaseAPIGetUTF8Text", CharSet = CharSet.Ansi)]
public static extern string GetUTF8Text(HandleRef handle);
Setting the return type to a char pointer and handling the string marshalling afterwards seems to fix the problem (image at the start of this comment). I think I'll add a new method to the native API calls and handle the string conversion after the call, as below.
TesseractApi.cs
// Probably new call will be this
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessBaseAPIGetUTF8Text")]
public unsafe static extern char* GetUTF8Text_Unsafe(HandleRef handle);
TessPage.cs
unsafe
{
char* ptr = TesseractApi.GetUTF8Text_Unsafe(Engine.Handle);
string? result = Marshal.PtrToStringUTF8(new IntPtr(ptr));
}
I still have to test things to see that everything works. The changes must be made at my library level, so there is no quick and easy fix that you can apply yourself right now, but I'll try to get the change out quickly.
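The pointer-based idea can be tried without the native library. The sketch below is only a stand-in under stated assumptions: Marshal.StringToCoTaskMemUTF8 plays the role of the buffer that TessBaseAPIGetUTF8Text would return, and Marshal.PtrToStringUTF8 reads it back. The class and method names are hypothetical, and it uses IntPtr rather than char* so that no unsafe block is required.

```csharp
using System;
using System.Runtime.InteropServices;

public static class Utf8PointerDemo
{
    // Hypothetical stand-in for the native call: returns a pointer to a
    // NUL-terminated UTF-8 buffer, as a C function returning char* would.
    public static IntPtr FakeNativeGetUtf8Text(string text)
    {
        return Marshal.StringToCoTaskMemUTF8(text);
    }

    // Decode the buffer explicitly as UTF-8 instead of relying on the
    // CharSet-based string marshalling of DllImport.
    public static string ReadUtf8(IntPtr ptr)
    {
        return Marshal.PtrToStringUTF8(ptr) ?? string.Empty;
    }

    public static void Main()
    {
        const string original = "举头望明月,低头思故乡。";
        IntPtr ptr = FakeNativeGetUtf8Text(original);
        try
        {
            Console.WriteLine(ReadUtf8(ptr) == original); // True
        }
        finally
        {
            // Free with the same allocator that produced the buffer.
            Marshal.FreeCoTaskMem(ptr);
        }
    }
}
```

With the real library the buffer is allocated by Tesseract, not by CoTaskMem. The Tesseract C API exposes TessDeleteText for releasing such text, so a real binding would presumably call that rather than FreeCoTaskMem.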
Henri
Hi Henri,
That's great! Thanks for all your effort.
Best Regards, Joe
👍 Thanks, Henri! Hope to see the updated NuGet package soon.
Hi Henri, Thanks a lot for the great library.
I put chi_sim.traineddata into the project and configured it in CreateMauiApp accordingly, but the test output contains scrambled characters.
It looks like an encoding/decoding issue. Is there any way to configure the encoding?
Thanks a lot. Best Regards, Joe