Configure encoding/decoding to get non-ASCII characters (e.g., Chinese/Japanese/Korean etc.)

SunshineSpring666 commented 8 months ago

Hi Henri, Thanks a lot for the great library.

I put chi_sim.traineddata into the project and configured it in CreateMauiApp accordingly, but the test output contains scrambled characters.

It looks like an encoding/decoding issue. Are there any ways to make such configuration?

Thanks a lot. Best Regards, Joe

henrivain commented 8 months ago

Hello,

Yeah, it sounds like an encoding issue. I also had one on Windows. The library gets its recognized text from the native Tesseract library method TESS_API char *TessBaseAPIGetUTF8Text(TessBaseAPI *handle); I think DllImport might convert the string to a different encoding. On Windows, I converted the string to a byte array and decoded it as UTF-8 with the code below in TessPage.

result is the string that is returned from recognition.

The Encoding class is from the System.Text namespace.

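// result is the string returned by the DllImport call; it may be null.
if (result is null)
{
    return string.Empty;
}
// Copy the low byte of each char back into a byte array
// (assumes marshalling produced one char per native byte).
var bytes = new byte[result.Length];
for (int i = 0; i < result.Length; i++)
{
    bytes[i] = (byte)result[i];
}
try
{
    // Reinterpret the raw bytes as UTF-8 text.
    return Encoding.UTF8.GetString(bytes);
}
catch (Exception ex)
{
    throw new InvalidBytesException("Cannot encode current byte array, because it contains invalid bytes.", ex);
}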

It shouldn't matter at which point you change the string encoding. The library does not change it after the DllImport call.
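As a rough illustration of when this byte-copy workaround can and cannot recover the text, here is a minimal sketch. It assumes .NET 5 or later (for Encoding.Latin1); MojibakeDemo and ReinterpretAsUtf8 are hypothetical names, not part of the library. Latin-1 and ASCII are used only as stand-ins for a lossless and a lossy marshalling step; the encoding that DllImport actually applies depends on the platform.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Rebuilds a byte array from the low 8 bits of each char,
    // then decodes those bytes as UTF-8 (same idea as the snippet above).
    public static string ReinterpretAsUtf8(string marshalled)
    {
        var bytes = new byte[marshalled.Length];
        for (int i = 0; i < marshalled.Length; i++)
        {
            bytes[i] = (byte)marshalled[i];
        }
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        const string expected = "床前明月光";
        byte[] native = Encoding.UTF8.GetBytes(expected);

        // Lossless stand-in: Latin-1 maps every byte to the char with the same value,
        // so the original bytes can be rebuilt and decoded correctly.
        string viaLatin1 = Encoding.Latin1.GetString(native);
        Console.WriteLine(ReinterpretAsUtf8(viaLatin1) == expected);

        // Lossy stand-in: ASCII decoding replaces every byte above 0x7F with '?',
        // so the original byte values are gone before the workaround runs.
        string viaAscii = Encoding.ASCII.GetString(native);
        Console.WriteLine(ReinterpretAsUtf8(viaAscii) == expected);
    }
}
```

If the marshalling step already substituted characters, as in the second case, no re-encoding afterwards can bring the original bytes back.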

Let me know if this helps.

Regards, Henri

SunshineSpring666 commented 8 months ago

Hi Henri, I added the encoding step, which takes the Tesseract output string as a byte array and decodes it as UTF-8. However, I didn't get the expected result; the output is still scrambled characters on my side.

So I attached the demo project and test data here (MaApp.zip). Please help check if something is wrong.

Thanks for your reply. Best Regards, Joe

OutputScreenshot

henrivain commented 8 months ago

Is the image from app or dev terminal?

Can you copy the string to here as text?

SunshineSpring666 commented 8 months ago

The image as input is here: InputImage

And the output string is as follows: % x K H } N bO

Strangely enough, the string grabbed by the clipboard skipped all the question-mark-in-diamond characters that are shown in the image: encoded_output_str

Below is the string before the additional UTF8 encoding process, it's almost the same as the Encoded one without line break: % x ¹ K H } N bO

tesseract_output

It seems the re-encoding doesn't map the output to the right characters. Could you please reproduce the output on your side? Thanks.

henrivain commented 8 months ago

I'll test tomorrow if I have time. It looks very likely to be an encoding problem; the text might still need to be in a different encoding than UTF-8.

henrivain commented 8 months ago

Is the text Chinese, or which language is it? I might need this information later.

SunshineSpring666 commented 8 months ago

Yes, both the page and trained data are in Simplified Chinese.

Below are some Simplified Chinese characters for testing purposes. You may just snip an image:

《静夜思》床前明月光，疑是地上霜。举头望明月，低头思故乡。

Best Regards, Joe

henrivain commented 8 months ago

Hi,

I think I have figured out how to fix the problem.

DllImport with a string return type seems to mess up the encoding. It probably replaces characters that can't be represented in the defined CharSet with some other character, and so changes the byte values.

TesseractApi.cs

// This is old call
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessBaseAPIGetUTF8Text", CharSet = CharSet.Ansi)]
public static extern string GetUTF8Text(HandleRef handle);

Setting the return type to a char pointer and handling string marshalling later seems to fix the problem (image at the start of this comment). I think I'll add a new method to the native API calls and handle the string conversion after the call, as below.

TesseractApi.cs

// Probably new call will be this
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl, EntryPoint = "TessBaseAPIGetUTF8Text")]
public unsafe static extern char* GetUTF8Text_Unsafe(HandleRef handle);

TessPage.cs

unsafe
{
     char* ptr = TesseractApi.GetUTF8Text_Unsafe(Engine.Handle);
     string? result = Marshal.PtrToStringUTF8(new IntPtr(ptr));
}
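To show the pointer-based approach in isolation, here is a small sketch that runs without the native Tesseract library. FakeNativeGetUtf8Text is a hypothetical stand-in for the P/Invoke call: it just allocates a null-terminated UTF-8 buffer in unmanaged memory, which is the kind of buffer a char* return value points to. PtrToStringDemo and ReadAndFree are also illustrative names only.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PtrToStringDemo
{
    // Hypothetical stand-in for the native call: returns a pointer to a
    // null-terminated UTF-8 buffer allocated in unmanaged memory.
    public static IntPtr FakeNativeGetUtf8Text(string text)
    {
        return Marshal.StringToCoTaskMemUTF8(text);
    }

    // Decodes the buffer as UTF-8 and always releases it afterwards.
    public static string ReadAndFree(IntPtr ptr)
    {
        try
        {
            // Reads bytes up to the first null terminator and decodes them as UTF-8.
            return Marshal.PtrToStringUTF8(ptr) ?? string.Empty;
        }
        finally
        {
            // Matches the allocator used by the stand-in above.
            Marshal.FreeCoTaskMem(ptr);
        }
    }

    public static void Main()
    {
        IntPtr ptr = FakeNativeGetUtf8Text("举头望明月，低头思故乡。");
        Console.WriteLine(ReadAndFree(ptr));
    }
}
```

With the real native call, the buffer is allocated by Tesseract, so it would presumably need to be released through the native API (the Tesseract C API has TessDeleteText for this) rather than FreeCoTaskMem. How the library ends up wiring that in is an assumption here.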

I still have to test things to see that everything works. The changes must be made at the library level, so there is no quick, easy fix that you can do yourself now, but I'll try to get the change out quickly.

Henri

SunshineSpring666 commented 8 months ago

Hi Henri,

That's great! Thanks for all your effort.

Best Regards, Joe

SunshineSpring666 commented 8 months ago

👍 Thanks, Henri! Hope to see the updated NuGet package soon.

henrivain / TesseractOcrMaui

Configure encoding/decoding to get non-ASCII characters (e.g., Chinese/Japanese/Korean etc.) #38