henrivain / TesseractOcrMaui

Tesseract wrapper for Windows, Android and iOS for .NET MAUI
Apache License 2.0
37 stars 4 forks

Text whitelist #15

Closed adiamante closed 1 year ago

adiamante commented 1 year ago

Hi,

Thank you for making this. Would you be able to add text whitelist capability?

henrivain commented 1 year ago

Hello,

It seems to be possible.

The package should already include everything needed to use a whitelist in your recognition. It is not exposed through the injectable ITesseract interface, but you can use it from the TesseractOcrMaui.TessEngine class.

You can find a constructor with the signature public TessEngine(string languages, string traineddataPath, EngineMode mode, IDictionary<string, object> initialOptions, ILogger? logger = null)

You can pass optional configuration parameters in the IDictionary "initialOptions".

According to a Stack Overflow answer, it seems that you can set whitelisted characters this way.

Add an entry to your options with the key "tessedit_char_whitelist" and a string of the whitelisted characters as the value.

See Stack Overflow.
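As a rough, untested sketch of the steps above: the helper name BuildWhitelistOptions is hypothetical, and only the dictionary key and the constructor signature come from this thread. The engine call is left as a comment because the traineddata folder and the engine mode depend on your setup.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds the initialOptions dictionary for the
// TessEngine constructor quoted above.
static IDictionary<string, object> BuildWhitelistOptions(string allowedChars)
{
    if (string.IsNullOrEmpty(allowedChars))
    {
        throw new ArgumentException("Whitelist must contain at least one character.", nameof(allowedChars));
    }

    return new Dictionary<string, object>
    {
        // Key taken from the thread; the value is the set of characters to allow.
        ["tessedit_char_whitelist"] = allowedChars,
    };
}

var options = BuildWhitelistOptions("0123456789");

// Then pass `options` as the initialOptions argument, roughly:
// using var engine = new TessEngine("eng", traineddataFolder, mode, options);
// (traineddataFolder and mode are placeholders for your own values.)
```

Keeping the dictionary construction separate makes it easy to check the options before the engine is created.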

After running the constructor

You can use Tesseract in the normal way. There is an example here, around row 200.

Note that

I have not been able to verify that this works yet. I will try to test it as soon as I can.

Regards, Henri Vainio

adiamante commented 1 year ago

Hi @henrivain,

No rush. For my use case, I'm looking for something like

tesseract.setVariable("tessedit_char_whitelist","ABCDEFGHIJKLMNOPQRSTUVWXYZ");

since I need to update the whitelist during runtime.

henrivain commented 1 year ago

Hi @adiamante

There is a method TesseractOcrMaui.TessEngine.SetVariable(string name, string value)

Line 166

I think this is what you are looking for.

Something like this:

using var engine = new TessEngine("eng");
bool success = engine.SetVariable("tessedit_char_whitelist", "mychars");
using var image = Pix.LoadFromFile(@"c:\to\file.png");
using var result = engine.ProcessImage(image);
string text = result.GetText();

adiamante commented 1 year ago

Hey @henrivain,

With that sample code, is there a way for me to initialize the engine with a traineddata file stored as a MAUI raw asset? I'm getting the following when initializing TessEngine:

{TesseractOcrMaui.Exceptions.TesseractInitException: Cannot initialize Tesseract Api
 ---> System.InvalidOperationException: No traineddata files found from path. Do you have correct path and file names?
   --- End of inner exception stack trace ---
   at TesseractOcrMaui.TessEngine.Initialize(String languages, String traineddataPath, EngineMode mode, IDictionary`2 initialOptions)
   at TesseractOcrMaui.TessEngine..ctor(String languages, String traineddataPath, EngineMode mode, IDictionary`2 initialOptions, ILogger logger)
   at TesseractOcrMaui.TessEngine..ctor(String languages, String traineddataPath, ILogger logger)
   at YeetMacro2.Platforms.Android.Services.AndroidWindowManagerService..ctor(ILogger`1 logger, MediaProjectionService mediaProjectionService, IToastService toastService) in C:\Users\Desktop\Desktop\kappagacha\yeetmacro2\YeetMacro2\Platforms\Android\Services\AndroidWindowManagerService.cs:line 81}

I've also tried giving it the paths Raw and Resources/Raw, to no avail. For context, I am attempting this on Android.

adiamante commented 1 year ago

Got it to work with the following after looking at some code from the repository.

// Blocks until the asset stream is open (called from a synchronous constructor here).
public static Stream GetAssetStream(string path)
{
    return FileSystem.OpenAppPackageFileAsync(path).Result;
}

var traineddataPath = Path.Combine(FileSystem.Current.CacheDirectory, "eng.traineddata");
if (!File.Exists(traineddataPath))
{
    // Dispose both streams so the file is fully written before TessEngine reads it.
    using var traineddata = ServiceHelper.GetAssetStream("eng.traineddata");
    using var fileStream = File.Create(traineddataPath);
    traineddata.CopyTo(fileStream);
}

_tessEngine = new TessEngine("eng", FileSystem.Current.CacheDirectory);

henrivain commented 1 year ago

Great that you got it to work!

I have now been able to test it myself as well, and I also got it to work.

Exception you got

You got the exception because the traineddata was not loaded. Automatic traineddata loading is only available in the async methods of the high-level ITesseract API. If the TessEngine API is used directly, the tessdata must also be loaded manually. But you figured it out, so all is well.
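For anyone loading the traineddata manually without blocking on .Result, a non-blocking variant could look roughly like the sketch below. EnsureTraineddataAsync is a hypothetical helper, not part of the package; it only uses standard .NET file APIs and takes the stream source as a delegate so it does not depend on MAUI itself.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper: copies a traineddata stream into targetFolder once
// and returns the folder, which can then be given to TessEngine.
static async Task<string> EnsureTraineddataAsync(
    Func<Task<Stream>> openSource, string targetFolder, string fileName)
{
    Directory.CreateDirectory(targetFolder);
    string targetPath = Path.Combine(targetFolder, fileName);

    if (!File.Exists(targetPath))
    {
        // Both streams are disposed at the end of this block, so the file
        // is flushed and closed before anything else opens it.
        await using Stream source = await openSource();
        await using FileStream target = File.Create(targetPath);
        await source.CopyToAsync(target);
    }

    return targetFolder;
}

// In a MAUI app the call might look like this (assumes eng.traineddata
// is bundled as a raw asset, as in the comment above):
// string folder = await EnsureTraineddataAsync(
//     () => FileSystem.OpenAppPackageFileAsync("eng.traineddata"),
//     FileSystem.Current.CacheDirectory,
//     "eng.traineddata");
// var engine = new TessEngine("eng", folder);
```

Note that a copy interrupted halfway would leave a partial file that the existence check then accepts, so a production version may want to copy to a temporary name and rename afterwards.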

Coming features

I will try to add an easier way to configure the engine at runtime through the ITesseract interface; see issue #16.

This issue seems to be figured out now

I will close the issue with this comment. If you have any questions or other problems with the package, don't hesitate to contact me. I hope this package helps you in your projects!

-HV

adiamante commented 1 year ago

Hey @henrivain

The following works on an android emulator but fails on a physical device:

var imageData = {{my byte array}};
var page = Pix.LoadFromMemory(imageData);

I'm getting the following exception:

System.IO.IOException: 'Failed to load image from memory.'

Any ideas on how I can troubleshoot this?

henrivain commented 1 year ago

I can reproduce this. I think the problem comes from the native libraries, so I have to investigate them further. I am moving this new problem to its own issue, because it is no longer related to whitelisting. If you have anything to add about it, please post it in the new issue #17. Can you specify the image type/extension you are using?
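If the byte array comes from a screen capture or another in-memory source and the file extension is not obvious, one way to answer the question about the image type is to look at the first few bytes. The sketch below is not part of the package; DetectImageFormat and StartsWith are hypothetical helpers that only check well-known file signatures.

```csharp
using System;

// Returns true when data begins with the given signature bytes.
static bool StartsWith(byte[] data, params byte[] prefix)
{
    if (data.Length < prefix.Length)
    {
        return false;
    }
    for (int i = 0; i < prefix.Length; i++)
    {
        if (data[i] != prefix[i])
        {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: guesses the container format from its signature.
static string DetectImageFormat(byte[] data)
{
    if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return "png";
    if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return "jpeg";
    if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return "gif";
    if (StartsWith(data, 0x42, 0x4D)) return "bmp";
    if (StartsWith(data, 0x49, 0x49, 0x2A, 0x00) || StartsWith(data, 0x4D, 0x4D, 0x00, 0x2A)) return "tiff";
    return "unknown";
}

// Example with the first bytes of a JPEG file.
Console.WriteLine(DetectImageFormat(new byte[] { 0xFF, 0xD8, 0xFF, 0xE0 })); // prints "jpeg"
```

An "unknown" result would suggest the array holds raw pixel data rather than an encoded image file, which is worth mentioning in the new issue.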

adiamante commented 1 year ago

Thanks @henrivain