houqp / leptess

Productive and safe Rust binding for leptonica and tesseract
https://houqp.github.io/leptess/leptess/index.html
MIT License

[REQ] Allow injection of config parameters/files for tesseract #33

Open mlieberman85 opened 3 years ago

mlieberman85 commented 3 years ago

Unless I'm missing something, there doesn't seem to be a way to pass config parameters (e.g. --psm 3 --oem 0) or to refer to a config file in the high-level Leptess API.

ccouzens commented 3 years ago

You're right. It's not there today.

It looks like PSM (PageSegMode) would correspond to https://tesseract-ocr.github.io/tessapi/5.x/a00008.html#a7393e8cb70161c588eff1dbb5e97e4d5

And OEM (OCR engine mode) would need to use a different init function https://tesseract-ocr.github.io/tessapi/5.x/a00008.html#a75e22aabb144f06f07741188df3cc41a
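To make the mapping from the command-line flags concrete, here is a minimal sketch of how `--psm 3 --oem 0` could be turned into typed options before being handed to a binding. It is not leptess's API: `OcrEngineMode`, `OcrOptions` and `parse_cli_flags` are hypothetical names, and the numeric values and defaults (PSM 3, OEM 3) are assumed from the tesseract CLI help text.

```rust
/// Mirrors the numeric values accepted by the tesseract CLI's `--oem` flag
/// (assumed from the CLI help text, not taken from leptess).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrEngineMode {
    TesseractOnly = 0,
    LstmOnly = 1,
    TesseractLstmCombined = 2,
    Default = 3,
}

impl OcrEngineMode {
    /// Converts a CLI-style number into an engine mode, if it is in range.
    pub fn from_cli(value: u32) -> Option<Self> {
        match value {
            0 => Some(Self::TesseractOnly),
            1 => Some(Self::LstmOnly),
            2 => Some(Self::TesseractLstmCombined),
            3 => Some(Self::Default),
            _ => None,
        }
    }
}

/// Hypothetical options struct a high-level API could accept.
#[derive(Debug, PartialEq, Eq)]
pub struct OcrOptions {
    /// Page segmentation mode, 0 through 13 on the CLI.
    pub psm: u32,
    pub oem: OcrEngineMode,
}

/// Parses `--psm N` and `--oem N` pairs; any other flag is rejected.
pub fn parse_cli_flags(args: &[&str]) -> Result<OcrOptions, String> {
    // Start from the defaults documented by the CLI (an assumption here).
    let mut opts = OcrOptions { psm: 3, oem: OcrEngineMode::Default };
    let mut it = args.iter();
    while let Some(flag) = it.next() {
        let value = it
            .next()
            .ok_or_else(|| format!("missing value for {flag}"))?;
        let n: u32 = value
            .parse()
            .map_err(|_| format!("invalid number for {flag}: {value}"))?;
        match *flag {
            "--psm" if n <= 13 => opts.psm = n,
            "--oem" => {
                opts.oem = OcrEngineMode::from_cli(n)
                    .ok_or_else(|| format!("unknown OEM {n}"))?
            }
            _ => return Err(format!("unsupported flag or value: {flag} {value}")),
        }
    }
    Ok(opts)
}
```

With this sketch, `parse_cli_flags(&["--psm", "3", "--oem", "0"])` yields PSM 3 with `OcrEngineMode::TesseractOnly`. A binding would still need to pass the PSM to the page-segmentation setter and the OEM to the init function linked above.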

I don't know if there is a C function to use a config file.

I'll work on these, but I'm planning to do some refactoring first: https://github.com/houqp/leptess/issues/34

mlieberman85 commented 3 years ago

Thanks! I am not very experienced in Rust, especially with unsafe, but if you pointed me in the right direction I could take a crack at it after you've started your refactor.

From my reading of the C++, the "configs" parameter in the init functions actually refers to config files, i.e. config files under TESSDATA_PREFIX. I traced "configs" to: https://github.com/tesseract-ocr/tesseract/blob/2dfa38a0728b30485a7137d140724f014dc6b5d6/src/ccmain/tessedit.cpp#L365-L380

The API also does have: https://tesseract-ocr.github.io/tessapi/5.x/a00008.html#a19e00633eb5ea36356fa02b3f3b694a3

I couldn't find a way to pass in individual config values, such as tessedit_char_whitelist, the way you can on the command line.
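For reference, a rough sketch of what reading such a config file into name/value pairs could look like on the Rust side. It assumes the usual tesseract config layout of one `name value` pair per line, with blank lines and lines starting with `#` skipped; `parse_config` is a hypothetical helper, not something leptess or tesseract provides.

```rust
/// Parses tesseract-style config text into (name, value) pairs.
/// Assumed format: one `name value` pair per line; blank lines and
/// lines starting with `#` are ignored.
pub fn parse_config(text: &str) -> Vec<(String, String)> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(|line| match line.split_once(char::is_whitespace) {
            // Everything after the first whitespace run is the value.
            Some((name, value)) => (name.to_string(), value.trim().to_string()),
            // A bare name with no value maps to an empty string.
            None => (line.to_string(), String::new()),
        })
        .collect()
}
```

Each pair could then be forwarded to whatever per-variable setter the binding ends up exposing, which would cover both config files and one-off values like tessedit_char_whitelist.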

mlieberman85 commented 3 years ago

For the time being I worked around it using https://github.com/antimatter15/tesseract-rs:

let mut tba = TessBaseAPI::new();
tba.init_4(None, Some(&CString::new("eng")?), tesseract_sys::TessOcrEngineMode_OEM_TESSERACT_ONLY)?;