houqp / leptess

Productive and safe Rust binding for leptonica and tesseract
https://houqp.github.io/leptess/leptess/index.html
MIT License

different result with tesseract cli and leptess wrapper #41

Open tcastelly opened 2 years ago

tcastelly commented 2 years ago

Hello,

Thank you for this work!

I'm seeing some curious behavior. When I try to retrieve the text from the image below on the command line:

time tesseract image.jpg output  

I get this result:

Coco Adel

But when I use the wrapper

fn main() {
    let mut lt = leptess::LepTess::new(Some("./tests"), "eng").unwrap();
    // let mut lt = leptess::LepTess::new(None, "eng").unwrap();
    lt.set_image("image.jpg");
    println!("{}", lt.get_utf8_text().unwrap());
}

I get:

rh

I've tried using the traineddata from this repository, and also passing no data path at all, but the result is the same.

Maybe the command line uses its own default parameters.

Thanks in advance

image

houqp commented 2 years ago

How did you install tesseract and libtesseract? What version of tesseract do you have?

tcastelly commented 2 years ago

Thank you for your answer.

I'm on Arch Linux. I installed:

pacman -S tesseract leptonica tesseract-data-eng

tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.5.2 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0

ccouzens commented 2 years ago

My tesseract was installed through Fedora's dnf install tesseract command:

tesseract 4.1.3
 leptonica-1.81.1
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

My tesseract command gives the expected Coco Adel output. Through leptess, I also get rh\n.

Converting the image to a PNG changed leptess's output slightly, to "Nr\n".

I created a new image with the same resolution and similarly sized text, and leptess was able to parse it correctly.

issue_41

I don't know why the command and API have different behaviour on your image. It may be worth checking to see if the command sets any additional options.

houqp commented 2 years ago

Yeah, most likely the command line uses a different set of default options :(

ongchi commented 2 years ago

The default page seg mode in leptess is 6, which is single block mode, while the default for the tesseract CLI is 3, which is auto.

Setting this variable manually gives the same result as the CLI:

lt.set_variable(Variable::TesseditPagesegMode, "3").unwrap();

So maybe the default page seg mode in leptess should be set to 3, to be consistent with the tesseract CLI and to keep people from getting unexpected results.

FYI, the CLI sets the default page seg mode to PSM_AUTO:

https://github.com/tesseract-ocr/tesseract/blob/be15b46c609e6d50f1665345d6e6fc128462593c/src/tesseract.cpp#L650

But the library defaults to PSM_SINGLE_BLOCK:

https://github.com/tesseract-ocr/tesseract/blob/be15b46c609e6d50f1665345d6e6fc128462593c/include/tesseract/publictypes.h#L166
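To avoid passing a bare "3" around, the two modes discussed above could be named in a small helper. This is only a sketch: the Psm enum and as_variable_value are hypothetical names, not part of leptess, and the numbers are assumed to match the PageSegMode values in Tesseract's publictypes.h. The leptess call itself is left in a comment so the snippet compiles without the crate.

```rust
/// Hypothetical helper, not part of leptess: a few Tesseract page
/// segmentation modes, numbered as in PageSegMode (publictypes.h).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(dead_code)]
enum Psm {
    /// PSM_AUTO: fully automatic page segmentation (the CLI default).
    Auto = 3,
    /// PSM_SINGLE_BLOCK: assume a single uniform block of text (the library default).
    SingleBlock = 6,
    /// PSM_SINGLE_LINE: treat the image as a single text line.
    SingleLine = 7,
}

impl Psm {
    /// Render the mode as the string value expected by tessedit_pageseg_mode.
    fn as_variable_value(self) -> String {
        (self as i32).to_string()
    }
}

fn main() {
    let mode = Psm::Auto;

    // With leptess, the value would be passed like the call shown earlier:
    // lt.set_variable(Variable::TesseditPagesegMode, &mode.as_variable_value()).unwrap();
    println!("tessedit_pageseg_mode = {}", mode.as_variable_value());
}
```

Setting the mode right after creating the LepTess instance, before asking for any text, should make the wrapper behave like the CLI for this image.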