Text not Detecting in conversation

sairash commented 5 months ago

Tesseract.js version 5.0.4

Describe the bug For some reason, when I run it on a screenshot of a conversation, it does not detect the messages inside the blue containers.

To Reproduce Steps to reproduce the behavior: just use the test image

Expected behavior It needs to be

Device Version:

Linux + Arch
Browser [Brave]

Kishlay-notabot commented 4 months ago

It works when you crop the specific blue part of the image and try to detect it. I am using a web app to detect the text [uses tesseract.js]
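The cropping approach can also be tried without editing the image by hand. The sketch below assumes a Tesseract.js worker is passed in by the caller and uses the `rectangle` option of `worker.recognize`; the helper names `padRectangle` and `recognizeRegion`, and any bubble coordinates you feed them, are hypothetical and only for illustration.

```javascript
// Grow a bounding box by `pad` pixels on every side, clamped to the image size.
// `box` is { left, top, width, height } in pixels (hypothetical bubble coordinates).
function padRectangle(box, pad, imageWidth, imageHeight) {
  const left = Math.max(0, box.left - pad);
  const top = Math.max(0, box.top - pad);
  const right = Math.min(imageWidth, box.left + box.width + pad);
  const bottom = Math.min(imageHeight, box.top + box.height + pad);
  return { left, top, width: right - left, height: bottom - top };
}

// Recognize only one region of the image.
// `worker` is assumed to be a Tesseract.js worker created by the caller.
async function recognizeRegion(worker, image, rectangle) {
  const { data } = await worker.recognize(image, { rectangle });
  return data.text;
}
```

Calling `recognizeRegion` once per message bubble means each region is binarized separately, which matches the observation that the cropped blue part is recognized.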

I suppose Tesseract picks something like a single threshold for the whole image when it tries to detect text in it.
The engine probably scans the whole image, and the highest-contrast text in it is the white text on the grey background near the top, so maybe it takes that as a relative reference for the rest of the image? I suppose it tries to find the text that is easiest to detect, i.e. the text with the highest contrast against its background. I have no prior experience with or knowledge of the engine's internals, but I think it might work the way I just hypothesized.

I tried converting the image to grayscale before running Tesseract OCR on it, but again the results aren't what we expect.
Below is the image converted to grayscale and processed as a whole; the words still aren't recognized.
grayscale image

When I apply binarization to the grayscale image, the result roughly matches my hypothesis: the text in the blue containers is not visible at all. So maybe Tesseract runs some kind of uniformity-inducing or binarization preprocessing step before running OCR. [I feel I may be wrong]
binarized image
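To make the hypothesis concrete, here is a minimal sketch of a textbook global Otsu threshold applied to a synthetic "screenshot". It is not Tesseract's own implementation, and the colors and pixel counts (dark background, blue bubble, a little white text) are assumptions chosen only to illustrate the effect.

```javascript
// Convert one RGB pixel to an 8-bit gray value (Rec. 601 luma weights).
function toGray(r, g, b) {
  return Math.round(0.299 * r + 0.587 * g + 0.114 * b);
}

// Textbook global Otsu: pick the threshold that maximizes between-class variance.
function otsuThreshold(gray) {
  const hist = new Array(256).fill(0);
  for (const v of gray) hist[v] += 1;

  let sumAll = 0;
  for (let i = 0; i < 256; i++) sumAll += i * hist[i];

  let countBelow = 0;
  let sumBelow = 0;
  let bestScore = -1;
  let bestT = 0;
  for (let t = 0; t < 255; t++) {
    countBelow += hist[t];
    sumBelow += t * hist[t];
    const countAbove = gray.length - countBelow;
    if (countBelow === 0 || countAbove === 0) continue;
    const meanBelow = sumBelow / countBelow;
    const meanAbove = (sumAll - sumBelow) / countAbove;
    // Proportional to the between-class variance for this split.
    const score = countBelow * countAbove * (meanBelow - meanAbove) ** 2;
    if (score > bestScore) {
      bestScore = score;
      bestT = t;
    }
  }
  return bestT;
}

// Synthetic image (assumed values): mostly dark background,
// a blue bubble, and a small amount of white text inside the bubble.
const backgroundGray = toGray(32, 32, 32);
const bubbleGray = toGray(0, 132, 255);
const textGray = toGray(255, 255, 255);
const pixels = [
  ...new Array(700).fill(backgroundGray),
  ...new Array(280).fill(bubbleGray),
  ...new Array(20).fill(textGray),
];

const threshold = otsuThreshold(pixels);
const isForeground = (v) => v > threshold;

// The bubble and its text land on the same side of the threshold,
// so the text disappears into the bubble after binarization.
console.log(threshold, isForeground(bubbleGray), isForeground(textGray));
```

With these made-up proportions, the single threshold separates the dark background from everything else, which is one way the messages in the blue containers could vanish.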

sairash commented 4 months ago

I also tried different filters and other preprocessing. Nothing seems to work. But when I use tesseract-wasm, it detects the text.

swappy-20240205_221940

But it is unreliable with low-resolution pictures, and the same problem comes back when the image has any white borders.

swappy-20240205_222121

Balearica commented 4 months ago

Tesseract.js includes an output option that allows you to retrieve the actual binarized image recognized by Tesseract. An example site using this option can be found here. Sure enough, as speculated by @Kishlay-notabot, that confirms that the messages in blue are being erased by the binarization process.

download (50)

Using this example code, you should be able to experiment with Tesseract's binarization options. These are not documented in this repo, however you can find them in the main Tesseract project's repo, and I pasted the descriptions from the code below. I have not used these options before, so am not sure what (if any) options would improve results with this screenshot. If none of these options work, you would need to either (1) binarize the image properly yourself before sending to Tesseract or (2) crop the images to specific messages before processing.

    , INT_MEMBER(thresholding_method,
                 static_cast<int>(ThresholdMethod::Otsu),
                 "Thresholding method: 0 = Otsu, 1 = LeptonicaOtsu, 2 = "
                 "Sauvola",
                 this->params())
    , BOOL_MEMBER(thresholding_debug, false,
                  "Debug the thresholding process",
                  this->params())
    , double_MEMBER(thresholding_window_size, 0.33,
                    "Window size for measuring local statistics (to be "
                    "multiplied by image DPI). "
                    "This parameter is used by the Sauvola thresholding method",
                    this->params())
    , double_MEMBER(thresholding_kfactor, 0.34,
                    "Factor for reducing threshold due to variance. "
                    "This parameter is used by the Sauvola thresholding method."
                    " Normal range: 0.2-0.5",
                    this->params())
    , double_MEMBER(thresholding_tile_size, 0.33,
                    "Desired tile size (to be multiplied by image DPI). "
                    "This parameter is used by the LeptonicaOtsu thresholding "
                    "method",
                    this->params())
    , double_MEMBER(thresholding_smooth_kernel_size, 0.0,
                    "Size of convolution kernel applied to threshold array "
                    "(to be multiplied by image DPI). Use 0 for no smoothing. "
                    "This parameter is used by the LeptonicaOtsu thresholding "
                    "method",
                    this->params())
    , double_MEMBER(thresholding_score_fraction, 0.1,
                    "Fraction of the max Otsu score. "
                    "This parameter is used by the LeptonicaOtsu thresholding "
                    "method. "
                    "For standard Otsu use 0.0, otherwise 0.1 is recommended",
                    this->params())
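One way to experiment with the parameters listed above is to run the same image through each thresholding method and compare the binarized output. This is a rough sketch, not tested against this screenshot: it assumes `worker` is a Tesseract.js worker created by the caller, that `worker.setParameters` forwards `thresholding_method` to Tesseract unchanged, and that the `imageBinary` output option returns the binarized image. The helper name `compareThresholding` is hypothetical.

```javascript
// Values taken from the parameter description: 0 = Otsu, 1 = LeptonicaOtsu, 2 = Sauvola.
const THRESHOLDING_METHODS = { Otsu: 0, LeptonicaOtsu: 1, Sauvola: 2 };

// Run recognition once per thresholding method and collect the text plus
// the binarized image that Tesseract actually recognized.
// `worker` is assumed to be a Tesseract.js worker supplied by the caller.
async function compareThresholding(worker, image) {
  const results = {};
  for (const [name, value] of Object.entries(THRESHOLDING_METHODS)) {
    // Assumption: arbitrary Tesseract variables can be set this way, as strings.
    await worker.setParameters({ thresholding_method: String(value) });
    const { data } = await worker.recognize(image, {}, { text: true, imageBinary: true });
    results[name] = { text: data.text, imageBinary: data.imageBinary };
  }
  return results;
}
```

Viewing each `imageBinary` result side by side should show whether any method keeps the text in the blue containers. The Sauvola and LeptonicaOtsu tuning parameters could be added to the same `setParameters` call under the same assumption.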

Balearica commented 4 months ago

It looks like the image is recognized perfectly, without needing to change any Tesseract.js settings, when it is first inverted. I don't know how generalizable this is, since messaging apps can differ between white on black, black on white, and mixed; however, inverting the image to black text on a light background solves the problem in this case.

text_1_invert
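For reference, inverting an image before recognition only takes a few lines. The sketch below works on a raw RGBA pixel buffer, such as the `data` array of a canvas `ImageData` object; the function name `invertRgba` and the sample pixel color are made up for illustration.

```javascript
// Invert the color channels of an RGBA pixel buffer, leaving alpha untouched.
// Returns a new buffer; the input is not modified.
function invertRgba(pixels) {
  const out = new Uint8ClampedArray(pixels.length);
  for (let i = 0; i < pixels.length; i += 4) {
    out[i] = 255 - pixels[i];         // red
    out[i + 1] = 255 - pixels[i + 1]; // green
    out[i + 2] = 255 - pixels[i + 2]; // blue
    out[i + 3] = pixels[i + 3];       // alpha
  }
  return out;
}

// Example: one white text pixel and one blue bubble pixel (assumed color).
const sample = new Uint8ClampedArray([255, 255, 255, 255, 0, 132, 255, 255]);
const inverted = invertRgba(sample);
console.log(Array.from(inverted));
```

In a browser, the buffer would typically come from `ctx.getImageData(...)` and be written back with `ctx.putImageData(...)` before the canvas is handed to Tesseract.js. White text becomes black, and a blue bubble becomes orange.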

Kishlay-notabot commented 4 months ago

@Balearica Strange behaviour. Try binarization on the inverted image. I think that in this case the text is black and the backgrounds are orange and grey respectively, which gives better contrast than the blue and white combination in the original image. Contrast is the main thing.

naptha / tesseract.js

Text not Detecting in conversation #883