Open sairash opened 5 months ago
It works when you crop the specific blue part of the image and try to detect it.
I am using a web app (built on tesseract.js) to detect the text.
I suppose Tesseract sets something like a threshold for the whole image when it tries to detect text, based on how easy the text is to detect.
The engine probably scans the whole image, and the highest-contrast text in the image above is the white text on the grey background, so maybe it takes that as a relative reference for the rest of the image? I suppose it looks for the text that is easiest to detect, i.e. the text with the highest contrast against its background. I have no prior experience with or knowledge of the engine's internals, but I think the program might work the way I just hypothesized.
I tried converting the image to grayscale before running Tesseract OCR on it, but the results still aren't what we expect.
Below is the image converted to grayscale and processed as a whole; the words still aren't recognized.
grayscale image
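For anyone who wants to reproduce the grayscale step by hand, here is a minimal sketch, assuming the pixels come as an RGBA byte array (for example from a canvas `getImageData` call). The function name `toGrayscale` and the sample blue `rgb(0, 132, 255)` are made up for illustration; the real bubble colour in the screenshot may differ. The weights are the common Rec. 601 luma weights, which is not necessarily what Tesseract uses internally.

```javascript
// Convert RGBA bytes to one grayscale byte per pixel.
// Assumption: Rec. 601 luma weights; Tesseract may weigh channels differently.
function toGrayscale(rgba) {
  const gray = new Uint8ClampedArray(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[4 * i];
    const g = rgba[4 * i + 1];
    const b = rgba[4 * i + 2];
    gray[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return gray;
}

// Hypothetical sample: one white text pixel and one blue bubble pixel.
const samplePixels = new Uint8ClampedArray([
  255, 255, 255, 255, // white text
  0, 132, 255, 255,   // assumed bubble blue
]);
const sampleGray = toGrayscale(samplePixels);
console.log(Array.from(sampleGray));
```

With these assumed colours the blue bubble ends up as a mid grey, much closer to the white text than a dark page background would be, so grayscale alone does not give the text in the blue bubble any extra contrast.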
When I apply binarization to the grayscale image, the result more or less matches my hypothesis: the blue text is not visible at all. So maybe Tesseract runs something like a uniformity-inducing or binarization step as preprocessing before running OCR. [I feel I may be wrong]
binarized image
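To illustrate how a single global threshold could swallow the text, here is a rough sketch of a textbook Otsu threshold on a synthetic grayscale histogram. This is not Tesseract's code path, and the pixel values (20 for a dark page background, 107 for a bubble, 255 for text) and their counts are invented for the example.

```javascript
// Textbook Otsu: pick the threshold that maximizes between-class variance.
function otsuThreshold(gray) {
  const hist = new Array(256).fill(0);
  for (const v of gray) hist[v] += 1;

  const total = gray.length;
  let sumAll = 0;
  for (let i = 0; i < 256; i++) sumAll += i * hist[i];

  let weightLow = 0;
  let sumLow = 0;
  let bestScore = -1;
  let bestT = 0;
  for (let t = 0; t < 256; t++) {
    weightLow += hist[t];
    sumLow += t * hist[t];
    if (weightLow === 0) continue;   // nothing at or below t yet
    const weightHigh = total - weightLow;
    if (weightHigh === 0) break;     // everything is at or below t
    const meanLow = sumLow / weightLow;
    const meanHigh = (sumAll - sumLow) / weightHigh;
    const score = weightLow * weightHigh * (meanLow - meanHigh) ** 2;
    if (score > bestScore) {
      bestScore = score;
      bestT = t;
    }
  }
  return bestT;
}

// Pixels above the threshold become white (255), the rest black (0).
function binarize(gray, threshold) {
  return Array.from(gray, (v) => (v > threshold ? 255 : 0));
}

// Synthetic image: 70 background pixels, 25 bubble pixels, 5 text pixels.
const synthetic = [
  ...new Array(70).fill(20),
  ...new Array(25).fill(107),
  ...new Array(5).fill(255),
];
const t = otsuThreshold(synthetic);
const binary = binarize(synthetic, t);
console.log(t, binary[70], binary[95]);
```

In this toy case the threshold lands between the background and the bubble, so the bubble and the text on it both come out white and the text disappears. Whether that is what actually happens inside Tesseract for this screenshot is a guess.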
I also tried different filters and other preprocessing. Nothing seems to work. But when I use tesseract-wasm, it detects the text.
However, it is unreliable on low-resolution pictures, and the same problem reappears when the image has any white borders.
Tesseract.js includes an output option that allows you to retrieve the actual binarized image recognized by Tesseract. An example site using this option can be found here. Sure enough, as speculated by @Kishlay-notabot, it confirms that the messages in blue are being erased by the binarization process.
Using this example code, you should be able to experiment with Tesseract's binarization options. These are not documented in this repo; however, you can find them in the main Tesseract project's repo, and I have pasted the descriptions from the code below. I have not used these options before, so I am not sure which (if any) would improve results with this screenshot. If none of them work, you would need to either (1) binarize the image properly yourself before sending it to Tesseract or (2) crop the image to specific messages before processing.
```cpp
, INT_MEMBER(thresholding_method,
             static_cast<int>(ThresholdMethod::Otsu),
             "Thresholding method: 0 = Otsu, 1 = LeptonicaOtsu, 2 = "
             "Sauvola",
             this->params())
, BOOL_MEMBER(thresholding_debug, false,
              "Debug the thresholding process",
              this->params())
, double_MEMBER(thresholding_window_size, 0.33,
                "Window size for measuring local statistics (to be "
                "multiplied by image DPI). "
                "This parameter is used by the Sauvola thresholding method",
                this->params())
, double_MEMBER(thresholding_kfactor, 0.34,
                "Factor for reducing threshold due to variance. "
                "This parameter is used by the Sauvola thresholding method."
                " Normal range: 0.2-0.5",
                this->params())
, double_MEMBER(thresholding_tile_size, 0.33,
                "Desired tile size (to be multiplied by image DPI). "
                "This parameter is used by the LeptonicaOtsu thresholding "
                "method",
                this->params())
, double_MEMBER(thresholding_smooth_kernel_size, 0.0,
                "Size of convolution kernel applied to threshold array "
                "(to be multiplied by image DPI). Use 0 for no smoothing. "
                "This parameter is used by the LeptonicaOtsu thresholding "
                "method",
                this->params())
, double_MEMBER(thresholding_score_fraction, 0.1,
                "Fraction of the max Otsu score. "
                "This parameter is used by the LeptonicaOtsu thresholding "
                "method. "
                "For standard Otsu use 0.0, otherwise 0.1 is recommended",
                this->params())
```
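For a feel of what `thresholding_kfactor` does, here is a small sketch of the Sauvola threshold formula as it is usually written, T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the local window. The helper names are made up, R = 128 is the conventional value for 8-bit images, and I am assuming (not confirming) that Tesseract's Sauvola mode follows this form. The default k = 0.34 comes from the parameter description above.

```javascript
// Mean and population standard deviation of the pixels in one local window.
function windowStats(values) {
  const mean = values.reduce((acc, v) => acc + v, 0) / values.length;
  const variance =
    values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Standard Sauvola formula; R = 128 is an assumed dynamic range for 8-bit data.
function sauvolaThreshold(mean, stdDev, k = 0.34, R = 128) {
  return mean * (1 + k * (stdDev / R - 1));
}

// A flat window (no variance) pulls the threshold well below the mean.
const flat = windowStats([100, 100, 100, 100]);
const flatT = sauvolaThreshold(flat.mean, flat.stdDev);

// A high-contrast window keeps the threshold close to the mean.
const busy = windowStats([0, 255, 0, 255]);
const busyT = sauvolaThreshold(busy.mean, busy.stdDev);

console.log(flatT, busyT);
```

Because the threshold is computed per window, a locally adaptive method like this might keep white text on a blue bubble separable even when a global threshold does not, but I have not verified that on this screenshot.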
It looks like the image is recognized perfectly, without needing to change any Tesseract.js settings, when it is first inverted. I don't know how generalizable this is, since messaging apps can differ between white on black, black on white, and mixed; however, inverting the image to black text on a light background solves the problem in this case.
@Balearica Strange behaviour. Try binarization on the inverted image. I think that in this case the fonts are black and the backgrounds are orange and grey respectively, which provides better contrast than the blue and white combination in the original image. Contrast is the main thing.
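In case it helps anyone trying the same workaround, inverting can be done on the raw RGBA bytes before passing the image to Tesseract.js. This is only a sketch: `invertRgba` is a made-up helper, and the sample blue `rgb(0, 132, 255)` is an assumed colour, not one measured from the screenshot.

```javascript
// Invert the colour channels of an RGBA byte array, leaving alpha untouched.
function invertRgba(rgba) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    out[i] = 255 - rgba[i];         // red
    out[i + 1] = 255 - rgba[i + 1]; // green
    out[i + 2] = 255 - rgba[i + 2]; // blue
    out[i + 3] = rgba[i + 3];       // alpha is kept as-is
  }
  return out;
}

// Hypothetical sample: white text pixel, then an assumed blue bubble pixel.
const original = new Uint8ClampedArray([
  255, 255, 255, 255,
  0, 132, 255, 255,
]);
const inverted = invertRgba(original);
console.log(Array.from(inverted));
```

With these values the white text becomes black and the blue bubble becomes orange, i.e. dark text on a lighter background.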
Tesseract.js version 5.0.4
Describe the bug For some reason, when I use an image of conversation text, it does not detect the messages inside the blue containers.
To Reproduce Steps to reproduce the behavior: just use the image ![test](https://github.com/naptha/tesseract.js/assets/29134272/559aa1c6-e318-4e0a-bcda-a4cee79079bb)
Expected behavior It needs to be Device Version: