naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

`Tesseract.recognize` returns empty string in `data.text` #886

Closed. KaKi87 closed this issue 4 months ago

KaKi87 commented 4 months ago

Tesseract.js version

5.0.4

Describe the bug

console.log((await Tesseract.recognize(buffer)).data.text); // --> ''

To Reproduce

Steps to reproduce the behavior:

  1. Download the image below
  2. Pass it as a buffer to Tesseract.recognize
  3. Notice data.text contains '' and no error is thrown

Please attach any input image required to replicate this behavior.

Expected behavior

`data.text` contains the content of the image.


Additional context

None


Thanks

Balearica commented 4 months ago

Closing as not a bug. Tesseract.js returns an empty string when no text is detected, so the fact that it does not throw an error is an intended behavior.

The fact that no text is returned for this particular image is also not a bug, as this appears to be a CAPTCHA, and therefore was specifically designed to not be recognizable by Tesseract (and similar programs).

KaKi87 commented 4 months ago

> Tesseract.js returns an empty string when no text is detected, so the fact that it does not throw an error is an intended behavior.

Well, then it would be nice to mention this in the API documentation.

That said, I don't feel it makes sense to return success on failure 🤔

Balearica commented 4 months ago

In general, runtime errors should only be thrown when a program fails to run to completion. If Tesseract recognition fails to run (return code 1) an error will be thrown. If Tesseract runs and exits successfully (return code 0), that will not throw an error, even if the results happen to be incorrect (which Tesseract has no way of knowing). Furthermore, there is no reason to assume finding no text on a page is incorrect. The single most common use of OCR is document scanning, and documents frequently contain pages with no text.
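For callers who do want an empty result treated as a failure, that check can live in application code. The following is only a minimal sketch of one way to do it: `recognizeOrThrow` and `NoTextDetectedError` are hypothetical names, not part of Tesseract.js, and the only assumption about the library is the one shown in this issue, namely that the result exposes the recognized text as `data.text`.

```javascript
// Hypothetical helper, not part of Tesseract.js: turns "ran successfully but
// found no text" into an explicit error that the caller can handle.
class NoTextDetectedError extends Error {
  constructor() {
    super('OCR completed but no text was detected');
    this.name = 'NoTextDetectedError';
  }
}

// `recognize` is any async function shaped like Tesseract.recognize:
// it takes an image and resolves to an object with a `data.text` string.
async function recognizeOrThrow(recognize, image) {
  // If `recognize` rejects (recognition failed to run), the rejection
  // propagates to the caller unchanged.
  const { data } = await recognize(image);

  // Whitespace-only output is treated the same as an empty string.
  const text = data.text.trim();
  if (text === '') {
    throw new NoTextDetectedError();
  }
  return text;
}
```

With Tesseract.js this could be called as `await recognizeOrThrow((image) => Tesseract.recognize(image), buffer)`. Whether an empty page counts as an error stays a decision for the caller, which matches the reasoning above: the library cannot know if "no text" is wrong for a given input.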

KaKi87 commented 4 months ago

> The single most common use of OCR is document scanning, and documents frequently contain pages with no text.

I see.

Still, I wouldn't have created this issue if this was mentioned in the API documentation, especially considering that I've already successfully used this lib for solving captchas from different sources.

Thanks

Balearica commented 4 months ago

Okay, I've added a warning to api.md that states that exceptions are not thrown when no text is detected.

KaKi87 commented 4 months ago

Thanks !