naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0
34.91k stars 2.21k forks

【BUG】Cannot recognize text in all four orientations in part symbol images #861

Closed easyeda2021 closed 9 months ago

easyeda2021 commented 9 months ago

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo): v5.0.3

Describe the bug: Taking the image img_v3_026c_030a1028-ef46-4555-9838-c291aaf3670g as an example (from page 1 of https://atta.szlcsc.com/upload/public/pdf/source/20151029/1457707509740.pdf), some of the text is not recognized.

Missing text: (image)

To Reproduce: Steps to reproduce the behavior: take a screenshot of the page, then import it into Tesseract.

Please attach any input image required to replicate this behavior: (three images attached)

Expected behavior: Text in all four orientations is recognized correctly.

Device Version:

Additional context: None.

Thank you for the great work.

Balearica commented 9 months ago

There are multiple intersecting reasons why these particular images perform poorly; however, all of them are issues with the Tesseract OCR engine rather than Tesseract.js, so fixing them would be outside the scope of this repo.

  1. Tesseract is not capable of handling multiple text orientations within the same image
    • Tesseract should be capable of recognizing "this entire image needs to be rotated 90 degrees", however it is not capable of recognizing "this word needs to be rotated 90 degrees"
    • Edit: This is partially incorrect, see below.
  2. Tesseract often performs poorly when non-text elements are combined with text elements
    • Underlining text, drawing boxes around text, etc. often throws Tesseract off
  3. Tesseract often performs poorly when recognizing complex layouts
    • Any layout more complex than a basic 1 or 2 column layout, including images where text is essentially scattered throughout, is likely to perform poorly

For context, Tesseract.js is the JavaScript/WebAssembly port of Tesseract. We do not make any edits to the recognition engine, so any accuracy issues with the Tesseract engine are outside the scope of this project. Therefore, if you would like to pursue this further, you should consult the documentation and discussion for the main Tesseract project. You may find configuration settings that help achieve better results.

If you do not find settings that improve recognition, and believe this constitutes a (previously unreported) bug, then you should replicate the issue using the main Tesseract command-line (CLI) program and raise the issue with that project.

Edit: My first bullet point was partially incorrect. When run with page segmentation mode (PSM) AUTO (3), Tesseract can create multiple blocks per page, and text orientation is detected at the block level. Therefore, it is theoretically possible for horizontal and vertical text to be detected on the same page in this mode. However, in my experience, enabling PSM AUTO does not work particularly well and often results in words being categorized as noise and deleted. Therefore, I doubt that changing to PSM AUTO will solve this particular issue.
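
For anyone who wants to check this on their own images, here is a rough sketch of how the page segmentation mode could be switched in Tesseract.js v5 and the results compared. It assumes `tesseract.js` is installed from npm. The helper names `countWords` and `comparePsmModes` and the path `symbol.png` are hypothetical and exist only for illustration. Word count is only a crude proxy for "words deleted as noise", so treat the numbers as a hint rather than a measurement.

```javascript
// Sketch: run the same image under PSM SINGLE_BLOCK and PSM AUTO and compare.
// Assumes tesseract.js v5; helper names below are hypothetical.

// Page segmentation mode values, passed to Tesseract.js as strings.
const PSM_SINGLE_BLOCK = '6';
const PSM_AUTO = '3';

// Count whitespace-separated tokens in the recognized text.
function countWords(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

// Run one recognition pass per mode on an existing worker and
// return { psm, words, confidence } for each mode, in order.
async function comparePsmModes(worker, image, modes = [PSM_SINGLE_BLOCK, PSM_AUTO]) {
  const results = [];
  for (const psm of modes) {
    await worker.setParameters({ tessedit_pageseg_mode: psm });
    const { data } = await worker.recognize(image);
    results.push({ psm, words: countWords(data.text), confidence: data.confidence });
  }
  return results;
}

// Example entry point; 'symbol.png' is a placeholder path.
async function main(imagePath = 'symbol.png') {
  // Loaded here so the helpers above have no dependency of their own.
  const { createWorker } = require('tesseract.js');
  const worker = await createWorker('eng');
  try {
    console.table(await comparePsmModes(worker, imagePath));
  } finally {
    await worker.terminate();
  }
}
```

Calling `main('path/to/image.png')` prints one row per mode. If PSM AUTO returns noticeably fewer words than PSM SINGLE_BLOCK on the same image, that would be consistent with the behavior described above, although it does not prove words were discarded as noise.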

easyeda2021 commented 9 months ago

Hi Balearica, thank you for your reply. Understood. We will check whether the main Tesseract project has the same issue. Thanks.