tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Strange output result for chinese recognition #2505

Open ZWhitey opened 5 years ago

ZWhitey commented 5 years ago

Environment

Language Data:

chi_tra (best)

Input:

c.png and c1.png are the same image, except that c1.png has the last character removed.

Current Behavior:

tesseract -l chi_tra c.png stdout --psm 7
tesseract -l chi_tra c1.png stdout --psm 7

c.png gives an empty result.
c1.png gives the correct result: 綠綠

Expected Behavior:

c.png should give 綠綠綠.
c1.png should give 綠綠.

Shreeshrii commented 5 years ago

Try with --psm 6.

ZWhitey commented 5 years ago

> Try with --psm 6.

I have tried --psm 3, 6, 7 and 13. Modes 3, 6 and 7 give an empty result; mode 13 gives an incorrect result: 斷 斷 絲 0
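
For anyone who wants to repeat this comparison, the sweep over page segmentation modes can be scripted. This is only a minimal sketch: it assumes the `tesseract` binary is on PATH and that the chi_tra traineddata is installed. The helper names `build_command` and `run_psm_sweep` are made up for illustration and are not part of Tesseract.

```python
import subprocess


def build_command(image, psm, lang="chi_tra"):
    """Build the same command line used in this thread for one PSM value."""
    return ["tesseract", "-l", lang, image, "stdout", "--psm", str(psm)]


def run_psm_sweep(image, modes=(3, 6, 7, 13)):
    """Run tesseract once per page segmentation mode and collect the text.

    Assumes the tesseract CLI is installed; an empty string in the result
    means that mode produced no text for the image.
    """
    results = {}
    for psm in modes:
        proc = subprocess.run(
            build_command(image, psm),
            capture_output=True,
            text=True,
            encoding="utf-8",
            check=False,  # keep going even if one mode exits with an error
        )
        results[psm] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    # "c.png" is the image name used in this issue.
    for psm, text in run_psm_sweep("c.png").items():
        print(f"--psm {psm}: {text!r}")
```

Printing the result with `!r` makes an empty result visible as `''` instead of a blank line.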

Shreeshrii commented 5 years ago

The problem seems to be with layout analysis.

  tesseract -l chi_tra c.png stdout --psm 7  --dpi 300 --oem 1 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.0-alpha-322-g74ac' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "c.png"; bbox 0 0 538 75; ppageno 0'>
  </div>
 </body>
</html>

  tesseract -l chi_tra c.png stdout --psm 13  --dpi 300 --oem 1 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.0-alpha-322-g74ac' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "c.png"; bbox 0 0 538 75; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 197 17 325 51">
    <p class='ocr_par' id='par_1_1' lang='chi_tra' title="bbox 197 17 325 51">
     <span class='ocr_line' id='line_1_1' title="bbox 197 17 325 51; baseline -0 -2; x_size 42.666668; x_descenders 10.666667; x_ascenders 10.666666">
      <span class='ocrx_word' id='word_1_1' title='bbox 0 0 38 75; x_wconf 26'>﹍</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 103 0 169 75; x_wconf 57'>﹣</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 206 17 229 51; x_wconf 61'>綬</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 262 17 277 51; x_wconf 81'>綠</span>
      <span class='ocrx_word' id='word_1_5' title='bbox 314 17 325 51; x_wconf 61'>綾</span>
      <span class='ocrx_word' id='word_1_6' title='bbox 389 0 535 75; x_wconf 52'>ˍ</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
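
In the --psm 13 output above, the line box is 197 17 325 51, yet three of the six word boxes (0 0 38 75, 103 0 169 75 and 389 0 535 75) fall outside it, which fits the suspicion about layout analysis. A rough way to spot such words automatically is sketched below using only the Python standard library. The class and function names are hypothetical, and `SAMPLE` is a trimmed copy of two words from the output above, not a full hOCR document.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WCONF_RE = re.compile(r"x_wconf (\d+)")


class HocrWordCollector(HTMLParser):
    """Collect each ocrx_word with its bbox, confidence and enclosing line bbox."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._line_bbox = None
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        title = attrs.get("title") or ""
        match = BBOX_RE.search(title)
        if match is None:
            return
        bbox = tuple(int(v) for v in match.groups())
        css_class = attrs.get("class")
        if css_class == "ocr_line":
            self._line_bbox = bbox
        elif css_class == "ocrx_word":
            conf = WCONF_RE.search(title)
            self.words.append({
                "text": "",
                "bbox": bbox,
                "line_bbox": self._line_bbox,
                "conf": int(conf.group(1)) if conf else None,
            })
            self._in_word = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False

    def handle_data(self, data):
        if self._in_word:
            self.words[-1]["text"] += data.strip()


def inside(inner, outer):
    """True if box `inner` (x0, y0, x1, y1) lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def words_outside_line(hocr):
    """Return the words whose bbox is not contained in their line's bbox."""
    parser = HocrWordCollector()
    parser.feed(hocr)
    parser.close()
    return [w for w in parser.words
            if w["line_bbox"] is not None
            and not inside(w["bbox"], w["line_bbox"])]


# Trimmed from the --psm 13 output in this thread.
SAMPLE = """
<span class='ocr_line' id='line_1_1' title="bbox 197 17 325 51">
 <span class='ocrx_word' id='word_1_1' title='bbox 0 0 38 75; x_wconf 26'>﹍</span>
 <span class='ocrx_word' id='word_1_4' title='bbox 262 17 277 51; x_wconf 81'>綠</span>
</span>
"""

for word in words_outside_line(SAMPLE):
    print(word["text"], word["bbox"], word["conf"])
```

Containment in the line box is just a heuristic chosen for this sketch; it flags the suspicious words here but says nothing about why layout analysis produced them.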

asdbaihu commented 5 years ago

How can this problem be solved?