Open ZWhitey opened 5 years ago
Try with --psm 6.
I have tried --psm 3, 6, 7 and 13.
3, 6 and 7 give an empty result.
13 gives an incorrect result: 斷 斷 絲 0
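For anyone who wants to repeat this comparison, here is a minimal sketch that runs the same command once per page segmentation mode and collects the output. It assumes `tesseract` is on the PATH and the `chi_tra` data is installed; the helper names `build_command` and `run_psm_modes` are made up for illustration and are not part of Tesseract.

```python
import subprocess


def build_command(image, psm, lang="chi_tra"):
    """Build the same command line used in this thread for one PSM value."""
    return ["tesseract", "-l", lang, image, "stdout", "--psm", str(psm)]


def run_psm_modes(image, modes=(3, 6, 7, 13), lang="chi_tra"):
    """Run tesseract once per PSM and return {psm: recognized text}.

    Assumes the tesseract binary is installed; an empty string means
    tesseract printed nothing for that mode.
    """
    results = {}
    for psm in modes:
        proc = subprocess.run(
            build_command(image, psm, lang),
            capture_output=True,
            encoding="utf-8",
        )
        results[psm] = proc.stdout.strip()
    return results


# Example use (needs tesseract and the image on disk):
# for psm, text in run_psm_modes("c.png").items():
#     print(f"--psm {psm}: {text!r}")
```

Keeping the command construction separate from the call makes it easy to add the other flags used later in the thread (`--dpi`, `--oem`) without touching the loop.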
The problem seems to be with layout analysis.
tesseract -l chi_tra c.png stdout --psm 7 --dpi 300 --oem 1 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract 5.0.0-alpha-322-g74ac' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "c.png"; bbox 0 0 538 75; ppageno 0'>
</div>
</body>
</html>
tesseract -l chi_tra c.png stdout --psm 13 --dpi 300 --oem 1 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract 5.0.0-alpha-322-g74ac' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "c.png"; bbox 0 0 538 75; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 197 17 325 51">
<p class='ocr_par' id='par_1_1' lang='chi_tra' title="bbox 197 17 325 51">
<span class='ocr_line' id='line_1_1' title="bbox 197 17 325 51; baseline -0 -2; x_size 42.666668; x_descenders 10.666667; x_ascenders 10.666666">
<span class='ocrx_word' id='word_1_1' title='bbox 0 0 38 75; x_wconf 26'>﹍</span>
<span class='ocrx_word' id='word_1_2' title='bbox 103 0 169 75; x_wconf 57'>﹣</span>
<span class='ocrx_word' id='word_1_3' title='bbox 206 17 229 51; x_wconf 61'>綬</span>
<span class='ocrx_word' id='word_1_4' title='bbox 262 17 277 51; x_wconf 81'>綠</span>
<span class='ocrx_word' id='word_1_5' title='bbox 314 17 325 51; x_wconf 61'>綾</span>
<span class='ocrx_word' id='word_1_6' title='bbox 389 0 535 75; x_wconf 52'>ˍ</span>
</span>
</p>
</div>
</div>
</body>
</html>
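The hOCR dumps above are hard to compare by eye. The following is a rough sketch, using only the Python standard library, that pulls each `ocrx_word` out of hOCR text together with its `bbox` and `x_wconf`. The names `parse_hocr`, `parse_title` and `HocrWords` are hypothetical helpers, and `SAMPLE` is two word lines copied from the --psm 13 output.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Extract bbox and x_wconf from an hOCR title attribute."""
    fields = {}
    for part in title.split(";"):
        tokens = part.split()
        if not tokens:
            continue
        if tokens[0] == "bbox":
            fields["bbox"] = tuple(int(t) for t in tokens[1:5])
        elif tokens[0] == "x_wconf":
            fields["conf"] = int(tokens[1])
    return fields


class HocrWords(HTMLParser):
    """Collect text, bbox and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            self._current = {"text": "", **parse_title(attrs.get("title", ""))}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def parse_hocr(text):
    """Return a list of word dicts found in an hOCR document or fragment."""
    parser = HocrWords()
    parser.feed(text)
    parser.close()
    return parser.words


SAMPLE = """
<span class='ocrx_word' id='word_1_3' title='bbox 206 17 229 51; x_wconf 61'>綬</span>
<span class='ocrx_word' id='word_1_4' title='bbox 262 17 277 51; x_wconf 81'>綠</span>
"""

for word in parse_hocr(SAMPLE):
    print(word["text"], word["bbox"], word["conf"])
```

Run on the full --psm 13 output, this lists six words. Three of them (﹍, ﹣ and ˍ) have boxes that span the full image height (0 to 75) and lie outside the line box 197 17 325 51, while the three real characters sit inside it.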
How can this problem be solved?
Environment
Language Data:
chi_tra (best)
Input:
c.png and c1.png are the same image, except that the last character has been removed from c1.png.
Current Behavior:
tesseract -l chi_tra c.png stdout --psm 7
tesseract -l chi_tra c1.png stdout --psm 7
c.png gives an empty result; c1.png gives the correct result, 綠綠.
Expected Behavior:
c.png should give 綠綠綠; c1.png should give 綠綠.