Closed agutkarsh closed 2 years ago
Parse it from the hOCR output with hocr_font_info enabled.
Given an image, run the following:
import pytesseract
from PIL import Image
from io import BytesIO

# Read the image bytes and open them with Pillow.
with open("./example.png", "rb") as f:
    data = f.read()
image = Image.open(BytesIO(data))

# hocr_font_info=1 adds font attributes (such as x_fsize) to each word in the hOCR output.
hocr = pytesseract.image_to_pdf_or_hocr(image, config="-c hocr_font_info=1", lang="eng", extension="hocr")
print(hocr.decode("utf-8"))
And you will get output containing the word font attributes:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract 4.0.0' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf ocrp_lang ocrp_dir ocrp_font ocrp_fsize'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "/tmp/tess_l3frd4kr.PNG"; bbox 0 0 1443 136; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 33 47 1413 94">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 33 47 1413 94">
<span class='ocr_line' id='line_1_1' title="bbox 33 47 1413 94; baseline 0 -10; x_size 47; x_descenders 10; x_ascenders 12">
<span class='ocrx_word' id='word_1_1' title='bbox 33 50 109 84; x_wconf 95; x_fsize 48'>Has</span>
<span class='ocrx_word' id='word_1_2' title='bbox 125 59 244 94; x_wconf 96; x_fsize 48'>anyone</span>
<span class='ocrx_word' id='word_1_3' title='bbox 247 47 347 84; x_wconf 96; x_fsize 48'>been</span>
<span class='ocrx_word' id='word_1_4' title='bbox 350 59 408 84; x_wconf 96; x_fsize 48'>as</span>
<span class='ocrx_word' id='word_1_5' title='bbox 411 47 495 84; x_wconf 95; x_fsize 48'>far</span>
<span class='ocrx_word' id='word_1_6' title='bbox 511 59 549 84; x_wconf 95; x_fsize 48'>as</span>
<span class='ocrx_word' id='word_1_7' title='bbox 565 59 656 84; x_wconf 95; x_fsize 48'>even</span>
<span class='ocrx_word' id='word_1_8' title='bbox 672 47 824 84; x_wconf 96; x_fsize 48'>decided</span>
<span class='ocrx_word' id='word_1_9' title='bbox 838 54 874 84; x_wconf 96; x_fsize 48'>to</span>
<span class='ocrx_word' id='word_1_10' title='bbox 889 54 986 84; x_wconf 96; x_fsize 48'>want</span>
<span class='ocrx_word' id='word_1_11' title='bbox 999 54 1036 84; x_wconf 96; x_fsize 48'>to</span>
<span class='ocrx_word' id='word_1_12' title='bbox 1052 47 1098 84; x_wconf 96; x_fsize 48'>do</span>
<span class='ocrx_word' id='word_1_13' title='bbox 1113 47 1201 84; x_wconf 96; x_fsize 48'>look</span>
<span class='ocrx_word' id='word_1_14' title='bbox 1215 59 1294 84; x_wconf 96; x_fsize 48'>more</span>
<span class='ocrx_word' id='word_1_15' title='bbox 1297 47 1413 84; x_wconf 95; x_fsize 48'>like.</span>
</span>
</p>
</div>
</div>
</body>
</html>
The value you're looking for is x_fsize; its accuracy depends on providing the correct DPI.
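If the image does not carry reliable DPI metadata, one option is to pass the DPI explicitly through Tesseract's `--dpi` command-line option in the same config string. Below is a minimal sketch of that idea. The helper names `hocr_config` and `image_to_hocr` are made up for illustration, and the default of 300 is only an assumed value; use the real resolution of your scan.

```python
def hocr_config(dpi: int) -> str:
    """Build a config string that sets the DPI and enables hOCR font info."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return f"--dpi {dpi} -c hocr_font_info=1"


def image_to_hocr(path: str, dpi: int = 300) -> str:
    """Run Tesseract on the image at `path` and return the hOCR as text.

    The default dpi of 300 is an assumption, not a recommendation.
    """
    # Imported here so hocr_config can be used without the OCR stack installed.
    import pytesseract
    from PIL import Image

    image = Image.open(path)
    raw = pytesseract.image_to_pdf_or_hocr(
        image, config=hocr_config(dpi), lang="eng", extension="hocr"
    )
    return raw.decode("utf-8")
```

Calling `image_to_hocr("./example.png", dpi=300)` should then produce the same kind of hOCR as above, with x_fsize computed against the DPI you supplied.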
IIRC, I've seen the font name attribute (x_font) in the output for Tesseract 5, but it wasn't working just now for whatever reason; maybe because there are too few words in the image.
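To pull x_fsize (and x_font, when it is present) out of the hOCR programmatically, something along these lines should work. It is a rough sketch using only the standard library's `html.parser`. The names `parse_title`, `WordFontParser` and `extract_word_fonts` are my own, and it assumes each word's properties sit in the `title` attribute of an `ocrx_word` span, as in the output above.

```python
from html.parser import HTMLParser


def parse_title(title: str) -> dict:
    """Split an hOCR title such as 'bbox 33 50 109 84; x_wconf 95; x_fsize 48'."""
    props = {}
    for part in title.split(";"):
        key, _, value = part.strip().partition(" ")
        if key:
            props[key] = value.strip()
    return props


class WordFontParser(HTMLParser):
    """Collect text, font size and font name for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._props = None  # title properties of the word span currently open
        self._text = []
        self._depth = 0     # nesting depth of spans inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._props is not None:
            self._depth += 1  # a span nested inside a word
        elif attrs.get("class") == "ocrx_word":
            self._props = parse_title(attrs.get("title") or "")
            self._text = []
            self._depth = 0

    def handle_data(self, data):
        if self._props is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._props is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        size = self._props.get("x_fsize")
        self.words.append({
            "text": "".join(self._text).strip(),
            "font_size": int(size) if size is not None else None,
            "font_name": self._props.get("x_font"),  # None when Tesseract omits it
        })
        self._props = None


def extract_word_fonts(hocr: str) -> list:
    parser = WordFontParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Two word spans copied from the output above, used as sample input.
sample = """<span class='ocr_line' id='line_1_1' title="bbox 33 47 1413 94">
<span class='ocrx_word' id='word_1_1' title='bbox 33 50 109 84; x_wconf 95; x_fsize 48'>Has</span>
<span class='ocrx_word' id='word_1_2' title='bbox 125 59 244 94; x_wconf 96; x_fsize 48'>anyone</span>
</span>"""

for word in extract_word_fonts(sample):
    print(word["text"], word["font_size"], word["font_name"])
```

In practice you would pass the decoded hOCR string from pytesseract to `extract_word_fonts` instead of `sample`. The `font_name` field stays `None` whenever x_font is missing from the output.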
@int3l Can you please close this; it's not an issue with the library.
@caerulescens Thanks, I got that
I want to get the font size and font style of the text in an image. Is there any way to do this with Tesseract? I read somewhere that WordFontAttributes only worked in version 3.0.5, not in 4.0.0 or later. Please let me know if there is currently another way to do this.