madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

Get font size and font style using tesseract #430

Closed agutkarsh closed 2 years ago

agutkarsh commented 2 years ago

I want to get the font size and font style of the text in an image. Is there any way to do this with Tesseract? I read somewhere that WordFontAttributes only worked in version 3.0.5, not in 4.0.0 or later. Please let me know if there is currently any other way to do this.

caerulescens commented 2 years ago

Parse it from the hOCR output with hocr_font_info enabled.

caerulescens commented 2 years ago

Given an image such as example.png, run the following:

import pytesseract
from PIL import Image

# Load the image to be recognized.
image = Image.open("./example.png")

# Request hOCR output; hocr_font_info=1 adds font attributes to each word.
hocr = pytesseract.image_to_pdf_or_hocr(
    image, lang="eng", config="-c hocr_font_info=1", extension="hocr"
)
print(hocr.decode("utf-8"))

And you will get output containing the word font attributes:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 4.0.0' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf ocrp_lang ocrp_dir ocrp_font ocrp_fsize'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "/tmp/tess_l3frd4kr.PNG"; bbox 0 0 1443 136; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 33 47 1413 94">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 33 47 1413 94">
     <span class='ocr_line' id='line_1_1' title="bbox 33 47 1413 94; baseline 0 -10; x_size 47; x_descenders 10; x_ascenders 12">
      <span class='ocrx_word' id='word_1_1' title='bbox 33 50 109 84; x_wconf 95; x_fsize 48'>Has</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 125 59 244 94; x_wconf 96; x_fsize 48'>anyone</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 247 47 347 84; x_wconf 96; x_fsize 48'>been</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 350 59 408 84; x_wconf 96; x_fsize 48'>as</span>
      <span class='ocrx_word' id='word_1_5' title='bbox 411 47 495 84; x_wconf 95; x_fsize 48'>far</span>
      <span class='ocrx_word' id='word_1_6' title='bbox 511 59 549 84; x_wconf 95; x_fsize 48'>as</span>
      <span class='ocrx_word' id='word_1_7' title='bbox 565 59 656 84; x_wconf 95; x_fsize 48'>even</span>
      <span class='ocrx_word' id='word_1_8' title='bbox 672 47 824 84; x_wconf 96; x_fsize 48'>decided</span>
      <span class='ocrx_word' id='word_1_9' title='bbox 838 54 874 84; x_wconf 96; x_fsize 48'>to</span>
      <span class='ocrx_word' id='word_1_10' title='bbox 889 54 986 84; x_wconf 96; x_fsize 48'>want</span>
      <span class='ocrx_word' id='word_1_11' title='bbox 999 54 1036 84; x_wconf 96; x_fsize 48'>to</span>
      <span class='ocrx_word' id='word_1_12' title='bbox 1052 47 1098 84; x_wconf 96; x_fsize 48'>do</span>
      <span class='ocrx_word' id='word_1_13' title='bbox 1113 47 1201 84; x_wconf 96; x_fsize 48'>look</span>
      <span class='ocrx_word' id='word_1_14' title='bbox 1215 59 1294 84; x_wconf 96; x_fsize 48'>more</span>
      <span class='ocrx_word' id='word_1_15' title='bbox 1297 47 1413 84; x_wconf 95; x_fsize 48'>like.</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

The value you're looking for is x_fsize. It's a point size that Tesseract derives from the text's pixel height and the image resolution, so it's only accurate if the correct DPI is provided.
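To pull the sizes out programmatically, something like the following rough sketch should work. It uses only the standard library and assumes the single-quoted ocrx_word layout shown in the output above, with plain text inside each span; word_font_sizes is just a name I made up for illustration, not part of pytesseract.

```python
import re
from typing import List, Tuple

# Matches one ocrx_word span and captures its title attribute and its text.
# Assumption: attributes are single-quoted and the span holds plain text only.
WORD_RE = re.compile(
    r"<span class='ocrx_word'[^>]*title='([^']*)'[^>]*>([^<]*)</span>"
)
# Captures the integer that follows x_fsize inside a title attribute.
FSIZE_RE = re.compile(r"\bx_fsize (\d+)")


def word_font_sizes(hocr: str) -> List[Tuple[str, int]]:
    """Return (word, font size) pairs for every word that carries x_fsize."""
    sizes = []
    for title, text in WORD_RE.findall(hocr):
        match = FSIZE_RE.search(title)
        if match:
            sizes.append((text, int(match.group(1))))
    return sizes


# Two word spans copied from the output above.
sample = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 33 50 109 84; x_wconf 95; x_fsize 48'>Has</span>\n"
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 125 59 244 94; x_wconf 96; x_fsize 48'>anyone</span>\n"
)
print(word_font_sizes(sample))  # [('Has', 48), ('anyone', 48)]
```

In practice you would pass the decoded hOCR string from image_to_pdf_or_hocr instead of sample. A real HTML parser would be more robust if the word spans ever contain nested tags.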


IIRC, I've seen the font name attribute (x_font) in the output for Tesseract 5, but it didn't show up just now for whatever reason; maybe because there are too few words in the image.
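Since x_font may or may not be there, it's safer to treat it as optional when reading the title attribute. A minimal sketch, assuming the "key value; key value" layout of the title strings above; parse_title is a hypothetical helper, and the title containing x_font is made up for illustration, not taken from real output.

```python
from typing import Dict


def parse_title(title: str) -> Dict[str, str]:
    """Split an hOCR title attribute into a {property: value} mapping."""
    props = {}
    for part in title.split(";"):
        # Each part looks like "x_fsize 48": the key is the first token.
        key, _, value = part.strip().partition(" ")
        if key:
            props[key] = value
    return props


# Title copied from the output above (no x_font present).
without_font = parse_title("bbox 33 50 109 84; x_wconf 95; x_fsize 48")
# Hypothetical title showing what a word with a font name might look like.
with_font = parse_title("bbox 33 50 109 84; x_wconf 95; x_font Arial; x_fsize 48")

print(without_font.get("x_font"))  # None
print(with_font.get("x_font"))     # Arial
print(without_font["x_fsize"])     # 48
```

Using .get() means a missing x_font comes back as None instead of raising an error.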

caerulescens commented 2 years ago

@int3l Can you please close this; it's not an issue with the library.

agutkarsh commented 2 years ago

@caerulescens Thanks, I got that