thomasgruebl / rusty-tesseract

A Rust wrapper for Google Tesseract
MIT License
124 stars 14 forks

Feature Request: add hOCR output support #13

Open STRRL opened 11 months ago

STRRL commented 11 months ago

Hi! rusty-tesseract is amazing work! It works well on both my Linux and macOS machines!

I have used it in my personal project, https://github.com/strrl/dejavu, and I found that I need more detailed layout information, such as page, paragraph, and line, not only the "word". ref: https://github.com/STRRL/dejavu/issues/7

I found that both ALTO and hOCR output would make this possible, and both are XML-based formats. I prefer hOCR because its spec still seems to be actively updated: https://github.com/kba/hocr-spec/

So here is my proposal:

What do you think about it? :heart:

I could draft a PR for that.

thomasgruebl commented 11 months ago

Hi, thanks for raising this issue and glad to hear that you like rusty_tesseract!

Tesseract (and rusty_tesseract) already provide the option to output in hOCR format by setting the 'tessedit_create_hocr' flag to '1'.

Consider lines 31-40 in the main.rs file: You can simply add the hOCR flag to the config_variables HashMap as follows:

let image_to_string_args = Args {
    lang: "eng".into(),
    config_variables: HashMap::from([
        (
            "tessedit_char_whitelist".into(),
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".into(),
        ),
        // Ask Tesseract to emit hOCR instead of plain text.
        ("tessedit_create_hocr".into(), "1".into()),
    ]),
    dpi: Some(150),
    psm: Some(6),
    oem: Some(3),
};

Then the rusty_tesseract::image_to_string() output looks as follows:

The String output is: <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 4.1.1' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "/tmp/rusty-tesseractkxwqOh.png"; bbox 0 0 696 89; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 18 29 671 64">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 18 29 671 64">
     <span class='ocr_line' id='line_1_1' title="bbox 18 29 671 64; baseline 0 -1; x_size 44.862743; x_descenders 11.215686; x_ascenders 11.215686">
      <span class='ocrx_word' id='word_1_1' title='bbox 18 29 162 64; x_wconf 95'>LOREM</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 181 29 304 64; x_wconf 91'>IPSUM</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 323 29 476 64; x_wconf 91'>DOLOR</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 490 29 540 64; x_wconf 96'>SIT</span>
      <span class='ocrx_word' id='word_1_5' title='bbox 553 30 671 63; x_wconf 96'>AMET</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
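
To get at the page, paragraph, and line details asked about above, the `title` attribute of each hOCR element can be split into its properties. The following is only a minimal sketch using the standard library: `HocrProps` and `parse_hocr_title` are hypothetical names, not part of rusty_tesseract, and the sketch assumes properties are separated by `;` as in the output shown here (a file path containing `;` would break it). Extracting the `title` strings from the markup is left to an XML or HTML parser of your choice.

```rust
/// Properties parsed from one hOCR `title` attribute (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct HocrProps {
    /// Bounding box as (x0, y0, x1, y1) in pixels, if a `bbox` property is present.
    pub bbox: Option<(u32, u32, u32, u32)>,
    /// Word confidence from `x_wconf`, which appears on `ocrx_word` spans.
    pub confidence: Option<u32>,
}

/// Parses a title string such as "bbox 18 29 162 64; x_wconf 95".
/// Unknown properties (for example `image` or `baseline`) are ignored.
pub fn parse_hocr_title(title: &str) -> HocrProps {
    let mut props = HocrProps { bbox: None, confidence: None };
    for field in title.split(';') {
        let mut parts = field.split_whitespace();
        match parts.next() {
            Some("bbox") => {
                // Keep the bbox only if it has exactly four integer coordinates.
                let nums: Vec<u32> = parts.filter_map(|p| p.parse().ok()).collect();
                if let [x0, y0, x1, y1] = nums.as_slice() {
                    props.bbox = Some((*x0, *y0, *x1, *y1));
                }
            }
            Some("x_wconf") => {
                props.confidence = parts.next().and_then(|p| p.parse().ok());
            }
            _ => {}
        }
    }
    props
}
```

Applied to the `word_1_1` span above, this would yield the box (18, 29, 162, 64) with confidence 95; the same function works on the `ocr_page`, `ocr_par`, and `ocr_line` titles, where the confidence is simply absent.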

However, it might not be entirely clear to new users that such a config flag exists in Tesseract, so please feel free to create a new function, image_to_hocr, that automatically appends the tessedit_create_hocr flag to the config_variables HashMap.
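
The "append the flag" step of such a function could look roughly like the sketch below. `with_hocr_flag` is a hypothetical helper name, and it assumes config_variables is a `HashMap<String, String>`, as the `.into()` calls in the snippet above suggest. It only touches the map and does not call into rusty_tesseract.

```rust
use std::collections::HashMap;

/// Hypothetical helper: takes the caller's config variables and returns them
/// with hOCR output enabled, overwriting any earlier value for the flag.
pub fn with_hocr_flag(
    mut config_variables: HashMap<String, String>,
) -> HashMap<String, String> {
    config_variables.insert("tessedit_create_hocr".to_string(), "1".to_string());
    config_variables
}
```

An image_to_hocr wrapper could then pass the caller's config_variables through this helper, build the Args value from the result, and delegate to rusty_tesseract::image_to_string() as before.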

P.S. Similarly, you can append the tessedit_create_alto flag to the config_variables or any other flag that is listed in the tesseract --print-parameters list.

Thanks,

Thomas