Open STRRL opened 11 months ago
Hi, thanks for raising this issue and glad to hear that you like rusty_tesseract!
Tesseract (and rusty_tesseract) already provide the option to output in hOCR format by setting the 'tessedit_create_hocr' flag to '1'.
Consider lines 31-40 of main.rs: you can simply add the hOCR flag to the config_variables HashMap as follows:
let image_to_string_args = Args {
    lang: "eng".into(),
    config_variables: HashMap::from([
        (
            "tessedit_char_whitelist".into(),
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".into(),
        ),
        ("tessedit_create_hocr".into(), "1".into()),
    ]),
    dpi: Some(150),
    psm: Some(6),
    oem: Some(3),
};
Then the rusty_tesseract::image_to_string() output looks as follows:
The String output is: <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract 4.1.1' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "/tmp/rusty-tesseractkxwqOh.png"; bbox 0 0 696 89; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 18 29 671 64">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 18 29 671 64">
<span class='ocr_line' id='line_1_1' title="bbox 18 29 671 64; baseline 0 -1; x_size 44.862743; x_descenders 11.215686; x_ascenders 11.215686">
<span class='ocrx_word' id='word_1_1' title='bbox 18 29 162 64; x_wconf 95'>LOREM</span>
<span class='ocrx_word' id='word_1_2' title='bbox 181 29 304 64; x_wconf 91'>IPSUM</span>
<span class='ocrx_word' id='word_1_3' title='bbox 323 29 476 64; x_wconf 91'>DOLOR</span>
<span class='ocrx_word' id='word_1_4' title='bbox 490 29 540 64; x_wconf 96'>SIT</span>
<span class='ocrx_word' id='word_1_5' title='bbox 553 30 671 63; x_wconf 96'>AMET</span>
</span>
</p>
</div>
</div>
</body>
</html>
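For anyone who wants the page, line, and word geometry out of that string: the title attributes in the output above are semicolon-separated properties such as bbox and x_wconf. Below is a minimal sketch of reading them with only the standard library. The names HocrProps and parse_hocr_title are hypothetical and not part of rusty_tesseract, and the sketch assumes the attribute value has already been pulled out of the markup (for example with an XML parser of your choice).

```rust
/// Geometry and confidence read from one hOCR `title` attribute.
/// (Hypothetical helper type, not provided by rusty_tesseract.)
#[derive(Debug, PartialEq)]
struct HocrProps {
    /// Bounding box as (x0, y0, x1, y1) in pixels, if present.
    bbox: Option<(u32, u32, u32, u32)>,
    /// Word confidence from `x_wconf`, if present.
    x_wconf: Option<u32>,
}

/// Parses a `title` value such as "bbox 18 29 162 64; x_wconf 95".
/// Properties other than `bbox` and `x_wconf` (baseline, x_size, ...) are ignored.
fn parse_hocr_title(title: &str) -> HocrProps {
    let mut props = HocrProps { bbox: None, x_wconf: None };
    // Each property is "<name> <values...>", separated by ';'.
    for part in title.split(';') {
        let mut tokens = part.split_whitespace();
        match tokens.next() {
            Some("bbox") => {
                let nums: Vec<u32> = tokens.filter_map(|t| t.parse().ok()).collect();
                // Only accept a complete box of four integers.
                if nums.len() == 4 {
                    props.bbox = Some((nums[0], nums[1], nums[2], nums[3]));
                }
            }
            Some("x_wconf") => {
                props.x_wconf = tokens.next().and_then(|t| t.parse().ok());
            }
            _ => {}
        }
    }
    props
}
```

Applied to the title of word_1_1 in the output, this would give the box (18, 29, 162, 64) with confidence 95; for an ocr_line title it returns the box and no confidence. One caveat of this simple split: an image path containing a semicolon in the ocr_page title would confuse it.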
However, it might not be entirely clear for new users that such a config flag exists within tesseract, so please feel free to create a new function image_to_hocr that automatically appends the tessedit_create_hocr flag to the config_variables HashMap.
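As a rough sketch of the core of such a function, the step that matters is merging the flag into whatever config_variables the caller already supplied. The helper below works on a plain HashMap so it stands alone; with_hocr_flag is a hypothetical name, and how it would be wired into an image_to_hocr wrapper around image_to_string is left as an assumption in the comments.

```rust
use std::collections::HashMap;

/// Tesseract parameter that switches the output to hOCR.
const HOCR_FLAG: &str = "tessedit_create_hocr";

/// Returns a copy of `config_variables` with the hOCR flag set to "1".
/// Existing entries (for example a character whitelist) are preserved,
/// and a caller-supplied value for the flag is overwritten.
/// (Hypothetical helper, not part of rusty_tesseract.)
fn with_hocr_flag(config_variables: &HashMap<String, String>) -> HashMap<String, String> {
    let mut merged = config_variables.clone();
    merged.insert(HOCR_FLAG.to_string(), "1".to_string());
    merged
}

// Assumed usage inside a future image_to_hocr: build a new Args whose
// config_variables is with_hocr_flag(&args.config_variables), keep the
// other fields unchanged, then delegate to image_to_string.
```

Returning a new map instead of mutating the caller's one keeps the original Args reusable for plain-text calls; the same pattern would apply to tessedit_create_alto.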
P.S. Similarly, you can append the tessedit_create_alto flag to the config_variables, or any other flag that is listed in the tesseract --print-parameters list.
Thanks,
Thomas
Hi! rusty-tesseract is amazing work! It works pretty well on both my Linux and macOS machines!
I have used it in my personal project https://github.com/strrl/dejavu, and I found that I need more detailed information such as page, paragraph, and line, not only the "word". Ref: https://github.com/STRRL/dejavu/issues/7
I found that both alto and hOCR output could make this possible, and both of them are XML-based. I prefer hOCR because its spec still seems to be maintained: https://github.com/kba/hocr-spec/
So here is my proposal: add a function image_to_hocr whose output is a string containing the XML-based hOCR.
What do you think about it? :heart:
I could draft a PR for that.