kba / hocrjs

Working with hOCR in Javascript
http://kba.cloud/hocrjs

Error: Unknown property 'res' in 'bbox 0 0 449 867; image './1.png'; ppageno 1; res 100; rot 0; scan_res 100 100' #68

Closed khashashin closed 1 year ago

khashashin commented 1 year ago

I get the error quoted in the title for the following hOCR file:

<!DOCTYPE html>
<html>
<head>
 <title>1.html</title>
 <meta charset="utf-8" /> 
 <meta name='ocr-system' content='tesseract 5.3.0' />
 <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
 <div title="bbox 0 0 449 867; image './1.png'; ppageno 1; res 100; rot 0; scan_res 100 100" class="ocr_page" id="page_1">
  <div title="bbox 17 399 388 552" class="ocr_carea" id="carea_1_1">
   <p title="bbox 17 399 388 552" class="ocr_par" id="par_1_1">
    <span title="baseline 0.013 -10; bbox 28 443 377 478; x_ascenders 11.75; x_descenders 7.9166665; x_size 43.166668" class="ocr_line" id="line_1_1">
     <span title="bbox 28 443 142 473; x_fsize 31; x_wconf 75" class="ocrx_word" id="word_1_1" lang="ru">ДИНАН</span>
     <span title="bbox 163 443 325 469; x_fsize 31; x_wconf 75" class="ocrx_word" id="word_1_2" lang="ru">КУЛЬТАШ</span>
     <span title="bbox 345 446 377 478; x_fsize 31; x_wconf 36" class="ocrx_word" id="word_1_3" lang="ru">А,</span>
    </span>
    <span title="baseline 0 -5; bbox 63 486 348 515; x_ascenders 11.75; x_descenders 7.9166665; x_size 43.166668" class="ocr_line" id="line_1_2">
     <span title="bbox 63 486 138 511; x_fsize 31; x_wconf 8" class="ocrx_word" id="word_1_4" lang="ru">УЬШ</span>
     <span title="bbox 159 486 348 515; x_fsize 31; x_wconf 56" class="ocrx_word" id="word_1_5" lang="ru">ДӀАЯХАРАН</span>
    </span>
    <span title="baseline 0 -15; bbox 110 527 296 552; x_ascenders 11.75; x_descenders 7.9166665; x_size 43.166668" class="ocr_line" id="line_1_3">
     <span title="bbox 110 527 261 552; x_fsize 31; x_wconf 79" class="ocrx_word" id="word_1_7" lang="ru">НЕКЪАШ</span>
     <span title="bbox 274 528 296 551; x_fsize 31; x_wconf 96" class="ocrx_word" id="word_1_8" lang="ru">А</span>
    </span>
    <span title="baseline 0 0; bbox 17 399 388 434; x_ascenders 10; x_descenders 10; x_size 40" class="ocr_line" id="line_1_4">
     <span title="bbox 17 399 191 434; x_font MS Shell Dlg 2; x_fsize 26; x_wconf 100" class="ocrx_word" id="word_1_9" lang="ru">НОХЧИЙН,</span>
     <span title="bbox 207 399 388 433; x_font MS Shell Dlg 2; x_fsize 26; x_wconf 100" bold="0" class="ocrx_word" id="word_1_10" italic="0" lang="ru">ГӀАЛГӀАЙН</span>
    </span>
   </p>
  </div>
  <div title="bbox 92 77 297 107" class="ocr_carea" id="carea_1_2">
   <p title="bbox 92 77 297 107" class="ocr_par" id="par_1_3">
    <span title="baseline 0 0; bbox 92 77 297 107; x_ascenders 8.5; x_descenders 8.5; x_size 34" class="ocr_line" id="line_1_5">
     <span title="bbox 92 79 118 107; x_font MS Shell Dlg 2; x_fsize 21; x_wconf 100" class="ocrx_word" id="word_1_11" lang="ru">А.</span>
     <span title="bbox 124 77 151 106; x_font MS Shell Dlg 2; x_fsize 21; x_wconf 100" bold="0" class="ocrx_word" id="word_1_12" italic="0" lang="ru">И.</span>
     <span title="bbox 159 78 297 105; x_font MS Shell Dlg 2; x_fsize 21; x_wconf 100" bold="0" class="ocrx_word" id="word_1_13" italic="0" lang="ru">ШАМИЛЕВ</span>
    </span>
   </p>
  </div>
 </div>
 <script src="https://unpkg.com/hocrjs"></script>
</body>
</html>

This hOCR was generated using the tool https://github.com/manisandro/gImageReader

stweil commented 1 year ago

That error was created by gImageReader, so I suggest reporting it there unless you also get an error when using only hocrjs. It is not possible to reproduce the issue without the related image 1.png.

kba commented 1 year ago

The problem is the res property, which is not part of the hOCR spec. I think this should be scan_res, i.e. in https://github.com/manisandro/gImageReader/blob/8a046ec7bf64e996fbc81187306334fbd11873c9/qt/src/hocr/OutputEditorHOCR.cc#L268:

-   attrs["res"] = QString::number(pageInfos.resolution);
+   attrs["scan_res"] = QString::number(pageInfos.resolution);
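To make the failure mode concrete, here is a minimal sketch of how a title attribute can be split into properties and checked against a list of accepted names. This is not the hocrjs implementation. The helper names `parseTitle` and `unknownProperties` are hypothetical, and `KNOWN_PROPERTIES` only collects names that appear in the hOCR file above, not the full list from the spec.

```javascript
// Illustrative subset of property names taken from the file in this issue.
// NOT the list that hocrjs or the hOCR spec uses.
const KNOWN_PROPERTIES = new Set([
  'bbox', 'baseline', 'image', 'ppageno', 'scan_res',
  'x_size', 'x_ascenders', 'x_descenders', 'x_font', 'x_fsize', 'x_wconf',
]);

// Naive parser: splits on ';' and then on whitespace, so it would break on
// a quoted value that itself contains a semicolon.
function parseTitle(title) {
  const props = new Map();
  for (const chunk of title.split(';')) {
    const tokens = chunk.trim().split(/\s+/);
    if (tokens[0] === '') continue; // skip empty chunks such as a trailing ';'
    props.set(tokens[0], tokens.slice(1));
  }
  return props;
}

// Returns the property names that are not in the accepted set.
function unknownProperties(title, known = KNOWN_PROPERTIES) {
  return [...parseTitle(title).keys()].filter((name) => !known.has(name));
}

// Shortened version of the page title from this issue.
const title = 'bbox 0 0 449 867; ppageno 1; res 100; scan_res 100 100';
console.log(unknownProperties(title)); // [ 'res' ]
```

A strict parser that throws on the first unrecognized name would stop at `res` here, which matches the error message in the title.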

khashashin commented 1 year ago

@kba thanks for the clarification. Would you like to open an issue in gImageReader pointing out the non-compliance with the hOCR specification? @manisandro, is there any reason why it is not hOCR spec compliant?

manisandro commented 1 year ago

No reason other than an oversight (or rather, I probably added the attribute under the name res without realizing that there was a standardized attribute name for the same purpose). PRs welcome!
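
Until the exporter is changed, one possible stopgap is to strip the `res` entry from the title string before the file reaches hocrjs. The page title in the file above already carries `scan_res 100 100`, so dropping `res 100` loses no resolution information. The sketch below is a plain string helper. The name `dropProperty` is hypothetical, it is not part of hocrjs or gImageReader, and this thread does not establish whether the remaining properties (for example `rot`) are accepted.

```javascript
// Remove every occurrence of one named property from an hOCR title string.
// Naive: assumes no property value contains a ';'.
function dropProperty(title, name) {
  return title
    .split(';')
    .map((chunk) => chunk.trim())
    // Compare the whole first token so that 'scan_res' is not mistaken for 'res'.
    .filter((chunk) => chunk !== '' && chunk.split(/\s+/)[0] !== name)
    .join('; ');
}

const pageTitle =
  "bbox 0 0 449 867; image './1.png'; ppageno 1; res 100; rot 0; scan_res 100 100";
console.log(dropProperty(pageTitle, 'res'));
// bbox 0 0 449 867; image './1.png'; ppageno 1; rot 0; scan_res 100 100
```

In a browser, the same helper could be applied to the `title` attribute of each `.ocr_page` element before the hocrjs script runs.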