a-pagano closed 6 years ago
Thanks for the quick review!
Yes, I did run the tests. I have 8, 3 and 4 failing tests for libtesseract, tesseract and cuneiform respectively. I suspect most of them fail because the OCR output is compared with the content of the .words and .files files in the tests/output/ folder, and those files do not contain the confidence measure.
Good point, test reference outputs will have to be updated. I'll take care of that later.
Also, what should the default behaviour be when parsing an hOCR file that doesn't have a confidence measure? Should the Box confidence attribute be set to 0, as is done with cuneiform?
I'm asking because, with the change you requested in __parse_confidence, a trace is logged and nothing is returned (effectively returning None) --> the confidence attribute of the Box object is set to None --> the call to get_xml_tag breaks because it expects a number in the string it's printing.
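To illustrate the failure mode described above, here is a minimal sketch. The functions parse_confidence and format_title are hypothetical stand-ins, not the project's actual __parse_confidence or get_xml_tag; the sketch only assumes the title string is built with a %d placeholder for the confidence.

```python
import re


def parse_confidence(title):
    """Hypothetical stand-in: return the x_wconf value, or None if absent."""
    match = re.search(r"x_wconf\s+(\d+)", title)
    if match is None:
        # Nothing is returned explicitly here, so the caller receives None.
        return None
    return int(match.group(1))


def format_title(bbox, confidence):
    """Hypothetical stand-in: build the hOCR title attribute with %d."""
    return "bbox %d %d %d %d; x_wconf %d" % (*bbox, confidence)


confidence = parse_confidence("bbox 638 1797 751 1823")  # no x_wconf present
try:
    format_title((638, 1797, 751, 1823), confidence)
except TypeError as exc:
    # %d cannot format None, which is the breakage described in the comment.
    print("formatting failed:", exc)
```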
Good point, I missed the fact that it's not returning anything.
I think 0 is fine. If it wasn't used before, we can safely assume it won't be used now.
Or we can play it even safer and set it to -1. I assume Tesseract doesn't use negative values for the confidence? (Personally, I haven't looked at the confidence scores yet.)
Finding any information about the confidence measure is very hard (not much in the documentation, nothing in the changelog). I could however find a related issue in a cached version of some now defunct code.google thread. It seems that some versions of tesseract (<=3.02) had negative confidence values (between 0 and -7 :/). In the more recent versions it's a number between 0 and 100 (%).
So I think defaulting it to 0 is the safest bet.
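A sketch of that default, under stated assumptions: parse_confidence is a hypothetical helper (not the project's actual __parse_confidence), and the pattern accepts an optional minus sign only because older Tesseract versions reportedly produced negative values.

```python
import re

# Assumed pattern: "x_wconf" followed by an integer, possibly negative.
_WCONF_RE = re.compile(r"x_wconf\s+(-?\d+)")


def parse_confidence(title, default=0):
    """Return the word confidence from an hOCR title, or `default` if missing."""
    match = _WCONF_RE.search(title)
    return int(match.group(1)) if match else default


print(parse_confidence("bbox 638 1797 751 1823; x_wconf 70"))  # 70
print(parse_confidence("bbox 638 1797 751 1823"))              # 0
```

Returning an integer in every case keeps any later numeric formatting of the confidence from failing on None.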
Looks good to me. I'll update the tests later this week and then do a release with this new feature. Thank you :-)
No problem! Thank you for your work on this project, it's been super useful for a project where I work 👍
Hi, is there a plan when this feature will result in a new release? Thanks!
Sorry, I've been busy with personal matters (moving to another flat, etc) and I forgot to do the release :( (thanks for reminding me :).
I'll try to do it this evening (France ; GMT+1).
Awesome! Thanks a lot :)
Thanks!
This PR makes it possible to parse the individual word confidence measures from Tesseract output and to write them to the simplified hOCR output file, in the title attribute of the Box objects.
Example output:
<span class="ocrx_word" title="bbox 638 1797 751 1823; x_wconf 70">Word</span>
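For readers who want to consume this output, the following is a rough sketch of reading the bounding box and confidence back from such a span using only the standard library. The read_word helper is hypothetical and not part of the library; it assumes every property in the title attribute is a name followed by integers, and it falls back to 0 when x_wconf is absent, as agreed in the discussion.

```python
import xml.etree.ElementTree as ET


def read_word(span_markup):
    """Hypothetical helper: return (text, bbox, confidence) from one hOCR span."""
    span = ET.fromstring(span_markup)
    fields = {}
    # The title holds ";"-separated properties, e.g. "bbox ...; x_wconf 70".
    for prop in span.get("title", "").split(";"):
        parts = prop.split()
        if parts:
            fields[parts[0]] = [int(value) for value in parts[1:]]
    bbox = tuple(fields.get("bbox", ()))
    confidence = fields.get("x_wconf", [0])[0]  # default to 0 when missing
    return span.text, bbox, confidence


markup = '<span class="ocrx_word" title="bbox 638 1797 751 1823; x_wconf 70">Word</span>'
print(read_word(markup))  # ('Word', (638, 1797, 751, 1823), 70)
```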
Note: directly relates to #74 and #58 and less so to #12