[Enhancement]: Propagate ocr confidence to output hocr file

openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab

https://gitlab.gnome.org/World/OpenPaperwork/pyocr

[Enhancement]: Propagate ocr confidence to output hocr file #86

Closed a-pagano closed 6 years ago

a-pagano commented 6 years ago

This PR parses the individual word confidence measures from Tesseract's output and writes them to the simplified output hOCR file, in the title attribute generated for each Box object.

Example output: <span class="ocrx_word" title="bbox 638 1797 751 1823; x_wconf 70">Word</span>
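
For readers consuming this output, here is a minimal sketch of how the `title` attribute shown above could be split back into a bounding box and a confidence value. The helper name `parse_title` and the regular expressions are assumptions made for illustration; this is not pyocr's actual parsing code.

```python
import re

# Hypothetical helper for illustration only, not part of pyocr.
_BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
_WCONF_RE = re.compile(r"\bx_wconf\s+(-?\d+)")


def parse_title(title):
    """Return (((x1, y1), (x2, y2)), confidence) from an hOCR title string.

    confidence is None when the title carries no x_wconf property.
    """
    bbox = _BBOX_RE.search(title)
    if bbox is None:
        raise ValueError("no bbox property in title: %r" % title)
    x1, y1, x2, y2 = (int(v) for v in bbox.groups())

    conf = _WCONF_RE.search(title)
    confidence = int(conf.group(1)) if conf else None
    return ((x1, y1), (x2, y2)), confidence


position, confidence = parse_title("bbox 638 1797 751 1823; x_wconf 70")
```

The sketch accepts a leading minus sign so that a negative value in the title is parsed without raising an error.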

Note: directly relates to #74 and #58 and less so to #12

a-pagano commented 6 years ago

Thanks for the quick review!

a-pagano commented 6 years ago

Yes, I did run the tests. I have 8, 3 and 4 failing tests for libtesseract, tesseract and cuneiform respectively. I suspect most of them fail because the OCR output is compared with the content of the .words and .files files in the tests/output/ folder, and those reference files do not contain the confidence measure.

jflesch commented 6 years ago

Good point, test reference outputs will have to be updated. I'll take care of that later.

a-pagano commented 6 years ago

Also, what should the default behaviour be when parsing an hOCR file that doesn't have a confidence measure? Should the Box confidence attribute be set to 0, as is done with Cuneiform? I'm asking because, with the change you requested in __parse_confidence, a trace is logged and nothing is returned (effectively returning None) --> the confidence attribute of the Box object is set to None --> the call to get_xml_tag breaks because it expects a digit in the string it's printing.
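
The chain above can be reproduced with a small sketch. The names `parse_confidence` and `xml_title` are hypothetical stand-ins, not the real `__parse_confidence` or `get_xml_tag`, and the `%d` formatting is an assumption about how the title string is built. It shows why a `None` confidence breaks the output step and how a numeric default avoids that.

```python
def parse_confidence(title, default=0):
    """Hypothetical stand-in: read x_wconf from an hOCR title string.

    Falls back to `default` when the property is missing or malformed,
    so the caller never receives None.
    """
    for field in title.split(";"):
        parts = field.split()
        if len(parts) == 2 and parts[0] == "x_wconf":
            try:
                return int(parts[1])
            except ValueError:
                break  # malformed value: use the default below
    return default


def xml_title(position, confidence):
    """Hypothetical stand-in: format the title attribute of a word box."""
    (x1, y1), (x2, y2) = position
    # %d requires a number, so a None confidence raises TypeError here.
    return "bbox %d %d %d %d; x_wconf %d" % (x1, y1, x2, y2, confidence)


box = ((638, 1797), (751, 1823))

# Title without x_wconf: the default keeps the formatting step working.
title_with_default = xml_title(box, parse_confidence("bbox 638 1797 751 1823"))

# Passing None instead reproduces the failure.
try:
    xml_title(box, None)
    none_failed = False
except TypeError:
    none_failed = True
```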

jflesch commented 6 years ago

Good point, I missed the fact that it's not returning anything.

I think 0 is fine. If it wasn't used before, we can safely assume it won't be used now. Or we can play it even safer and set it to -1. I assume Tesseract doesn't use negative values for the confidence? (Personally, I haven't looked at the confidence scores yet.)

a-pagano commented 6 years ago

Finding any information about the confidence measure is very hard (not much in the documentation, nothing in the changelog). I could however find a related issue in a cached version of some now defunct code.google thread. It seems that some versions of tesseract (<=3.02) had negative confidence values (between 0 and -7 :/). In the more recent versions it's a number between 0 and 100 (%).

So I think defaulting it to 0 is the safest bet.

jflesch commented 6 years ago

Looks good to me. I'll update the tests later this week and then do a release with this new feature. Thank you :-)

a-pagano commented 6 years ago

No problem! Thank you for your work on this project, it's been super useful for a project where I work 👍

crazzle commented 6 years ago

Hi, is there a plan for when this feature will be included in a new release? Thanks!

jflesch commented 6 years ago

Sorry, I've been busy with personal matters (moving to another flat, etc) and I forgot to do the release :( (thanks for reminding me :).

I'll try to do it this evening (France ; GMT+1).

crazzle commented 6 years ago

Awesome! Thanks a lot :)

jflesch commented 6 years ago

Done : https://pypi.python.org/pypi?:action=display&name=pyocr&version=0.5 :)

crazzle commented 6 years ago

Thanks!