pytesseract image_to_data confidence result not working correctly if configuration tessedit_char_whitelist is used

madmaze / pytesseract

A Python wrapper for Google Tesseract

Apache License 2.0

I am using pytesseract with Tesseract 5.0.1.20220118 and Python 3.9.7. I need to detect text on a license plate, restricted to a whitelist of characters, and get the confidence for each detection. I am using the code below for the recognition.

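 import pytesseract as pt

 # roi: the cropped license-plate image (e.g. a NumPy array or PIL image)
 text = pt.image_to_data(roi, lang='eng', config="--psm 6", output_type='data.frame')
 print(text)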

Output:
   height       conf    text
0      36  -1.000000     NaN
1      36  -1.000000     NaN
2      36  -1.000000     NaN
3      36  -1.000000     NaN
4      26  92.998444       B
5      27  95.961960  708569
6      36  96.753922       |

The | character is not an allowed character, so I use the code below to whitelist the allowed characters.

 text = pt.image_to_data(roi, lang='eng', config="--psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", output_type='data.frame')
 print(text)

Output:
   height  conf      text
0      36  -1.0       NaN
1      36  -1.0       NaN
2      36  -1.0       NaN
3      36  -1.0       NaN
4      36   0.0  B708569]

Now the confidence of the detections is wrong.
Below is the image used for testing.

[Image: the license plate picture used for testing]

Environment

Tesseract Version: 5.0.1.20220118
Platform: windows 64 bit

Current Behavior: When using tessedit_char_whitelist, I get wrong confidence values.

Expected Behavior: Return the same confidence values as without tessedit_char_whitelist, but with only the allowed characters in the result.
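
One possible workaround, sketched here as an assumption rather than a confirmed fix, is to run Tesseract without the whitelist and filter the recognized text afterwards, so the original confidences are kept. The helper name `filter_rows` and the `(conf, text)` pair layout are hypothetical; the pairs would come from the `conf` and `text` columns returned by image_to_data.

```python
import math
import string

# Characters permitted on the plate; mirrors the whitelist used in the report.
ALLOWED = set(string.digits + string.ascii_uppercase)


def filter_rows(rows, allowed=ALLOWED):
    """Keep each word's confidence but drop characters outside `allowed`.

    `rows` is an iterable of (conf, text) pairs (hypothetical layout).
    """
    kept = []
    for conf, text in rows:
        # Rows with conf == -1 are structural (page, block, line), not words.
        if conf < 0:
            continue
        # Skip missing text (None or NaN, as pandas reports empty cells).
        if text is None or (isinstance(text, float) and math.isnan(text)):
            continue
        cleaned = "".join(ch for ch in str(text) if ch in allowed)
        # A word made up only of disallowed characters disappears entirely.
        if cleaned:
            kept.append((conf, cleaned))
    return kept


# Values copied from the first output in the report.
rows = [
    (-1.0, float("nan")),
    (92.998444, "B"),
    (95.961960, "708569"),
    (96.753922, "|"),
]
print(filter_rows(rows))
```

With a DataFrame result, the pairs could be built with `zip(text["conf"], text["text"])`. Note that this only post-processes the output; it does not make Tesseract prefer allowed characters during recognition, which the whitelist does.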

Suggested Fix:

pytesseract just calls Tesseract itself here, so this is most likely not related to pytesseract at all: these are just the parsed results of the Tesseract binary.

You can easily verify this yourself with some monkey-patching that prints the actual underlying call, for example (the subprocess call itself will then error out, but that does not matter here, as we are only interested in the arguments):

import pytesseract
pytesseract.pytesseract.subprocess.Popen = lambda *args, **kwargs: print(args, kwargs)

This should result in roughly

tesseract 68747470733a2f2f692e737461636b2e696d6775722e636f6d2f35744c504d2e6a7067 result -l eng -c 'tessedit_create_tsv=1' --psm 6 -c 'tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'

when translated into an actual call compatible with Bash.
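
As a small illustration of that translation step, the printed argument list can be turned into a copy-pasteable shell command with the standard library's `shlex.join` (Python 3.8+). The list below is an assumption modelled on the command shown above, and `input.png` is a placeholder for the real image path.

```python
import shlex

# Hypothetical argument list, modelled on the call shown above;
# "input.png" is a placeholder for the real image path.
args = [
    "tesseract", "input.png", "result",
    "-l", "eng",
    "-c", "tessedit_create_tsv=1",
    "--psm", "6",
    "-c", "tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ",
]

# shlex.join quotes each argument only where the shell requires it.
command = shlex.join(args)
print(command)
```

`shlex.split(command)` gives back the original list, which is a quick way to check that nothing was lost in quoting.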

The results would be in result.tsv (which is a tab-separated file). This should have the same values as running pytesseract, allowing you to validate that pytesseract does not actually mess with the confidences.
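
To compare the values, the TSV can be read without pytesseract at all. The following is a minimal sketch using only the standard library; the function name `read_confidences` is hypothetical, and the sample string is a shortened stand-in for a real result.tsv, which has more columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text).

```python
import csv
import io


def read_confidences(tsv_text):
    """Return (conf, text) pairs for the word rows of a Tesseract TSV."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # recognized text may contain quote characters
    )
    words = []
    for row in reader:
        conf = float(row["conf"])
        # conf == -1 marks structural rows (page, block, line) without text.
        if conf >= 0:
            words.append((conf, row["text"]))
    return words


# Shortened sample with only three of the real columns.
sample = (
    "level\tconf\ttext\n"
    "4\t-1\t\n"
    "5\t92.998444\tB\n"
    "5\t95.961960\t708569\n"
)
print(read_confidences(sample))

# For an actual run, read the file Tesseract wrote instead:
# with open("result.tsv", encoding="utf-8") as f:
#     print(read_confidences(f.read()))
```

If the pairs printed here match the `conf` and `text` columns returned by image_to_data, the confidences come straight from Tesseract.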
