Closed (gabyiawad closed this 2 years ago)
pytesseract just calls Tesseract itself here, so this most likely is not related to pytesseract at all, as these are just the parsed results of the Tesseract binary.
You can verify this yourself by monkey-patching subprocess.Popen so that it prints the underlying call. The subprocess call itself will then fail, but that does not matter here because only the arguments are of interest:
pytesseract.pytesseract.subprocess.Popen = lambda *args, **kwargs: print(args, kwargs)
This should result in roughly
tesseract 68747470733a2f2f692e737461636b2e696d6775722e636f6d2f35744c504d2e6a7067 result -l eng -c 'tessedit_create_tsv=1' --psm 6 -c 'tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
when translated into a Bash-compatible call.
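The same idea can be sketched a little more defensively. This is only a minimal sketch: the helper `capture_popen_args`, the stand-in function `fake_ocr_call`, and the file name `plate.jpg` are hypothetical names made up for illustration. The fake Popen records the argument list and aborts, and `shlex.join` turns the list into a shell-quoted command line.

```python
import shlex
import subprocess
from unittest import mock


class _Captured(Exception):
    """Raised by the fake Popen to stop right after recording the arguments."""


def capture_popen_args(func, *args, **kwargs):
    """Call func with subprocess.Popen replaced; return the commands it tried to start."""
    recorded = []

    def fake_popen(cmd, *popen_args, **popen_kwargs):
        recorded.append(cmd)
        raise _Captured()  # abort before any real process is spawned

    # The patch is undone automatically when the with-block exits.
    with mock.patch.object(subprocess, "Popen", fake_popen):
        try:
            func(*args, **kwargs)
        except _Captured:
            pass
    return recorded


def fake_ocr_call():
    # Hypothetical stand-in for a library call that launches Tesseract.
    subprocess.Popen([
        "tesseract", "plate.jpg", "result", "-l", "eng",
        "-c", "tessedit_create_tsv=1", "--psm", "6",
        "-c", "tessedit_char_whitelist=0123456789ABC",
    ])


calls = capture_popen_args(fake_ocr_call)
bash_command = shlex.join(calls[0])  # shell-quoted, ready to paste into Bash
print(bash_command)
```

With pytesseract installed, the helper could presumably wrap the image_to_data call from the issue instead of the stand-in; that is untested here, and if the library probes the binary first (for example with a version check), the first recorded command may be that probe rather than the OCR call.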
The results will be written to result.tsv, a tab-separated file. It should contain the same values as the pytesseract output, which lets you confirm that pytesseract does not alter the confidences.
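To compare the two, the TSV can be read with the standard library. The following is a rough sketch, not pytesseract's own parsing code: the function name `read_word_confidences` is made up, and `SAMPLE_TSV` is illustrative data that reuses the height, conf, and text values from the first output in this issue with placeholder values in the other columns.

```python
import csv
import io

# Illustrative sample in Tesseract's TSV column layout. Only height, conf and
# text come from the issue; the remaining values are placeholders.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t0\t0\t100\t36\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t0\t5\t10\t26\t92.998444\tB\n"
    "5\t1\t1\t1\t1\t2\t12\t5\t60\t27\t95.961960\t708569\n"
)


def read_word_confidences(tsv_text):
    """Return (text, conf) pairs for rows that carry recognized text."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        conf = float(row["conf"])
        text = (row["text"] or "").strip()
        # Structural rows (page, block, line) have conf -1 and no text.
        if conf < 0 or not text:
            continue
        words.append((text, conf))
    return words


words = read_word_confidences(SAMPLE_TSV)
print(words)
```

For a real run, the contents of result.tsv would be passed in place of `SAMPLE_TSV`, for example by reading the file with `open("result.tsv", encoding="utf-8").read()`.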
I am using pytesseract 5.0.1.20220118 with Python 3.9.7. I need to detect text on a license plate, restricted to a whitelist of characters, and get the confidence for each detection. I am using the code below for the recognition.
Output:
   height       conf    text
0      36  -1.000000     NaN
1      36  -1.000000     NaN
2      36  -1.000000     NaN
3      36  -1.000000     NaN
4      26  92.998444       B
5      27  95.961960  708569
6      36  96.753922       |
The | character is not an allowed character, so I use the code below to add the whitelisted characters.
Output:
   height  conf      text
0      36  -1.0       NaN
1      36  -1.0       NaN
2      36  -1.0       NaN
3      36  -1.0       NaN
4      36   0.0  B708569]
Now the confidence of the detections is wrong.
Below is the image used for testing.
[image: license plate test image]
Environment
Current Behavior: when using tessedit_char_whitelist, I am getting wrong confidence values.
Expected Behavior: return the same confidences as without tessedit_char_whitelist, but with only the allowed characters in the result.
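Until the whitelist behaves this way, one possible workaround is to run the recognition without the whitelist and filter the recognized words afterwards. The sketch below assumes the detections are already available as (text, conf) pairs; `filter_words` is a hypothetical helper, and the sample values are the ones from the first output above. It does not change anything inside Tesseract, and when characters are stripped from a word, the reported confidence still refers to the original, unstripped word.

```python
# Characters permitted on the plate, matching the whitelist used in this issue.
ALLOWED = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def filter_words(words, allowed=ALLOWED):
    """Drop disallowed characters; discard words that end up empty."""
    kept = []
    for text, conf in words:
        cleaned = "".join(ch for ch in text if ch in allowed)
        if cleaned:
            # The confidence is passed through unchanged.
            kept.append((cleaned, conf))
    return kept


# Word-level detections from the run without the whitelist.
detections = [("B", 92.998444), ("708569", 95.961960), ("|", 96.753922)]
filtered = filter_words(detections)
print(filtered)
```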
Suggested Fix: