nandhiniukl closed 2 years ago
No, there is no simple wrapper to generate data frames for pandas in tesserocr. By the way, your code seems to be more complex than required. The following code should do the same:
import pytesseract
custom_config = r'-l eng --oem 1 --psm 6'
df = pytesseract.image_to_data(thresh, config=custom_config, output_type=pytesseract.Output.DATAFRAME)
Depending on your requirements, a similar output can be produced with some additional wrapper method. For simplicity, I will leave out the modes and language for now, which should be easy to integrate when generating the PyTessBaseAPI instance.
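A minimal sketch of that integration, assuming a tesserocr version whose `PyTessBaseAPI` constructor accepts the `lang`, `psm` and `oem` keywords and takes plain integers for the modes (they should correspond to the constants in `tesserocr.PSM` and `tesserocr.OEM`). The helper names `parse_tesseract_config` and `open_api` are hypothetical and not part of either library; only the three flags from the question are handled.

```python
import shlex


def parse_tesseract_config(config):
    """Translate a pytesseract-style config string into keyword arguments.

    Hypothetical helper: only `-l`, `--oem` and `--psm` are supported.
    """
    tokens = shlex.split(config)
    if len(tokens) % 2:
        raise ValueError('every flag needs a value: {!r}'.format(config))
    kwargs = {}
    # Walk the tokens pairwise: each supported flag is followed by its value.
    for flag, value in zip(tokens[::2], tokens[1::2]):
        if flag == '-l':
            kwargs['lang'] = value
        elif flag == '--oem':
            kwargs['oem'] = int(value)  # Assumed to match the tesserocr.OEM constants.
        elif flag == '--psm':
            kwargs['psm'] = int(value)  # Assumed to match the tesserocr.PSM constants.
        else:
            raise ValueError('unsupported flag: {}'.format(flag))
    return kwargs


def open_api(config, **extra):
    """Create a PyTessBaseAPI instance from a pytesseract-style config string."""
    # Imported lazily so that the parser itself works without tesserocr installed.
    from tesserocr import PyTessBaseAPI
    kwargs = parse_tesseract_config(config)
    kwargs.update(extra)  # E.g. path='/usr/share/tesseract-ocr/4/tessdata'.
    return PyTessBaseAPI(**kwargs)
```

With this, `open_api(r'-l eng --oem 1 --psm 6', path=...)` could replace the direct `PyTessBaseAPI(path=...)` call in a `with` statement.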
If you only need the text and the confidences, and want the entries with confidence -1 (which correspond to empty text) filtered out already, you can use the following wrapper method:
import pandas
from tesserocr import PyTessBaseAPI

def tesserocr_to_pandas(image):
    with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4/tessdata') as api:
        api.SetImage(image)
        api.Recognize()  # Requirement for calling `AllWords`.
        words = api.AllWords()
        confidences = api.AllWordConfidences()
    return pandas.DataFrame(data=dict(text=words, conf=confidences))
If you want the full TSV output including headers, you can use the following wrapper method:
import csv
from io import StringIO
import pandas
from tesserocr import PyTessBaseAPI

def tesserocr_to_pandas(image, config=None):
    with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4/tessdata') as api:
        api.SetImage(image)
        result = api.GetTSVText(0)  # One page only.
    kwargs = {'quoting': csv.QUOTE_NONE, 'sep': '\t'}
    # `config` may hold extra keyword arguments for `pandas.read_csv`; `None` is ignored.
    try:
        kwargs.update(config)
    except (TypeError, ValueError):
        pass
    # `result` does not have the header names, therefore we have to define them manually.
    kwargs['names'] = ['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num',
                       'left', 'top', 'width', 'height', 'conf', 'text']
    return pandas.read_csv(StringIO(result), **kwargs)
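The reduced text/confidence view can also be derived from the full TSV frame by dropping the rows whose `conf` is -1. The sketch below shows this on a made-up two-row sample in the 12-column layout used above (not real OCR output); the helper name `words_only` is hypothetical.

```python
import csv
from io import StringIO

import pandas

# Made-up sample rows for illustration only: one line-level row without text
# (conf == -1) and one word-level row.
SAMPLE_TSV = (
    '4\t1\t1\t1\t1\t0\t10\t10\t200\t30\t-1\t\n'
    '5\t1\t1\t1\t1\t1\t10\t10\t90\t30\t96\tHello\n'
)
COLUMNS = ['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num',
           'left', 'top', 'width', 'height', 'conf', 'text']


def words_only(df):
    """Drop rows with conf == -1 and keep only the text and conf columns."""
    return df.loc[df['conf'] != -1, ['text', 'conf']].reset_index(drop=True)


df = pandas.read_csv(StringIO(SAMPLE_TSV), sep='\t', quoting=csv.QUOTE_NONE, names=COLUMNS)
words = words_only(df)
```

In practice `df` would be the return value of `tesserocr_to_pandas(image)` rather than the sample string.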
Hi,
Is there any similar approach to this in tesserocr?
import pandas as pd
import pytesseract
from pytesseract import Output
custom_config = r'-l eng --oem 1 --psm 6'
data = pytesseract.image_to_data(thresh, config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(data)