sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Image to data #300

Closed nandhiniukl closed 2 years ago

nandhiniukl commented 2 years ago

Hi,

Is there a similar approach to this in tesserocr?

import pandas as pd
import pytesseract
from pytesseract import Output

custom_config = r'-l eng --oem 1 --psm 6'
data = pytesseract.image_to_data(thresh, config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(data)

stefan6419846 commented 2 years ago

No, there is no simple wrapper to generate data frames for pandas in tesserocr. By the way, your code seems to be more complex than required. The following code should do the same:

import pytesseract

custom_config = r'-l eng --oem 1 --psm 6'
df = pytesseract.image_to_data(thresh, config=custom_config, output_type=pytesseract.Output.DATAFRAME)

Depending on your requirements, a similar output can be produced with a small additional wrapper method. For simplicity, I will leave out the engine mode (--oem), page segmentation mode (--psm) and language (-l) for now; they should be easy to add when creating the PyTessBaseAPI instance.
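To illustrate how the omitted options could be carried over, here is a minimal sketch that translates a pytesseract-style config string into keyword arguments for PyTessBaseAPI. The helper names `config_to_kwargs` and `make_api` are made up for this example, only the three flags from the question are handled, and it assumes that plain integers are accepted where tesserocr documents its PSM/OEM constants.

```python
import shlex

# Maps pytesseract-style command-line flags to PyTessBaseAPI keyword names.
FLAG_TO_KWARG = {'-l': 'lang', '--oem': 'oem', '--psm': 'psm'}


def config_to_kwargs(config):
    """Translate a string like '-l eng --oem 1 --psm 6' into keyword arguments."""
    tokens = shlex.split(config)
    kwargs = {}
    # Flags and values alternate, so walk the tokens in pairs.
    for flag, value in zip(tokens[::2], tokens[1::2]):
        name = FLAG_TO_KWARG.get(flag)
        if name is None:
            raise ValueError(f'unsupported flag: {flag}')
        # The language stays a string; the two modes are numeric.
        kwargs[name] = value if name == 'lang' else int(value)
    return kwargs


def make_api(config, **extra):
    """Hypothetical helper: create a PyTessBaseAPI from a config string."""
    from tesserocr import PyTessBaseAPI  # Imported lazily; requires tesserocr.
    # Assumption: integer `psm`/`oem` values are accepted in place of the
    # PSM/OEM constants. `extra` can carry e.g. `path='.../tessdata'`.
    return PyTessBaseAPI(**config_to_kwargs(config), **extra)


print(config_to_kwargs(r'-l eng --oem 1 --psm 6'))
# → {'lang': 'eng', 'oem': 1, 'psm': 6}
```

The resulting object can be used as a context manager in the same way as in the wrapper methods in this thread, for example `with make_api(custom_config, path='/usr/share/tesseract-ocr/4/tessdata') as api: ...`.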

If you only need the text and confidences, and entries with a confidence of -1 (which corresponds to empty text) should already be filtered out, you can use the following wrapper method:

import pandas
from tesserocr import PyTessBaseAPI

def tesserocr_to_pandas(image):
    """Return a DataFrame with one row per recognized word and its confidence."""
    # The tessdata path is system-specific; adjust it to your installation.
    with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4/tessdata') as api:
        api.SetImage(image)
        api.Recognize()  # Required before calling `AllWords`.
        words = api.AllWords()
        confidences = api.AllWordConfidences()

    return pandas.DataFrame(data=dict(text=words, conf=confidences))

If you want the full TSV output including headers, you can use the following wrapper method:

import csv
from io import StringIO

import pandas
from tesserocr import PyTessBaseAPI

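def tesserocr_to_pandas(image, config=None):
    """Return the TSV output as a DataFrame; `config` holds extra `pandas.read_csv` keyword arguments."""
    # The tessdata path is system-specific; adjust it to your installation.
    with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4/tessdata') as api:
        api.SetImage(image)
        result = api.GetTSVText(0)  # Page number 0: one page only.

    kwargs = {'quoting': csv.QUOTE_NONE, 'sep': '\t'}
    try:
        # `config=None` (or any value that cannot be merged) is silently ignored.
        kwargs.update(config)
    except (TypeError, ValueError):
        pass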
    # `result` does not have the header names, therefore we have to define them manually.
    kwargs['names'] = ['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num', 'left', 'top', 'width', 'height', 'conf', 'text']

    return pandas.read_csv(StringIO(result), **kwargs)
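
If pandas is not needed, the same column names can be applied with the standard library alone. The following is a minimal sketch, assuming header-less TSV text in the layout described above; the function name `parse_tsv` and the sample string are made up for illustration and are not real OCR output.

```python
import csv
from io import StringIO

COLUMNS = ['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num',
           'left', 'top', 'width', 'height', 'conf', 'text']


def parse_tsv(tsv_text, drop_empty=True):
    """Parse header-less TSV text into a list of dicts keyed by COLUMNS."""
    reader = csv.reader(StringIO(tsv_text), delimiter='\t', quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # Skip blank or malformed lines.
        row = dict(zip(COLUMNS, fields))
        # Every column except `text` is numeric; `conf` may be fractional.
        for name in COLUMNS[:-1]:
            row[name] = float(row[name]) if name == 'conf' else int(row[name])
        if drop_empty and row['conf'] == -1:
            continue  # A confidence of -1 marks rows without text.
        rows.append(row)
    return rows


# Hypothetical sample: one line-level row followed by two word-level rows.
SAMPLE = (
    "4\t1\t1\t1\t1\t0\t10\t10\t200\t30\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t90\t30\t96.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t110\t10\t100\t30\t91\tworld\n"
)

rows = parse_tsv(SAMPLE)
print([(row['text'], row['conf']) for row in rows])
# → [('Hello', 96.5), ('world', 91.0)]
```

Passing `drop_empty=False` keeps the structural rows (blocks, paragraphs, lines) as well, which mirrors what the unfiltered data frame contains.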