madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

PyTesseract image_to_data dataframe output cannot read the word "None". #555

Open DeFrayne opened 2 months ago

DeFrayne commented 2 months ago

Run the code below and check the DataFrame output: the word "None" shows up as "NaN". If you change the word to "None." (with a trailing period), it displays correctly.

import io

import fitz
import pandas
import pdfplumber
import pytesseract
from PIL import Image as PillowImage

# Build a one-page PDF containing the two test strings.
testDocument = fitz.open()  # empty PDF
testPage = testDocument.new_page()
_ = testPage.insert_text((100, 100), "Hello World", encoding=fitz.TEXT_ENCODING_LATIN)
_ = testPage.insert_text((100, 200), "None", encoding=fitz.TEXT_ENCODING_LATIN)

# Render the page to an image with pdfplumber.
plumberDocument = pdfplumber.open(io.BytesIO(testDocument.tobytes()))
plumberPage = plumberDocument.pages[0]
plumberPageImage = plumberPage.to_image(resolution=300)
plumberPageImage.show()

# Convert the rendered page to a Pillow image via an in-memory PNG.
testImageBytes = io.BytesIO()
plumberPageImage.save(testImageBytes, format="PNG")
pillowImage = PillowImage.open(testImageBytes)

pandas.set_option('display.max_rows', None)
pytesseract.image_to_data(pillowImage, lang="eng", config="--psm 12 --oem 1", output_type=pytesseract.Output.DATAFRAME)

stefan6419846 commented 2 months ago

This is a pandas configuration option. I use a line like pandas_config['converters'] = dict(text=str) and pass pandas_config to image_to_data accordingly. In theory, you should be able to use any function (anonymous or regular) instead of str to suit your needs.

DeFrayne commented 2 months ago

Thank you for the response, but I am not following how to do this - can you show me some inline code as an example?

stefan6419846 commented 2 months ago

In the most basic case, just use the following line:

pytesseract.image_to_data(pillowImage, lang="eng", config="--psm 12 --oem 1", output_type=pytesseract.Output.DATAFRAME, pandas_config={"converters": {"text": str}})
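
For anyone who wants to see why the converter helps without running Tesseract, here is a minimal pandas-only sketch. It rests on two assumptions: that pytesseract builds the DataFrame by handing Tesseract's TSV output to pandas.read_csv with a tab separator and quoting disabled, and that your pandas version (2.0 or later, as far as I know) lists "None" among the default strings read as missing values. SAMPLE_TSV, parse and keep_text are made-up names for illustration and are not part of pytesseract.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for Tesseract's TSV output (only two of its columns).
SAMPLE_TSV = "conf\ttext\n96\tHello\n95\tWorld\n94\tNone\n"

# Parser options pytesseract is assumed to use for DATAFRAME output.
BASE_KWARGS = {"sep": "\t", "quoting": csv.QUOTE_NONE}


def parse(tsv, **pandas_config):
    """Parse TSV text with pandas, merging in extra read_csv options."""
    return pd.read_csv(io.StringIO(tsv), **BASE_KWARGS, **pandas_config)


def keep_text(value):
    """Example custom converter: keep the raw string, trimmed of whitespace."""
    return value.strip()


# Default parsing: recent pandas versions are expected to read "None" as NaN.
default_df = parse(SAMPLE_TSV)

# A converter on the "text" column keeps every recognized word as a string.
str_df = parse(SAMPLE_TSV, converters={"text": str})
custom_df = parse(SAMPLE_TSV, converters={"text": keep_text})

print(default_df["text"].tolist())
print(str_df["text"].tolist())
print(custom_df["text"].tolist())
```

The converter receives the raw cell string, so the column skips the missing-value detection that otherwise swallows the word. One side effect to keep in mind: rows without recognized text will then hold an empty string instead of NaN.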

DeFrayne commented 2 months ago

Thank you for clarifying!