madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

Force text column of dataframe to be of string type #530

Closed mausam3407 closed 6 months ago

mausam3407 commented 8 months ago

When OCR recognizes a numeric value such as 0.00 or 1992, the dataframe conversion turns the text into 0.0 and 1992.0, which are floats. To preserve the original text, I have forced the text column to be of str type.
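A minimal sketch of the behaviour described above, using pandas directly rather than running Tesseract. The TSV string is a made-up, three-column stand-in for real OCR output, and the read_csv arguments are an assumption about how the wrapper parses it. It also shows why casting after parsing comes too late: the original spelling is already gone.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for Tesseract's TSV output, reduced to three columns.
# Every recognized word is numeric, which is the case described above.
tsv = "level\tconf\ttext\n5\t96\t0.00\n5\t95\t1992\n"

# Assumed parsing settings: tab-separated, no quote handling.
inferred = pd.read_csv(io.StringIO(tsv), sep="\t", quoting=csv.QUOTE_NONE)

# Type inference turns the all-numeric text column into floats.
print(inferred["text"].tolist())              # [0.0, 1992.0]

# Casting afterwards cannot restore "0.00" or "1992".
print(inferred["text"].astype(str).tolist())  # ['0.0', '1992.0']
```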

stefan6419846 commented 8 months ago

Could you please add a corresponding test as well?

mausam3407 commented 8 months ago

@stefan6419846 Do we want a test case specific to this scenario, or just a test case asserting that the text column is of str type?

stefan6419846 commented 8 months ago

We should probably have a test that covers the common cases, i.e. the text being a real string in one line and a number in another line.

mausam3407 commented 8 months ago

@stefan6419846 This issue only occurs when running OCR on an image that contains nothing but numbers, like the values I mentioned. So I will have to add another image for that.

stefan6419846 commented 8 months ago

Then we should probably have two tests, although one might already exist.
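The two cases discussed here could look roughly like the sketch below. It is not the test that was added to the PR: the TSV strings replace real test images, and text_column() is a hypothetical helper that parses them with the text column forced to str.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-ins for OCR output; real tests would run on images.
MIXED_TSV = "level\ttext\n5\tHello\n5\t1992\n"   # a word and a number
NUMERIC_TSV = "level\ttext\n5\t0.00\n5\t1992\n"  # numbers only


def text_column(tsv):
    """Parse the TSV with the text column forced to str and return it as a list."""
    df = pd.read_csv(
        io.StringIO(tsv),
        sep="\t",
        quoting=csv.QUOTE_NONE,
        dtype={"text": str},
    )
    return df["text"].tolist()


def test_mixed_text_stays_string():
    assert text_column(MIXED_TSV) == ["Hello", "1992"]


def test_numeric_only_text_stays_string():
    assert text_column(NUMERIC_TSV) == ["0.00", "1992"]


test_mixed_text_stays_string()
test_numeric_only_text_stays_string()
```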

mausam3407 commented 8 months ago

@stefan6419846
Yeah, one is there. I'll add the other one.

mausam3407 commented 8 months ago

@stefan6419846 Hi! I have added the test cases.

mausam3407 commented 7 months ago

@bozhodimitrov Hi! Can you please take a look into this PR?

grantrosse commented 6 months ago

You can use the pandas_config parameter of image_to_data() to pass the argument to the embedded read_csv(): df = pytesseract.image_to_data(image, lang=..., output_type='data.frame', config=..., pandas_config={'dtype': {'text': str}})
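A rough sketch of the forwarding idea behind this workaround, without calling Tesseract. read_tsv() is a hypothetical stand-in, not pytesseract's own function, and the default arguments are an assumption: caller-supplied options are merged over the defaults and handed to read_csv().

```python
import csv
import io

import pandas as pd


def read_tsv(tsv, pandas_config=None):
    """Hypothetical stand-in: merge caller options over defaults, then parse."""
    kwargs = {"sep": "\t", "quoting": csv.QUOTE_NONE}  # assumed defaults
    kwargs.update(pandas_config or {})
    return pd.read_csv(io.StringIO(tsv), **kwargs)


# Made-up OCR output containing only numbers.
TSV = "level\ttext\n5\t0.00\n5\t1992\n"

default_df = read_tsv(TSV)
string_df = read_tsv(TSV, pandas_config={"dtype": {"text": str}})

print(default_df["text"].tolist())  # [0.0, 1992.0]
print(string_df["text"].tolist())   # ['0.00', '1992']
```

Because the dtype is applied while parsing, the text keeps its original spelling and no library change is needed on the caller's side.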