add updated code for new data extractor function

shrivastava95 / odia-dictionary

A repository for organizing contributions to the creation of an Odia Dictionary dataset for the Dictionary Augmented Translations project in C4GT'23.

add updated code for new data extractor function #15

Open shrivastava95 opened 1 year ago

shrivastava95 commented 1 year ago

Added a new, improved image_to_string_v2 method that runs CLIP zero-shot classification ahead of Tesseract, so that images containing text in different languages are told apart before they are parsed by OCR.
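For readers who want a picture of the idea, here is a minimal sketch of routing cropped text regions to Tesseract based on a CLIP zero-shot prediction. It is not the code from this PR: the prompt strings, the `PROMPT_TO_TESS_LANG` mapping, the choice of the `openai/clip-vit-base-patch32` checkpoint, and the helper names `pick_language`, `ocr_by_language`, and `build_default_backends` are all assumptions made for illustration. It also assumes the `transformers`, `pytesseract`, and Tesseract `ori`/`eng` language packs are installed if the default backends are used.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical mapping from CLIP text prompts to Tesseract language codes.
# "ori" and "eng" are Tesseract's codes for Odia and English.
PROMPT_TO_TESS_LANG = {
    "a photo of text written in Odia script": "ori",
    "a photo of text written in English": "eng",
}


def pick_language(scores: List[Dict[str, Any]]) -> str:
    """Return the Tesseract language code of the highest-scoring prompt."""
    best = max(scores, key=lambda s: s["score"])
    return PROMPT_TO_TESS_LANG[best["label"]]


def ocr_by_language(
    crops: List[Any],
    classify: Callable[[Any], List[Dict[str, Any]]],
    ocr: Callable[[Any, str], str],
) -> List[Tuple[str, str]]:
    """Classify each cropped region, then OCR it with the matching language."""
    results = []
    for crop in crops:
        lang = pick_language(classify(crop))
        results.append((lang, ocr(crop, lang)))
    return results


def build_default_backends():
    """Build CLIP and Tesseract backends (imports are deferred because they are heavy)."""
    from transformers import pipeline
    import pytesseract

    clip = pipeline(
        "zero-shot-image-classification",
        model="openai/clip-vit-base-patch32",  # assumed checkpoint
    )
    labels = list(PROMPT_TO_TESS_LANG)

    def classify(crop):
        # Returns a list of {"label": ..., "score": ...} dicts.
        return clip(crop, candidate_labels=labels)

    def ocr(crop, lang):
        return pytesseract.image_to_string(crop, lang=lang)

    return classify, ocr
```

With real models, `classify, ocr = build_default_backends()` followed by `ocr_by_language(crops, classify, ocr)` would process a list of PIL image crops. Passing the classifier and OCR function in as arguments keeps the routing logic easy to test without loading either model.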

The image below demonstrates CLIP's ability to recognize words from different languages: it assigns the correct language to every bounding box.