Extraction, Verification and Masking of Aadhaar UIDs from photos and scanned documents.
The solution uses the Tesseract Optical Character Recognition (OCR) engine, through the PyTesseract library, and OpenCV for image processing. It can be divided into 3 sub-tasks: extraction, verification and masking.
The extraction task can be further divided into 2 sub-tasks: 1) Preprocess the image and then use the PyTesseract library to extract all recognizable text from the image, along with the corresponding bounding boxes. 2) Use a RegEx search to find possible UID candidates. An Aadhaar UID consists of 12 numeric digits, so any 12-digit number in the text returned by the OCR engine is a possible UID.
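The candidate search in sub-task 2 can be sketched roughly as follows. This is a minimal illustration and not necessarily the code used in this repository: the function name `find_uid_candidates` is hypothetical, and allowing optional whitespace between groups of four digits is an assumption about how the OCR text may be laid out.

```python
import re

# 12 digits, optionally split into three groups of four by whitespace
# (the grouping is an assumption), and not part of a longer digit run.
UID_PATTERN = re.compile(r"(?<!\d)\d{4}\s?\d{4}\s?\d{4}(?!\d)")


def find_uid_candidates(ocr_text: str) -> list:
    """Return every 12-digit candidate found in the OCR text, whitespace removed."""
    return [re.sub(r"\s", "", match) for match in UID_PATTERN.findall(ocr_text)]
```

At this stage every match is only a candidate; a 12-digit match can still be an unrelated number.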
The problem now is that the image may need some pre-processing before text can be extracted from it. Many factors affect the performance of the Tesseract engine, such as orientation, noise, resolution and illumination. To tackle these problems, we use the following pipeline:
a) Try without any processing. b) If (a) doesn’t work, try using OpenCV’s Gaussian Blur to remove random noise, then try again. c) If (b) doesn’t work, rotate the image by 90 degrees and try (a) and (b) again.
In this way, steps (a), (b) and (c) are repeated 4 times (for 0, 90, 180 and 270 degrees of rotation). If UID candidates are found at any point, we stop, on the assumption that all UIDs in the image are readable in that same setting. If these steps fail to produce any candidates, we generate a super-resolution version of the image using ESRGAN and retry the pipeline described above.
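The retry logic described above can be sketched as plain control flow. This is only an illustration under stated assumptions: the function names are hypothetical, and the OCR search, blur, rotation and upscaling steps are passed in as callables (for example, wrappers around PyTesseract, OpenCV's Gaussian blur and an ESRGAN model) rather than implemented here.

```python
def search_with_retries(image, find_uids, blur, rotate90):
    """Try each of the 4 orientations, raw and then blurred; stop at the first hit."""
    current = image
    for _ in range(4):  # 0, 90, 180 and 270 degrees
        # (a) try the image without any processing
        uids = find_uids(current)
        if uids:
            return uids
        # (b) try again after blurring to suppress random noise
        uids = find_uids(blur(current))
        if uids:
            return uids
        # (c) rotate by 90 degrees and repeat (a) and (b)
        current = rotate90(current)
    return []


def search_with_super_resolution(image, find_uids, blur, rotate90, upscale):
    """Run the pipeline once; if it finds nothing, retry on an upscaled image."""
    uids = search_with_retries(image, find_uids, blur, rotate90)
    if uids:
        return uids
    return search_with_retries(upscale(image), find_uids, blur, rotate90)
```

Passing the steps in as callables keeps the expensive super-resolution step as a last resort and makes the control flow easy to test with stand-in functions.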
In the verification step, we filter out invalid UIDs using the Verhoeff algorithm, a checksum validation method, because the RegEx search can produce many unintended matches that are of no use. In the masking step, we use OpenCV’s functions to black out the first 8 digits of every valid UID, with the help of the character-wise bounding boxes found in the extraction step.
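The checksum filter can be sketched as below. This is a minimal version of the standard Verhoeff check using its usual lookup tables; the function names `verhoeff_valid` and `filter_valid_uids` are hypothetical and not necessarily those used in this repository.

```python
# Multiplication table of the dihedral group D5 used by the Verhoeff scheme.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]

# Position-dependent permutation table (row index is position modulo 8).
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(number: str) -> bool:
    """Return True if the digit string, check digit last, passes the Verhoeff check."""
    if not (number.isascii() and number.isdigit()):
        return False
    checksum = 0
    # Digits are processed right to left, starting with the check digit.
    for position, char in enumerate(reversed(number)):
        checksum = _D[checksum][_P[position % 8][int(char)]]
    return checksum == 0


def filter_valid_uids(candidates):
    """Keep only 12-digit candidates whose checksum is valid."""
    return [uid for uid in candidates if len(uid) == 12 and verhoeff_valid(uid)]
```

For any fixed first 11 digits, exactly one final digit passes the check, so a random 12-digit match has roughly a one in ten chance of slipping through.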
Our solution pipeline uses the following tools and libraries:
1) Google Colab : Used as the development environment. 2) NumPy : Used for handling high dimensional arrays. 3) OpenCV and PIL : Used for image processing. 4) PyTesseract : Used for OCR. 5) RegEx : Used for Regular Expression searches. 6) img2pdf and pdf2image : Used for handling .pdf files. 7) ISR : Used for generating Super Resolution images.
Thanks for going through this Repository! Have a nice day. Got any Queries? Feel free to contact me. Saini Rohan Rao