konstantint / PassportEye

Extraction of machine-readable zone information from passports, visas and id-cards via OCR
MIT License

Choosing tessdata to get more accuracy #46

Closed SaddamBInSyed closed 4 years ago

SaddamBInSyed commented 4 years ago

Hi @konstantint

Thanks for your work. I am using this library to extract MRZ values from national ID cards, but I am not satisfied with the accuracy I am currently getting from the MRZPipeline() class, so I would like to ask:

  1. How can I improve the accuracy?
  2. How can I use the tessdata_best trained files?

I have installed Tesseract 4.0 (tesseract-ocr-setup-4.00.00dev.exe) and I have the following files in the tessdata folder:

(screenshot of the tessdata folder contents)

Please advise.

konstantint commented 4 years ago

To use a different model, specify extra_cmdline_params="-l osd" (assuming osd.traineddata is the new model you created).
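Something along these lines should work with the read_mrz helper (id_card.jpg is just a placeholder filename here, and "osd" stands for whatever your model file is actually named):

```python
from passporteye import read_mrz

# Pass extra command-line options through to Tesseract. "-l osd" tells it
# to use osd.traineddata from its tessdata directory; replace "osd" with
# the name of the model you trained.
mrz = read_mrz('id_card.jpg', extra_cmdline_params='-l osd')

if mrz is not None:
    print(mrz.to_dict())
```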

As for improving the accuracy: besides trying to train a dedicated Tesseract model (although, I must admit, I do not know of cases where a custom model brought statistically significant gains), perhaps you could make sure the input images are as clear as possible.
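A rough sketch of the kind of cleanup I mean (this is not part of PassportEye; the filenames and the scikit-image steps are just an illustration):

```python
from skimage import exposure, img_as_ubyte, io, transform
from passporteye import read_mrz

# Load the photo as grayscale, upscale it and stretch the contrast so
# that the MRZ characters come out as crisp as possible.
img = io.imread('id_card.jpg', as_gray=True)
img = transform.rescale(img, 2.0)
img = exposure.rescale_intensity(img)

# read_mrz expects a file path or file object, so write the cleaned-up
# image out before running the pipeline on it.
io.imsave('id_card_clean.png', img_as_ubyte(img))
mrz = read_mrz('id_card_clean.png')
```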

One common issue that the current implementation handles rather poorly is the situation where the document lies on some kind of patterned background (e.g. a table).

You can try running the mrz script with the --save_roi parameter on the badly recognized examples and examine the regions extracted by the pipeline. If the region is correct (i.e. it includes the actual MRZ in the correct orientation), tuning Tesseract is the way to go. If the region is usually incorrect, then the problem lies in the image preprocessing.
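The same check can be done from the Python API (a sketch relying on the aux['roi'] field that is populated when save_roi=True; bad_example.jpg is a placeholder):

```python
import matplotlib.pyplot as plt
from passporteye import read_mrz

# Keep the extracted region of interest alongside the recognition result.
mrz = read_mrz('bad_example.jpg', save_roi=True)

if mrz is None:
    print('No MRZ-like region was found at all')
else:
    # mrz.aux['roi'] is the image region that was handed to Tesseract.
    # If it does not show the MRZ lines cleanly, the problem is in the
    # preprocessing; if it does, Tesseract itself needs tuning.
    plt.imsave('roi.png', mrz.aux['roi'], cmap='gray')
```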

If you discover a useful way to process images that you think should be added to the current PassportEye pipeline, let me know!