opentext-idol / idol-rich-media-tutorials

A set of guides to get you doing great things with IDOL Media Server!

How can I do OCR on a large dataset of identity card images? #1

Emilia96 opened 1 month ago

Emilia96 commented 1 month ago

Hi @chris-blanks-mf , I followed your guide about ID card OCR. If I extract text from a single ID card (the same one used to create the template), OCR works well. When I try to process another, similar ID card, it doesn't work well. My question is: how can I do OCR on a large dataset of identity card images, extracting the name, surname and date of birth from each one?

Can you tell me how I can solve this problem?

Thanks a lot. Best regards, Emilia

chris-blanks-mf commented 1 month ago

Hi Emilia,

There are two main approaches:

  1. OCR + Eduction

    • First run OCR to extract all the text from a document, outputting to plain text. You can quickly preview how that might look with the demo GUI page, or script it as in the sketch after this list.
    • Next run IDOL Eduction, using grammars to extract names and dates from that plain text, as in this tutorial.
  2. Structured OCR

    • First identify each type of ID card you will be scanning.
    • Next generate a template that's unique to each identity card type, including a good "anchor image" (a symbol or logo to be trained for Object Recognition), as well as the regions that contain the text you wish to scan. This process was followed in this tutorial.
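
For step one of the first approach, here is a minimal sketch of how you might script the OCR pass over a folder of images, assuming a Media Server listening on localhost:14000 and a session configuration saved under the hypothetical name `ocr_to_text` that runs OCR and writes plain-text output:

```python
# A minimal sketch, not from the tutorial: batch-submit ID card images to
# Media Server for OCR. Assumes Media Server is listening on localhost:14000
# and that "ocr_to_text" is the (hypothetical) name of a session configuration
# you have saved that runs OCR and writes plain-text output.
import os
from urllib.parse import quote

import requests

MEDIA_SERVER = "http://localhost:14000"

def submit_for_ocr(image_path):
    # Queue an asynchronous process action for one image; the response
    # contains a token you can use to track the job.
    url = (f"{MEDIA_SERVER}/action=process"
           f"&source={quote(image_path)}"   # path as seen by the Media Server machine
           f"&configName=ocr_to_text")      # hypothetical session config name
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    folder = "id_cards"
    for name in sorted(os.listdir(folder)):
        print(name, submit_for_ocr(os.path.abspath(os.path.join(folder, name))))
```

The process action is asynchronous, so in practice you would either poll each job for its result or have the session configuration write the text out to files.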

Strengths and weaknesses

Method 1. You are free to work with any ID card type, including formats you don't currently know about. However, errors can compound: any mistake in the original OCR will hurt the downstream Eduction, and you need a pipeline that includes both Media Server and Eduction Server.
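
To sketch the second half of that pipeline, something like the following could pass the OCR text to an Eduction Server; the action name, parameter and port here are written from memory, so treat them as assumptions and check the Eduction Server reference guide for your version:

```python
# A rough sketch only: send the plain text produced by OCR to an Eduction
# Server configured with person-name and date grammars. The action name,
# parameter and port below are assumptions to verify against your
# Eduction Server documentation.
from urllib.parse import quote

import requests

EDUCTION_SERVER = "http://localhost:13000"  # assumed ACI port

def extract_entities(ocr_text):
    url = f"{EDUCTION_SERVER}/action=EduceFromText&Text={quote(ocr_text)}"
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text  # XML response listing the matched names and dates

print(extract_entities("SURNAME DOE NAME JOHN DATE OF BIRTH 01.01.1990"))
```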

Method 2. You can accurately capture text while preserving the original structure of the ID card, and you can tune the OCR for the expected character sets (e.g. only numbers and punctuation for dates). However, you need to know in advance which ID card types you will work with, and you need a good anchor image to match against.

Emilia96 commented 1 month ago

Hi @chris-blanks-mf, thanks for your answer. I followed the Structured OCR approach but I am not able to extract the name, surname and date of birth from a different ID card of the same type. I verified that, on the identity card from which I took the anchor, the regions are drawn correctly. On a second identity card of the same type, some regions are positioned incorrectly. Both ID cards were scanned in the same way.

  1. What do you mean by a good anchor image?
  2. Do I need a training phase with a dataset of ID cards of the same type? If so, how does this training work?

Thanks in advance, Emilia

chris-blanks-mf commented 1 month ago

The "anchor image" in this case it the top of the Turkish Drivers license, i.e. show here and trained with Object Recognition.

Before running OCR, Media Server tries to match that anchor image. If it is detected, its location is used as the reference for the OCR bounding boxes, so if that match does not work well, you will potentially get the wrong OCR bounding boxes.
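
To make that concrete, here is a toy calculation (plain Python arithmetic, not Media Server configuration) showing how every OCR region inherits any error in the anchor match:

```python
# Illustrative arithmetic only, not Media Server configuration: OCR regions
# are placed relative to wherever the anchor was detected, so an inaccurate
# anchor match shifts every region with it.
def place_region(anchor_xy, offset_xy, size_wh):
    """Return (left, top, width, height) of an OCR region, given the
    detected anchor position and the region's offset from the anchor."""
    ax, ay = anchor_xy
    dx, dy = offset_xy
    w, h = size_wh
    return (ax + dx, ay + dy, w, h)

print(place_region((100, 60), (20, 150), (300, 40)))   # anchor matched correctly
print(place_region((100, 100), (20, 150), (300, 40)))  # anchor matched 40 px too low:
                                                       # the region now covers the wrong text
```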

Emilia96 commented 1 month ago

Hi Chris, unfortunately I cannot solve the problem.

  1. Once I have decided on the anchor image, when I outline the AnchorBoxPixels region, do I need to be extremely accurate relative to the chosen image?

  2. If I have two identity cards from the same template but the two images contain a different number of pixels, is the anchor still OK, or do I need to adapt the scripts/regions to make it work?

Thanks in advance, Emilia

chris-blanks-mf commented 1 month ago

  1. Yes, you should be accurate about the OCR bounding boxes.
  2. No, you can define regions in percent rather than pixels, so that they are independent of the actual size of the image.

The tutorial you're working from expects those OCR regions to be in percent: https://github.com/opentext-idol/idol-rich-media-tutorials/blob/main/tutorials/showcase/id-card-ocr/README.md#add-the-ocr-regions-to-the-template
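
If it helps while building the template, here is a small helper (not part of the tutorial) for converting a region drawn in pixels on your template image into percentages, which is what lets the same region apply to scans of any resolution:

```python
# Not part of the tutorial: convert a region drawn in pixels on the template
# image into percentages of the image size, so the same values apply to a
# scan of the same card at any resolution.
def pixels_to_percent(left, top, width, height, image_width, image_height):
    return (
        round(100.0 * left / image_width, 2),
        round(100.0 * top / image_height, 2),
        round(100.0 * width / image_width, 2),
        round(100.0 * height / image_height, 2),
    )

# A region drawn on a 1000 x 630 px template image:
print(pixels_to_percent(120, 260, 380, 50, 1000, 630))
# -> (12.0, 41.27, 38.0, 7.94), which describes the same area on a
#    smaller or larger scan of the same card.
```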

Emilia96 commented 1 month ago

I set the regions as percentages, just like in your tutorial. Unfortunately, the regions are captured correctly only on the image from which I drew them. If I process another image of the same type, it no longer works.

Here are the two images I'm trying to process and having problems with:

(attached images: id1, id2)

The regions were still drawn from the id1 image. Can you help me understand the problem? id1 certainly has more pixels than id2, but shouldn't it still work just as well?

Also, if we want to use Eduction, I can extract all the text from the image, but I can't differentiate the parts I want to keep. For example, if the text comes back as one string of characters, how can I know whether a given word is a first name or a surname, when a surname can often also be a valid first name, and so on?

If it’s easier for you, we can also arrange a short meeting so it’s easier to deal with it.