konstantint / PassportEye

Extraction of machine-readable zone information from passports, visas and id-cards via OCR
MIT License

Understand valid_score #17

Closed alexandrubujor closed 6 years ago

alexandrubujor commented 6 years ago

Hi,

I am using PassportEye to read data from an ID and I am trying to understand the meaning of the valid_score parameter. Is it an estimate of how well the document was OCR-ized?

I got something like valid_score = 5 for a document which was read pretty well.

See below:

        mrz_type                 TD3
        valid_score              5
        type                     P<
        country                  ARE
        number                   ZCOL99623
        date_of_birth            581212
        expiration_date          M20103
        nationality              ARE
        sex                      9
        names                    ABDULLA AHMED EBRAHEEM JASIM
        surname                  ALHOSANI
        personal_number          5<<<<<<<<<<<<<
        check_number             5
        check_date_of_birth      1
        check_expiration_date    1
        check_composite          <
        check_personal_number    <
        valid_number             False
        valid_date_of_birth      False
        valid_expiration_date    False
        valid_composite          False
        valid_personal_number    False
        method                   direct
        walltime                 2.7720367908477783
        filename                 /opt/arab-id.png

Using tesseract as below:

        $ tesseract -v
        tesseract 3.05.01
         leptonica-1.75.3
          libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7

Thanks, Alex

konstantint commented 6 years ago

valid_score is an ad-hoc correctness check. See the docs of the MRZ class:

        The parsing computes three validation indicators:
            valid_check_digits - a list of booleans indicating which of the "check digits" in the MRZ were valid.
                                TD1/TD2 has four check digits, TD3 - five, MRVA/B - three.
                                The separate booleans are also available as valid_number, valid_date_of_birth, valid_expiration_date, valid_composite
                                and valid_personal_number (TD3 only).
            valid_line_lengths - a list of booleans, indicating which of the lines (3 in TD1, 2 in TD2/TD3) had the expected length.
            valid_misc         - a list of booleans, indicating various additional validity checks (unspecified, see code).
        The valid_score field counts the "validity score" according to the flags above and is an int between 0 and 100.

Here's the actual computation:

        self.valid_score = 10*sum(self.valid_check_digits) + sum(self.valid_line_lengths) + sum(self.valid_misc) + 1
        self.valid_score = 100*self.valid_score//(40+3+1+1)
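
To see how the flags feed into the score, here is a minimal standalone sketch of the two lines above. The function name and the example flag lists are hypothetical, chosen only for illustration; this is not part of the PassportEye API, and the flag counts in your installed version may differ.

```python
def valid_score(valid_check_digits, valid_line_lengths, valid_misc):
    """Recompute the score from three lists of booleans.

    Mirrors the two-line formula quoted above; the argument names
    follow the MRZ docstring, the function itself is hypothetical.
    """
    # Each valid check digit is worth 10 points, every other flag 1 point,
    # plus a constant 1.
    raw = 10 * sum(valid_check_digits) + sum(valid_line_lengths) + sum(valid_misc) + 1
    # Scale by the constant denominator used in the quoted code.
    return 100 * raw // (40 + 3 + 1 + 1)


# Hypothetical inputs: four check digits, three lines, one misc check.
all_good = valid_score([True] * 4, [True] * 3, [True])
all_bad = valid_score([False] * 4, [False] * 3, [False])
print(all_good, all_bad)
```

Because a check digit weighs ten times as much as any other flag, a document whose check digits all fail stays near the bottom of the scale even when the rest of the text was read correctly.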

In your case you can see that the checksums for the date of birth and expiration date are wrong (in particular, expiration date is not even a date). The personal code is lacking along with the corresponding check digit, hence these are also deemed wrong. The composite check digit is also lacking and hence wrong.

It may be the case that the particular document you are reading is simply not following the standards of how the check digits should be computed.
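
If you want to verify a check digit by hand, the following is a minimal sketch of the standard ICAO 9303 scheme (weights 7, 3, 1 repeated; digits count as themselves, A-Z as 10-35, the filler `<` as 0; the result is the weighted sum modulo 10). The function name is hypothetical and this is not PassportEye's own implementation.

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO 9303 check digit for one MRZ field (illustrative sketch)."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if "0" <= ch <= "9":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        elif ch == "<":
            value = 0  # filler character
        else:
            raise ValueError(f"unexpected MRZ character: {ch!r}")
        total += value * weights[i % 3]
    return total % 10


# Fields as reported in the output above.
print(mrz_check_digit("581212"))     # date_of_birth as read by OCR
print(mrz_check_digit("ZCOL99623"))  # document number as read by OCR
```

For the values in the report, this gives 9 for the date of birth 581212, while the check digit that was read is 1, so valid_date_of_birth is False. The same mismatch occurs for the document number: the computed digit is 9 and the one read is 5.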