konstantint / PassportEye

Extraction of machine-readable zone information from passports, visas and id-cards via OCR
MIT License

conversion from float64 to uint8. #27

Closed henrytom1703 closed 5 years ago

henrytom1703 commented 5 years ago

code:

from passporteye import read_mrz

mrz = read_mrz('/home/tomtony/Downloads/ICAO_Example.png')

Error

/home/tomtony/python_env/test-env/bin/python /home/tomtony/sourcecode/fx001/backend/TestH6.py
WARNING:root:Lossy conversion from float64 to uint8. Range [0, 1]. Convert image to uint8 prior to saving to suppress this warning.
Traceback (most recent call last):
  File "/home/tomtony/sourcecode/fx001/backend/TestH6.py", line 3, in <module>
    mrz = read_mrz('/home/tomtony/Downloads/ICAO_Example.png')
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/passporteye/mrz/image.py", line 337, in read_mrz
    mrz = p.result
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/passporteye/mrz/image.py", line 325, in result
    return self['mrz_final']
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/passporteye/util/pipeline.py", line 102, in __getitem__
    self._compute(key)
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/passporteye/util/pipeline.py", line 109, in _compute
    self._compute(d)
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/passporteye/util/pipeline.py", line 111, in _compute
    results = self.components[cname](*inputs)
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/passporteye/mrz/image.py", line 187, in __call__
    roi, text, mrz = self.box_to_mrz(b, img, img_small, scale_factor)
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/passporteye/mrz/image.py", line 227, in __call__
    text = ocr(roi, extra_cmdline_params=self.extra_cmdline_params)
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/passporteye/util/ocr.py", line 45, in ocr
    config=config)
  File "/home/tomtony/python_env/test-env/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 194, in run_tesseract
    raise TesseractError(status_code, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")

Process finished with exit code 1

viACHIZv commented 5 years ago

I am facing this same issue as well. It appears to be affecting the 1.3.0 version. 1.2.4 works fine.

konstantint commented 5 years ago

Yes, it stems from yesterday's 1.3.0 update.

In 1.3.0 I hard-coded Tesseract to use the "legacy" recognizer (rather than the LSTM one), because it seems to work better in pretty much all the cases I've tested. However, it seems that some installations of Tesseract v4 do not come with language files appropriate for the legacy recognizer.

So, in theory, your problem would resolve itself if you downloaded eng.traineddata from here and put it into Tesseract's tessdata directory instead of whatever you currently have there (most probably you have an eng.traineddata that is around 5MB in size, while the "legacy+new" eng.traineddata is around 30MB).

Another quick hack is to pass extra_cmdline_params='--oem 3' (meaning "use the LSTM engine when possible, or Legacy otherwise") to read_mrz.
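
For illustration, that workaround could look roughly like the sketch below. Only read_mrz and its extra_cmdline_params argument come from PassportEye; the helper names oem_params and read_mrz_with_oem are hypothetical and exist only for this example. The import is done lazily so the snippet loads even where PassportEye is not installed.

```python
def oem_params(mode):
    """Build a Tesseract '--oem' option string (hypothetical helper).

    Tesseract engine modes: 0 = legacy only, 1 = LSTM only,
    2 = legacy + LSTM, 3 = default (whatever is available).
    """
    if mode not in (0, 1, 2, 3):
        raise ValueError("Tesseract OCR engine mode must be 0, 1, 2 or 3")
    return "--oem {}".format(mode)


def read_mrz_with_oem(path, mode=3):
    """Call PassportEye's read_mrz with an explicit engine mode."""
    # Imported here, not at module level, so that merely loading this
    # snippet does not require PassportEye (or Tesseract) to be installed.
    from passporteye import read_mrz
    return read_mrz(path, extra_cmdline_params=oem_params(mode))


# Example call, using the image path from the original report:
# mrz = read_mrz_with_oem('/home/tomtony/Downloads/ICAO_Example.png', mode=3)
```

Passing mode=0 instead would request the legacy engine, which is what 1.3.0 forces and which fails when the legacy data files are missing.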

Given the circumstances, perhaps forcing the use of legacy was not a very user-friendly decision for PassportEye. It is not clear which would be a better resolution for this:

  1. Leave everything as is and write a lengthy explanation in the docs, stating the need to upgrade the training data files or use --oem 3 extra cmdline param. Better quality out of the box for those who have legacy data files or read the docs. Ugly user experience for those who don't.
  2. Use the "oem 3" mode by default (as was the case in 1.2.4) and recommend those who have a legacy engine to pass --oem 0 as an extra cmdline arg. Better user experience for everyone. Worse quality out of the box (because all new tesseracts use a "newer" model by default).
  3. Do some kind of autodetection, where we first run Tesseract on a dummy file to see whether it fails. Better user experience out of the box for everyone, slower recognition times.
  4. Carry the necessary tessdata with PassportEye. Good user experience and no need to do any autodetection, but significantly larger package size (30MB), unclear licensing implications (on the other hand, it would actually be nice to ship a custom-trained model for the OCR-B font anyway).
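
To make option 3 more concrete, here is a rough sketch of what such an autodetection could look like. It is not what PassportEye does. The function names are hypothetical, and it rests on two assumptions: that the tesseract binary exits with a non-zero status when the legacy engine cannot be initialised (as the status code 1 in the traceback above suggests), and that the local Tesseract build can read a PGM image.

```python
import os
import shutil
import subprocess
import tempfile


def legacy_engine_available(tesseract_cmd="tesseract", lang="eng"):
    """Return True if a dummy run of Tesseract with '--oem 0' succeeds."""
    if shutil.which(tesseract_cmd) is None:
        return False  # no such executable on PATH

    # The dummy file: a blank 64x64 grayscale image in binary PGM format.
    fd, path = tempfile.mkstemp(suffix=".pgm")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"P5\n64 64\n255\n" + b"\xff" * (64 * 64))
        proc = subprocess.run(
            [tesseract_cmd, path, "stdout", "--oem", "0", "-l", lang],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # Assumption: a failed engine initialisation gives a non-zero exit code.
        return proc.returncode == 0
    finally:
        os.remove(path)


def pick_oem_params(legacy_ok):
    """Choose the extra command-line params based on the probe result."""
    return "--oem 0" if legacy_ok else "--oem 3"
```

The probe costs one extra Tesseract start-up, which is the "slower recognition times" trade-off mentioned in the list; caching the result per process would limit that to a single run.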

I'll probably go the 2nd way for now (that'll be 1.4.0 then) but opinions are welcome.

konstantint commented 5 years ago

I removed the forced use of the legacy recognizer in 1.4.0. However, if you want noticeably better results, I highly recommend that you install the legacy "traineddata" files and use the --legacy flag with the mrz script or, correspondingly, extra_cmdline_params='--oem 0' with the read_mrz function.
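
For callers who cannot be sure which data files are installed, one possible pattern is to try the legacy engine first and fall back to the default mode if Tesseract refuses to start. The sketch below is only an illustration: read_mrz_prefer_legacy is a hypothetical name, and the reader and ocr_error parameters exist purely so that the logic can be exercised without PassportEye installed. By default it uses PassportEye's read_mrz and the TesseractError class that appears in the traceback above.

```python
def read_mrz_prefer_legacy(path, reader=None, ocr_error=None):
    """Try '--oem 0' first; fall back to '--oem 3' if the OCR call fails."""
    if reader is None:
        # Lazy import: only needed when no replacement reader is supplied.
        from passporteye import read_mrz as reader
    if ocr_error is None:
        from pytesseract.pytesseract import TesseractError as ocr_error

    try:
        return reader(path, extra_cmdline_params="--oem 0")
    except ocr_error:
        # Legacy data files are probably missing; use the default engine.
        return reader(path, extra_cmdline_params="--oem 3")


# Example call, using the image path from the original report:
# mrz = read_mrz_prefer_legacy('/home/tomtony/Downloads/ICAO_Example.png')
```

Note that the fallback silently trades recognition quality for robustness, so logging which branch was taken may be worthwhile in practice.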