IsmailM opened this issue 2 years ago
HVFES works (or should) as follows. VF reports are of 3 layout types; we are mostly interested in V3. For HVFES to detect the V3 format, the input JPG needs:

- to be 1400px wide, which implies a JPG at 300dpi;
- `Date of Birth` to be detected (so the big square covering that field won't help).

A 200dpi scan could work for general extraction, but it will be deemed V1. A 400dpi scan is, believe it or not, worse!
Apparently, tesseract has issues with black boxes (redactions) near text, depending on the dpi resolution.
For example (for the processed header slice in B&W):

```shell
$ tesseract f5.jpg f5 --dpi 300 && grep -i date f5.txt  # Won't work!
# but
$ tesseract f5.jpg f5 --dpi 200 && grep -i date f5.txt  # is fine
Date of Birth:
```
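To automate this workaround, the two invocations above could be folded into a small retry loop. This is a hypothetical helper, not part of HVFES: it assumes `tesseract` is on the PATH, and the `run` parameter exists only so the subprocess call can be stubbed out in tests.

```python
import subprocess

def ocr_with_dpi_fallback(image_path, out_base, needle="Date of Birth",
                          dpis=(300, 200, 150), run=subprocess.run):
    # Try a few --dpi hints in order; near black redaction boxes, a
    # lower dpi sometimes extracts text that a higher one misses.
    text = ""
    for dpi in dpis:
        run(["tesseract", image_path, out_base, "--dpi", str(dpi)],
            check=True)
        with open(out_base + ".txt") as fh:
            text = fh.read()
        if needle.lower() in text.lower():
            return dpi, text
    return None, text
```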
If I use the Mac Preview redact tool or white boxes (instead of black), results are much better.

`header_slice` from an image redacted with Mac Preview:
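The same effect could be achieved programmatically before OCR, by painting known redaction rectangles white instead of black. A minimal numpy sketch (the helper name and the box format are mine, not HVFES):

```python
import numpy as np

def whiten_redactions(gray_image, boxes):
    # boxes: iterable of (top, bottom, left, right) pixel coordinates
    # of the redacted regions; fill them with white (255) so tesseract
    # isn't confused by solid black blocks next to text.
    out = gray_image.copy()
    for top, bottom, left, right in boxes:
        out[top:bottom, left:right] = 255
    return out
```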
Any image whose width is less than 2500px
will be resized to that minimum width. In my tests I don't see the need for that if we stick with modern scans.
In `hvf_object.py`:

```python
# First, need to upscale image if its too low resolution (important for older HVF
# images). Min width is a bit arbitrary but is close to ~300ppi
width = np.size(hvf_image, 1)
MIN_HVF_WIDTH = 2500
```
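For reference, the resize that this check guards could look roughly like the following. This is a pure-numpy nearest-neighbour stand-in under my own function name; the real script resizes differently:

```python
import numpy as np

MIN_HVF_WIDTH = 2500  # same threshold as in hvf_object.py

def upscale_to_min_width(img, min_width=MIN_HVF_WIDTH):
    # Nearest-neighbour upscale so that width >= min_width, preserving
    # the aspect ratio; a no-op for wide-enough (modern) scans.
    height, width = img.shape[:2]
    if width >= min_width:
        return img
    scale = min_width / width
    rows = (np.arange(int(round(height * scale))) / scale).astype(int)
    cols = (np.arange(min_width) / scale).astype(int)
    return img[rows][:, cols]
```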
The image is turned to gray, layout detection is attempted, and then the image is converted to B&W for text extraction (`get_header_metadata_from_hvf_image()`).
Layout detection is important because HVFES splits the images into several sections (to optimise OCR, according to the authors).
The header is split into 3 or 4 parts (depending on the layout):

- `header_slice_image1` (for V2 and V3)
- Middle:
  - `header_slice_image2` (only V2)
  - `header_slice_image3` (only V2)
  - `header_slice_image_middle` (only for V3)
- `header_slice_image4` (for V2 and V3)
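The layout-dependent slicing can be pictured as a lookup from layout to crop regions. The fractional boxes below are made-up placeholders purely to illustrate the dispatch; the real coordinates live in `hvf_object.py`:

```python
import numpy as np

# Hypothetical fractional (top, bottom) crop boxes per layout; NOT the
# real values, which live in hvf_object.py.
HEADER_BOXES = {
    "v2": {"header_slice_image1": (0.00, 0.05),
           "header_slice_image2": (0.05, 0.10),
           "header_slice_image3": (0.10, 0.15),
           "header_slice_image4": (0.15, 0.20)},
    "v3": {"header_slice_image1": (0.00, 0.06),
           "header_slice_image_middle": (0.06, 0.14),
           "header_slice_image4": (0.14, 0.20)},
}

def slice_header(hvf_image, layout):
    # Crop only the header slices that exist for the detected layout;
    # applying the wrong layout crops the wrong regions.
    h = hvf_image.shape[0]
    return {name: hvf_image[int(h * top):int(h * bottom)]
            for name, (top, bottom) in HEADER_BOXES[layout].items()}
```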
As can be seen, applying the wrong layout will hinder the extraction.
Other slices, used in `get_metric_metadata_from_hvf_image()`, according to layout:

- `dev_val_slice_image` (V2)
- `dev_val_slice_image` (V3)
Most changes are being made in `visual_fields_extraction/hvf_extraction_script/hvf_data/hvf_object.py`, where I added the new key fields for extraction.

To improve the efficiency of the new field detection, we should use a list of choices for each new field, assuming those fields have restricted choices:
- `Gaze/Blind Spot` (what else?)
- `Central` (what else?)
- `III, White` (OCR usually gets "Ill, White") (what else?)
- `31.5 ASB` (what else?)

We need to improve layout detection. So, first, we cannot redact the words "Date of Birth:" (but we must redact the DD/MM/YYYY field). If we want gender, the same applies.
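With such restricted choice lists, common OCR confusions like "Ill, White" could be snapped back to the nearest allowed value with stdlib fuzzy matching. A sketch (the choice lists shown in the test are examples, not exhaustive):

```python
import difflib

def snap_to_choice(ocr_text, choices, cutoff=0.6):
    # Map raw OCR output onto the closest allowed value; fall back to
    # the raw text if nothing is similar enough.
    matches = difflib.get_close_matches(ocr_text, choices, n=1,
                                        cutoff=cutoff)
    return matches[0] if matches else ocr_text
```

For example, `snap_to_choice("Ill, White", ["III, White", "V, White"])` recovers `"III, White"`.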
I'm investigating how to make tesseract handle black boxes better. I'm also tweaking the code to help me debug it.
I've done as much as I could for now. The `Date of Birth` key cannot be redacted.

Things we can do:

- Use the `version` at the bottom of the reports.
- Try `aws rekognition`?

We need more reports to test. And if we're ever going to use the V2 layout, we need several examples as well.
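If we go the version route, the footer text (once OCR'd) could be parsed with a simple regex. Hypothetical helper; the exact footer wording on the reports would need to be checked:

```python
import re

def parse_report_version(footer_text):
    # Look for something like "Version 6.2" or "version: 4.1.2" in the
    # OCR'd footer; return None if nothing matches.
    m = re.search(r"[Vv]ersion\s*:?\s*(\d+(?:\.\d+)+)", footer_text)
    return m.group(1) if m else None
```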
The following fields are not currently being extracted:
This should be implemented as something like the following:
https://github.com/msaifee786/hvf_extraction_script/blob/e978747233887322e66fd7537b7269eb00be1d55/hvf_extraction_script/hvf_data/hvf_object.py#L1055-L1071