phenopolis / visual_fields_extraction

Python scripting framework for extracting data from Humphrey Visual Fields reports
GNU General Public License v3.0

Add support for extracting more fields #2

Open IsmailM opened 2 years ago

IsmailM commented 2 years ago

The following fields are not currently being extracted:

This could be implemented along the lines of the following:

https://github.com/msaifee786/hvf_extraction_script/blob/e978747233887322e66fd7537b7269eb00be1d55/hvf_extraction_script/hvf_data/hvf_object.py#L1055-L1071

alanwilter commented 2 years ago

On how the HVF Extraction Script (HVFES) works (or should work).

Layout Detection

VF reports come in 3 layout types; we are mostly interested in V3. For HVFES to detect the V3 format, the input JPG needs to meet two conditions:

  1. Width > 1400px, which implies a 300dpi JPG;
  2. The words "Date of Birth" must be detectable (so the big square covering that field won't help).

Screenshot 2022-07-16 at 11 41 38

A 200dpi scan could work for general extraction, but it will be deemed V1. A 400dpi scan is, believe it or not, worse!
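The two conditions above can be sketched as a small predicate. This is a hypothetical helper, not HVFES code: `looks_like_v3`, `MIN_V3_WIDTH`, and the idea of passing the OCR text in as a string are all assumptions for illustration.

```python
MIN_V3_WIDTH = 1400  # px; a width above this roughly implies a 300 dpi scan

def looks_like_v3(width_px: int, ocr_text: str) -> bool:
    """Hypothetical V3 check: the scan must be wide enough AND the
    'Date of Birth' label must survive OCR (a black redaction box
    over that label defeats the second condition)."""
    return width_px > MIN_V3_WIDTH and "Date of Birth" in ocr_text
```

In the real pipeline the `ocr_text` would come from running tesseract over the header region.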

Layout Detection Issues

Apparently, tesseract has some issues with black boxes (redactions) near text, and with dpi resolution.

For example (for a processed header slice in B&W):

Screenshot 2022-07-16 at 15 22 35
$ tesseract f5.jpg f5 --dpi 300 && grep -i date f5.txt # Won't work!

# but

$ tesseract f5.jpg f5 --dpi 200 && grep -i date f5.txt # is fine
Date of Birth:

If I use the Mac Preview redact tool or white boxes (instead of black), the results are much better.

header_slice from an image redacted with Mac Preview: h1
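Since white boxes OCR better than black ones, one option is to whiten large solid-black regions before handing the image to tesseract. A minimal sketch, assuming `scipy` is available; the threshold and minimum-area values are made-up tuning parameters, not values from HVFES:

```python
import numpy as np
from scipy import ndimage

def whiten_redactions(gray: np.ndarray, black_thresh: int = 30,
                      min_area: int = 500) -> np.ndarray:
    """Replace large solid-black blobs (redaction boxes) with white,
    leaving small dark marks (i.e. text) untouched.

    gray: 2-D uint8 grayscale image. Thresholds are assumptions.
    """
    out = gray.copy()
    mask = gray < black_thresh            # candidate "black" pixels
    labels, n = ndimage.label(mask)       # connected components
    for i in range(1, n + 1):
        region = labels == i
        if region.sum() >= min_area:      # big enough to be a box
            out[region] = 255             # white it out
    return out
```

Text glyphs are connected components far smaller than a redaction rectangle, which is why a simple area cutoff is enough to separate the two.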

Image Processing

Any image whose width is less than 2500px will be resized up to that minimum width. In my tests, I don't see the need for this if we stick with modern scans.

In hvf_object.py:

        # First, need to upscale image if its too low resolution (important for older HVF
        # images). Min width is a bit arbitrary but is close to ~300ppi
        width = np.size(hvf_image, 1)
        MIN_HVF_WIDTH = 2500

The image is converted to grayscale, layout detection is attempted, and the image is then converted to B&W for text extraction (def get_header_metadata_from_hvf_image()).
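The upscale-then-binarize preprocessing described above can be sketched in a few lines. This is a simplification for illustration: HVFES does proper image resizing, whereas the integer nearest-neighbour upscaling via `np.kron` here is just the shortest dependency-free stand-in; `preprocess_for_ocr` is a hypothetical name.

```python
import numpy as np

MIN_HVF_WIDTH = 2500  # same constant as in hvf_object.py

def preprocess_for_ocr(gray: np.ndarray, thresh: int = 128) -> np.ndarray:
    """Upscale narrow scans to the minimum width, then binarize to B&W.

    gray: 2-D uint8 grayscale image.
    """
    width = np.size(gray, 1)
    if width < MIN_HVF_WIDTH:
        factor = -(-MIN_HVF_WIDTH // width)  # ceiling division
        # Nearest-neighbour integer upscale (simplified stand-in).
        gray = np.kron(gray, np.ones((factor, factor), dtype=gray.dtype))
    # Global threshold to pure black/white for tesseract.
    return np.where(gray > thresh, 255, 0).astype(np.uint8)
```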

Layout detection is important because HVFES splits the image into several sections (to optimise OCR, according to the authors).

Header is split in 3 or 4 parts (depending on the layout):

  1. header_slice_image1 (for V2 and V3)

    Screenshot 2022-07-16 at 16 46 40

  2. Middle

  3. header_slice_image4 (for V2 and V3)

    Screenshot 2022-07-16 at 16 42 27

As can be seen, applying the wrong layout will hinder the extraction.
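Conceptually, each slice is just a numpy crop of the report at layout-specific coordinates. The sketch below is illustrative only: the fractional boundaries are made up, while HVFES hard-codes its own per-layout slice coordinates in hvf_object.py.

```python
import numpy as np

def slice_header(image: np.ndarray, layout: str) -> list:
    """Cut the header band into per-layout column slices.

    Boundaries here are invented for illustration, not HVFES's real ones.
    """
    h, w = image.shape[:2]
    header = image[: h // 5]              # assume header is the top fifth
    if layout in ("v2", "v3"):
        bounds = (0, w // 3, 2 * w // 3, w)   # three columns
    else:                                  # v1
        bounds = (0, w // 2, w)               # two columns
    return [header[:, bounds[i]:bounds[i + 1]]
            for i in range(len(bounds) - 1)]
```

Running OCR on each narrow slice separately is what makes the layout guess matter: slicing a V1 report with V3 boundaries cuts fields in half.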

Other slices, used in def get_metric_metadata_from_hvf_image(), depending on the layout:

  1. dev_val_slice_imageV2

    Screenshot 2022-07-16 at 16 38 33
  2. dev_val_slice_imageV3

    Screenshot 2022-07-16 at 16 39 25

Most changes are being made in visual_fields_extraction/hvf_extraction_script/hvf_data/hvf_object.py, where I added the new key fields for extraction.

alanwilter commented 2 years ago

To improve the efficiency of the new field detection, we should use a list of allowed choices for each new field, assuming those fields have restricted values:
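One way to exploit a restricted-choice field is to snap the noisy OCR reading onto the closest allowed value with stdlib fuzzy matching. A minimal sketch: the field names and value lists in `FIELD_CHOICES` are placeholders, not the actual fields being added, and the 0.6 cutoff is an assumption.

```python
import difflib

# Placeholder choice lists for illustration; the real ones would come
# from the HVF report specification.
FIELD_CHOICES = {
    "fixation_monitor": ["Gaze/Blind Spot", "Blind Spot", "Off"],
    "stimulus": ["III, White", "V, White"],
}

def snap_to_choice(field: str, ocr_value: str) -> str:
    """Map a noisy OCR reading to the closest allowed value,
    falling back to the raw reading if nothing is close enough."""
    choices = FIELD_CHOICES.get(field, [])
    match = difflib.get_close_matches(ocr_value, choices, n=1, cutoff=0.6)
    return match[0] if match else ocr_value
```

This turns common OCR confusions (O/0, I/l, stray punctuation) into non-issues, since only the nearest legal value is kept.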

alanwilter commented 2 years ago

We need to improve layout detection. So, first, we cannot redact the words "Date of Birth:" (but we must redact the DD/MM/YYYY value itself).

If we want gender, the same applies.

I'm investigating how to make tesseract handle black boxes better. I'm also tweaking the code to help me debug it.

alanwilter commented 2 years ago

I've done as much as I could for now.

Things we can do:

We need more reports to test with. And if we're ever going to use the V2 layout, we need several examples of it as well.