10-16 Meeting notes

DerekMaggio commented 5 years ago

TODO

Tesseract Whitelisting

This example page includes some French characters (accents). Initial tests indicate these could be whitelisted into the Tesseract character set. This could be very important, and it would be nice to see some examples of the OCR output with and without the whitelist. For the production version, this would mean looking at how we configure Tesseract for an OCR run, and whether that could be done dynamically.
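A minimal sketch of how a with/without comparison could be set up, assuming the pytesseract wrapper and Pillow are what we'd call Tesseract through (neither is confirmed in these notes). `build_whitelist_config` and `ocr_with_and_without` are hypothetical helper names, and the accent list is only a starting guess. `tessedit_char_whitelist` is a real Tesseract variable, but whether it is honored depends on the Tesseract version and engine mode, so the results need checking on our install.

```python
import string

# Accented characters common in French text (assumed list; extend as needed).
FRENCH_ACCENTS = "àâäçéèêëîïôöùûüÿœæÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŸŒÆ"


def build_whitelist_config(extra_chars=FRENCH_ACCENTS):
    """Build a Tesseract config string that whitelists ASCII plus extra_chars."""
    base = string.ascii_letters + string.digits + ".,%-"
    # De-duplicate while keeping the original order.
    chars = "".join(dict.fromkeys(base + extra_chars))
    return "-c tessedit_char_whitelist=" + chars


def ocr_with_and_without(image_path):
    """Run OCR twice so the two outputs can be compared side by side."""
    # Imported here so the config helper works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    plain = pytesseract.image_to_string(image)
    whitelisted = pytesseract.image_to_string(
        image, config=build_whitelist_config()
    )
    return plain, whitelisted
```

Because the config string is built per call, the whitelist could be chosen dynamically (for example per label language) rather than baked into a config file.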

Conversion to Absolute coordinates

When small regions are selected, the hOCR output responds with the local coordinates of the text box. These need to be converted to the original page coordinates. For un-rotated imagery this is pretty straightforward: you just add the coordinates of the UL corner to the results. (The resolution in the IIIF call has to be FULL, however, for that to work.) For rotated imagery it's more complicated: the resultant boxes will no longer be aligned to the original page. (See Google VISION API Output for a method to handle that.)
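A rough sketch of both cases, under stated assumptions: boxes come from the hOCR `bbox x0 y0 x1 y1` title property, the region was fetched at full resolution, and (for the rotated case) the crop was rotated clockwise about its center with the canvas enlarged to fit. The function names are hypothetical, and the rotation convention should be checked against what the image server actually does.

```python
import math
import re

# hOCR stores each box in a title attribute such as "bbox 12 30 96 58".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_hocr_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError("no bbox in title: %r" % title)
    return tuple(int(v) for v in match.groups())


def to_page_bbox(local_bbox, region_ul):
    """Un-rotated case: shift a region-local box by the region's UL corner."""
    x0, y0, x1, y1 = local_bbox
    ul_x, ul_y = region_ul
    return (x0 + ul_x, y0 + ul_y, x1 + ul_x, y1 + ul_y)


def to_page_polygon(local_bbox, region_ul, region_size, degrees):
    """Rotated case: map a box found in the rotated crop to four page points.

    Assumes the crop (region_size = width, height before rotation) was rotated
    clockwise by `degrees` about its center onto an enlarged canvas.
    """
    w, h = region_size
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Size of the enlarged canvas that holds the rotated crop.
    rot_w = abs(w * cos_t) + abs(h * sin_t)
    rot_h = abs(w * sin_t) + abs(h * cos_t)
    x0, y0, x1, y1 = local_bbox
    polygon = []
    for x, y in [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]:
        # Re-center on the rotated canvas, undo the rotation, re-center on the crop.
        dx, dy = x - rot_w / 2, y - rot_h / 2
        ux = dx * cos_t + dy * sin_t + w / 2
        uy = -dx * sin_t + dy * cos_t + h / 2
        polygon.append((round(ux + region_ul[0]), round(uy + region_ul[1])))
    return polygon
```

The rotated case returns four points rather than a box, since the result is generally not axis-aligned on the page. That matches the four-point bounds style in the Google Vision output.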

Box to form input

When going from a set of Tesseract boxes to a form field, multiple boxes will often be used to fill one form field. The form input format will need a way to represent that relationship.
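One possible shape for that relationship, purely as a sketch: each form field keeps an ordered list of the box ids it was filled from. The class and field names here (`OcrBox`, `FormField`, `box_ids`) are hypothetical and not an agreed format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class OcrBox:
    box_id: str
    text: str
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1 in page coordinates


@dataclass
class FormField:
    name: str
    box_ids: List[str] = field(default_factory=list)  # ordered source boxes

    def value(self, boxes_by_id: Dict[str, OcrBox]) -> str:
        """Join the text of the referenced boxes, in order, into one value."""
        return " ".join(boxes_by_id[b].text for b in self.box_ids)


# Illustrative data only; the coordinates are made up.
boxes = {
    "b1": OcrBox("b1", "PRODUCT", (0, 0, 10, 10)),
    "b2": OcrBox("b2", "OF", (12, 0, 16, 10)),
    "b3": OcrBox("b3", "FRANCE", (18, 0, 30, 10)),
}
origin = FormField(name="origin", box_ids=["b1", "b2", "b3"])
print(origin.value(boxes))
```

Storing ids rather than copied text keeps the link back to the page, so the boxes behind a field can still be highlighted or re-OCR'd later.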

Google VISION API output format

It's been a while since we've used the Google Vision API, but the original API allowed you to do a number of searches, including text. For example, this white wine label was scanned with the google-vision API. The google-vision file shows the format they use. Notice in this example that the LL -> LR line describes the orientation, and that rotated text can be easily combined in this format.

{
  "text": [
    {
      "desc": "Dae 1939.\nDae , s\", is 39,\nk]\nPROPRIETAIRES-NEGOCI ANTS\nTAIN-L,HERMITAGE\nPRODUCT OF FRANCE\nF-P.5,\n1874\nALCO HOL 13 % BY VOL.\n& urs\n[R\n",
      "bounds": []
    },
    {
      "desc": "1874",
      "bounds": [
        {
          "x": 1692,
          "y": 640
        },
        {
          "x": 1747,
          "y": 645
        },
        {
          "x": 1732,
          "y": 816
        },
        {
          "x": 1677,
          "y": 811
        }
      ]
    },
    {
      "desc": "HERMITAGE",
      "bounds": [
        {
          "x": 1183,
          "y": 1058
        },
        {
          "x": 1475,
          "y": 1056
        },
        {
          "x": 1475,
          "y": 1079
        },
        {
          "x": 1183,
          "y": 1081
        }
      ]
    },
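A small sketch of reading this format, assuming the four `bounds` points are ordered UL, UR, LR, LL (which is what the two entries above look like, but is not confirmed here). `baseline_angle` and `enclosing_bbox` are hypothetical helper names.

```python
import math


def baseline_angle(bounds):
    """Angle in degrees of the LL -> LR edge.

    With the y axis pointing down, a positive angle means the text is
    tilted clockwise on screen. Assumes point order UL, UR, LR, LL.
    """
    ll, lr = bounds[3], bounds[2]
    return math.degrees(math.atan2(lr["y"] - ll["y"], lr["x"] - ll["x"]))


def enclosing_bbox(bounds_list):
    """Axis-aligned (x0, y0, x1, y1) box around several four-point bounds."""
    xs = [p["x"] for bounds in bounds_list for p in bounds]
    ys = [p["y"] for bounds in bounds_list for p in bounds]
    return (min(xs), min(ys), max(xs), max(ys))


# The two entries from the excerpt above.
bounds_1874 = [
    {"x": 1692, "y": 640}, {"x": 1747, "y": 645},
    {"x": 1732, "y": 816}, {"x": 1677, "y": 811},
]
bounds_hermitage = [
    {"x": 1183, "y": 1058}, {"x": 1475, "y": 1056},
    {"x": 1475, "y": 1079}, {"x": 1183, "y": 1081},
]

print(round(baseline_angle(bounds_1874), 2))
print(enclosing_bbox([bounds_1874, bounds_hermitage]))
```

Because every entry is four points in page coordinates, boxes at different rotations can be combined without first converting them to a common orientation.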

qjhart commented 5 years ago

@DerekMaggio, these TODOs can become Issues.

DerekMaggio commented 5 years ago

Google Vision API requirement has been added to issue #10

DerekMaggio commented 5 years ago

Tesseract Whitelisting Issue: #24

DerekMaggio commented 5 years ago

Box to form Issue: #25

DerekMaggio commented 5 years ago

Conversion to Absolute Coordinates: #26

ucd-library / csus-sp-2018-app

10-16 Meeting notes #22
