This example page includes some French accented characters. Initial tests indicate these could be whitelisted into the Tesseract character set. This could be very important, and it would be good to see some examples of the OCR output with and without the whitelist. For the production version, this implies looking at how we configure Tesseract for an OCR run, and whether that can be done dynamically.
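As a starting point for the with/without comparison, here is a minimal sketch of building the whitelist setting dynamically. It assumes Tesseract is driven through a config string such as the one pytesseract accepts; the helper name build_tesseract_config and the exact set of accented characters are illustrative assumptions, not something already in our setup. Note that whitelisting is reported to be ignored by the LSTM engine in Tesseract 4.0, so results may depend on the installed version.

```python
import string

# Accented letters commonly needed for French text (an assumed, extendable set).
FRENCH_ACCENTED = "àâäçéèêëîïôöùûüÿœæÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŸŒÆ"


def build_tesseract_config(extra_chars=""):
    """Return a Tesseract config string that whitelists ASCII letters,
    digits, and any extra characters supplied by the caller."""
    whitelist = string.ascii_letters + string.digits + extra_chars
    # Whitespace, quotes, and backslashes would break when the config string
    # is split into command-line arguments, so reject them here.
    bad = [ch for ch in whitelist if ch.isspace() or ch in "\"'\\"]
    if bad:
        raise ValueError(f"unsupported whitelist characters: {bad!r}")
    return f"-c tessedit_char_whitelist={whitelist}"


if __name__ == "__main__":
    print(build_tesseract_config())                 # without accents
    print(build_tesseract_config(FRENCH_ACCENTED))  # with accents
    # Hypothetical usage, if pytesseract and the French data are installed:
    #   pytesseract.image_to_string(image, lang="fra",
    #                               config=build_tesseract_config(FRENCH_ACCENTED))
```

Running the same region through OCR once with each config string would give the side-by-side examples mentioned above.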
Conversion to Absolute coordinates
When small regions are selected, the hOCR output reports the coordinates of each text box relative to the selected region. These need to be converted to the original page coordinates. For un-rotated imagery this is straightforward: add the coordinates of the region's upper-left (UL) corner to the results. (The resolution of the IIIF call has to be FULL, however, for that to work.) For rotated imagery it is more complicated, because the resulting boxes will no longer be aligned to the original page. (See Google VISION API output format for a method to handle that.)
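The conversion can be sketched as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the region is assumed to have been requested at full resolution (so no scaling is needed), and the rotation is assumed to be clockwise about the region centre with the canvas expanded to fit, which is how the IIIF Image API describes rotation. For the rotated case the result is a four-corner polygon rather than an axis-aligned box.

```python
import math
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_hocr_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError(f"no bbox in title: {title!r}")
    return tuple(int(v) for v in match.groups())


def to_page_bbox(local_bbox, region_x, region_y):
    """Un-rotated case: shift a local box by the region's UL corner."""
    x0, y0, x1, y1 = local_bbox
    return (x0 + region_x, y0 + region_y, x1 + region_x, y1 + region_y)


def to_page_polygon(local_bbox, region, rotation_deg):
    """Rotated case: map a box found in the rotated region image back to
    page coordinates. `region` is (x, y, w, h) on the page. Returns the
    box corners in the order UL, UR, LR, LL (as seen in the rotated image)."""
    rx, ry, rw, rh = region
    theta = math.radians(rotation_deg)
    c, s = math.cos(theta), math.sin(theta)
    # Size of the canvas after rotation (expanded to fit the rotated region).
    rot_w = rw * abs(c) + rh * abs(s)
    rot_h = rw * abs(s) + rh * abs(c)
    x0, y0, x1, y1 = local_bbox
    polygon = []
    for x, y in [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]:
        # Offset from the centre of the rotated canvas.
        dx, dy = x - rot_w / 2, y - rot_h / 2
        # Undo the clockwise rotation (image coordinates, y pointing down).
        ux = c * dx + s * dy
        uy = -s * dx + c * dy
        # Re-centre on the region, then shift to page coordinates.
        polygon.append((rx + rw / 2 + ux, ry + rh / 2 + uy))
    return polygon


if __name__ == "__main__":
    box = parse_hocr_bbox("bbox 10 20 110 45; x_wconf 93")
    print(to_page_bbox(box, 500, 1200))
    print(to_page_polygon((40, 0, 50, 10), (500, 1200, 100, 50), 90))
```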
Box to form input
When going from a set of Tesseract boxes to a form field, multiple boxes will often be used to fill one form field. The form input format will need a way to record that relationship.
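One possible shape for that relationship is sketched below. The class and attribute names (OcrBox, FormField, boxes) are hypothetical and only illustrate the idea of a field keeping references to every box that contributed to it; the real form input format is still to be decided.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class OcrBox:
    """One recognised text box, in page coordinates."""
    text: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass
class FormField:
    """A form field filled from one or more OCR boxes."""
    name: str
    boxes: List[OcrBox] = field(default_factory=list)

    @property
    def value(self) -> str:
        # The field value is the box texts joined in the order they were added.
        return " ".join(box.text for box in self.boxes)

    @property
    def bbox(self) -> Optional[Tuple[int, int, int, int]]:
        # Union of all source boxes, or None if the field is still empty.
        if not self.boxes:
            return None
        return (
            min(b.bbox[0] for b in self.boxes),
            min(b.bbox[1] for b in self.boxes),
            max(b.bbox[2] for b in self.boxes),
            max(b.bbox[3] for b in self.boxes),
        )


if __name__ == "__main__":
    producer = FormField("producer")
    producer.boxes.append(OcrBox("Château", (510, 1220, 610, 1245)))
    producer.boxes.append(OcrBox("Margaux", (620, 1222, 730, 1246)))
    print(producer.value, producer.bbox)
```

Keeping the source boxes on the field, rather than only the joined text, means the form can still highlight the original regions on the page.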
Google Vision API output format
It has been a while since we used the Google Vision API, but the original API allowed you to do a number of searches, including text. For example, this White Wine Label was scanned with the Google Vision API. The google-vision file shows the format they use. Notice in this example that the LL -> LR line describes the orientation, and that rotated text can be easily combined in this format.
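To illustrate how the LL -> LR line gives the orientation, here is a small sketch that reads a bounding polygon and returns the text angle. It assumes the vertices arrive in the order UL, UR, LR, LL relative to the text's reading direction, and that zero-valued x or y keys may be missing from the JSON; both are assumptions based on how the Vision API responses have looked to us, and the function name is hypothetical.

```python
import math


def text_angle_degrees(vertices):
    """Angle of the LL -> LR edge, in degrees. With image coordinates
    (y pointing down), a positive angle means clockwise rotation."""
    if len(vertices) != 4:
        raise ValueError("expected four vertices: UL, UR, LR, LL")
    # Missing keys are treated as 0.
    points = [(v.get("x", 0), v.get("y", 0)) for v in vertices]
    lr, ll = points[2], points[3]
    return math.degrees(math.atan2(lr[1] - ll[1], lr[0] - ll[0]))


if __name__ == "__main__":
    upright = [{"y": 10}, {"x": 100, "y": 10}, {"x": 100, "y": 30}, {"y": 30}]
    sideways = [{"x": 100}, {"x": 100, "y": 50}, {"x": 90, "y": 50}, {"x": 90}]
    print(text_angle_degrees(upright))
    print(text_angle_degrees(sideways))
```

Because every box is a polygon with its own orientation, boxes from rotated regions can sit in the same list as upright ones without any special casing.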