Yes! I would love that...
Right now I'm testing several Tesseract-based apps to find the best base whose source code I could modify to output an hOCR version of the recognized text.
The OCR itself works quite well, but the text output is really bad for bills, where the white space has to be kept. OCR apps generally just concatenate the text from Tesseract's boxes without keeping the position information, so the resulting "paragraph" mess is not easily usable for bills.
With hOCR output, an external app could position the text back to recover the lines and columns of a bill, or of any layout that is not a book page.
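To make the idea concrete, here is a rough sketch in Python of what such an external app could do. It assumes the hOCR marks each word as a span with class `ocrx_word` and a `title` containing `bbox x0 y0 x1 y1`, which is what Tesseract's own hOCR output looks like as far as I know. The names `HocrWords`, `group_rows`, `layout`, and `hocr_to_text`, and the `char_width` and `tolerance` values, are my own placeholders and would need tuning to the scan resolution.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in (a.get("class") or "").split():
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((*self._bbox, text))
            self._bbox = None


def group_rows(words, tolerance=10):
    """Group words whose vertical centers lie within `tolerance` pixels."""
    rows = []
    for w in sorted(words, key=lambda w: (w[1] + w[3]) / 2):
        cy = (w[1] + w[3]) / 2
        if rows and abs(cy - rows[-1][0]) <= tolerance:
            rows[-1][1].append(w)
        else:
            rows.append((cy, [w]))
    return [r[1] for r in rows]


def layout(words, char_width=10, tolerance=10):
    """Place each word at column x0 // char_width of a plain-text line."""
    lines = []
    for row in group_rows(words, tolerance):
        line = ""
        for x0, _, _, _, text in sorted(row, key=lambda w: w[0]):
            col = x0 // char_width
            if line:
                # Pad to the target column, keeping at least one space.
                line = line.ljust(max(col, len(line) + 1))
            else:
                line = " " * col
            line += text
        lines.append(line)
    return "\n".join(lines)


def hocr_to_text(hocr, char_width=10, tolerance=10):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return layout(parser.words, char_width, tolerance)
```

Calling `hocr_to_text(open("bill.hocr", encoding="utf-8").read())` on a file (the file name is just an example) would give back plain text where the amounts stay in their own column instead of being glued to the item names. The row grouping is deliberately naive: skewed scans or mixed font sizes would need something smarter.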