Realistically, we're going to wind up using Tesseract. It has improved a lot in the past few years, and although commercial software is still better, Tesseract is scriptable, runs headless, can be run across EC2 instances, and can be customized and trained.
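To make "scriptable" and "headless" concrete, here is a minimal sketch of driving the `tesseract` command-line tool from Python. It assumes `tesseract` is installed and on the `PATH` with English language data available. The helper names (`build_command`, `ocr_image`, `ocr_directory`) and the `scans` directory are hypothetical, chosen for illustration only.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(image_path, lang="eng"):
    """Return the argv list for one Tesseract run that writes text to stdout."""
    # "stdout" as the output base tells the tesseract CLI to print the text
    # instead of writing a .txt file next to the image.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on a single image and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout


def ocr_directory(directory, pattern="*.png", workers=4):
    """OCR every matching image in a directory, a few at a time."""
    paths = sorted(Path(directory).glob(pattern))
    # Threads are enough here: the heavy work happens in the tesseract
    # subprocesses, not in the Python interpreter.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr_image, paths))
    return dict(zip(paths, texts))


# Example usage (assumes a hypothetical "scans" directory of PNG pages):
#     for path, text in ocr_directory("scans").items():
#         print(path, len(text.split()), "words")
```

Because each page is an independent subprocess call, the same script could in principle be run on several EC2 instances, each handling a different subset of the pages.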
You can see an academic paper's report on a comparison of Tesseract to FineReader. The authors' conclusion is that neither is inherently better in a general sense, but that each has the following strengths and weaknesses:
- Tesseract doesn't handle complicated or "noisy" page layouts very well.
- FineReader must be trained manually, while Tesseract's training is automatic.
- For good quality pages, Tesseract gives better word-level results (as opposed to letter-level results) than FineReader.
The commercial alternatives are ABBYY FineReader and [OmniPage Ultimate](http://www.nuance.com/for-business/by-product/omnipage/ultimate/). Both are Windows-only.