miller-center / cpc-issues

Connecting Presidential Collections

Select OCR software #23

Open waldoj opened 10 years ago

waldoj commented 10 years ago

Realistically, we're going to wind up using Tesseract. It has improved a lot in the past few years, and although commercial software is still better, Tesseract has practical advantages: it's scriptable, it runs headless, we can run it across EC2 instances, and we can customize and train it.
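To illustrate what "scriptable" and "headless" could look like in practice, here's a rough sketch that batch-runs the `tesseract` command-line tool over a directory of page images. It assumes the `tesseract` binary is installed and on the `PATH`; the function names, the `*.tif` pattern, and the directory layout are hypothetical and only for illustration, not anything this project has settled on.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argv for: tesseract <image> <output_base> -l <lang>.

    Tesseract appends ".txt" to output_base itself for plain-text output.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(input_dir, output_dir, lang="eng", pattern="*.tif"):
    """Run Tesseract on every matching image (hypothetical helper).

    Returns the list of text files Tesseract is expected to have written.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(input_dir).glob(pattern)):
        output_base = output_dir / image.stem
        # check=True raises CalledProcessError if Tesseract exits non-zero.
        subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
        written.append(Path(str(output_base) + ".txt"))
    return written
```

Since each image is processed independently, the same loop could in principle be split across several machines (e.g. EC2 instances) by giving each one a different slice of the input files.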

The commercial alternatives are Abbyy FineReader and [OmniPage Ultimate](http://www.nuance.com/for-business/by-product/omnipage/ultimate/). Both are Windows-only.

You can see an academic paper's report on a comparison of Tesseract to FineReader. The authors conclude that neither is inherently better in a general sense, but that each has the following strengths and weaknesses: