Table Extraction Benchmark Usecase

Hope this finds you well. During my recent deep dive into OCR tools, I ran into a conundrum. There's a whole galaxy of them: tabula, doctr, textract, paddle, donut, and the list goes on. Each has its own merits, but how do we measure them objectively?

I've been toying with an idea: writing an open-source OCR benchmark system.

Here's a breakdown of what the project aims to achieve, in a tentative order of implementation:

Initial milestone:

Curate some public data sets.
Create a pip-friendly CLI.
Automate OCR installation.
Test OCR quality (word error rate, WER) and OCR speed.
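
To make the quality and speed goal a bit more concrete, here's a rough sketch of how the two measurements might look. It assumes WER means word-level edit distance divided by the number of reference words, and that each OCR tool can be wrapped as a plain callable taking a file path and returning text. The names `word_error_rate`, `timed_ocr`, and `engine` are placeholders I made up for illustration, not an existing API in any of the tools above.

```python
import time
from typing import Callable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is empty as well.
        return 0.0 if not hyp else 1.0
    # prev[j] holds the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)


def timed_ocr(engine: Callable[[str], str], path: str) -> Tuple[str, float]:
    """Run a hypothetical OCR callable and return (text, elapsed seconds)."""
    start = time.perf_counter()
    text = engine(path)
    return text, time.perf_counter() - start
```

A real benchmark would probably normalize case, punctuation, and whitespace before scoring, and repeat the timing several times per document. I've left both out to keep the sketch short.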

Expansion push:

Allow users to test their own data.
Use Python notebooks for visually appealing reports.
Keep the report notebooks clean, with a thoughtful reporting API.

Stretch goals:

Run tests on a series of OCR releases: who's improving, and who has frequent regressions?
Meta-analysis; interpret results from multiple experiments.
Evaluate textract's cost-effectiveness for users' own use cases (WER-Delta-Per-Dollar-Vs-Doctr) 🤡
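
For the tongue-in-cheek WER-Delta-Per-Dollar-Vs-Doctr metric, one possible reading is "how many points of WER does each dollar spent on the paid service buy compared with the free baseline". The sketch below assumes exactly that definition; the function name and parameters are hypothetical, and it ignores the compute cost of running the baseline.

```python
def wer_delta_per_dollar(baseline_wer: float, paid_wer: float, paid_cost_usd: float) -> float:
    """WER improvement of the paid tool over the baseline, per dollar spent.

    Positive values mean the paid tool is more accurate than the baseline;
    negative values mean paying actually made things worse.
    """
    if paid_cost_usd <= 0:
        raise ValueError("paid_cost_usd must be positive")
    return (baseline_wer - paid_wer) / paid_cost_usd


# Placeholder numbers for illustration only, not measurements of any tool.
score = wer_delta_per_dollar(baseline_wer=0.5, paid_wer=0.25, paid_cost_usd=2.0)
```

Both WER values would need to come from the same dataset for the comparison to mean anything.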

Now, I'm floating this to you because I respect your acumen and think you could be the catalyst to take this from concept to reality. But here's the catch, and it's a significant one for me: I'm a die-hard advocate for keeping this venture firmly rooted in the open-source ethos, specifically under the AGPL. The repo is currently MIT, and I'd be keen on transitioning it to AGPL.

If this aligns with your principles and you're up for a challenge, then let’s talk! If not, no hard feelings. It's crucial we're on the same wavelength from the get-go.

Eager to hear your thoughts!

katanaml / sparrow

Table Extraction Benchmark Usecase #28