qco / bookliberator

Free and open source software to liberate text from treeware.
GNU Affero General Public License v3.0

Identify a dewarping algorithm / library #2

Open slifty opened 4 years ago

slifty commented 4 years ago

One of the exports of this project is a de-warped version of each page photographed. This will be used to improve the OCR as well.

This is not out-of-the-box functionality! Let's find some existing libraries and algorithms that do this (ideally in Java, but if there is no other choice it might be OK to have it in another language and find a way to run it from within the app).

This issue is ultimately a research issue, to capture and log resources as I find them.

slifty commented 4 years ago

There does not appear to be a perfect solution for de-warping, and that will be a risk to the project overall (but one we will better understand once the MVP is complete). The two risks are:

  1. Computational intensity (since this is for a mobile device)
  2. Quality of the final result.

Here is a short thread on the DIY Bookscanner forums of someone who appears to have tried to make exactly what we're talking about making here. They are pointed to a few resources, though I am a bit wary of going down the path of completely implementing something from scratch based on academic papers.

That thread does note that there is no single way to do it because it is really just a heuristic. Even the best algorithms produce odd or bogus results fairly often, especially on pages where the typography does not match the algorithm's assumptions (say, on a map or title page).

Which does raise my concerns -- we may find that some pages simply cannot be reliably scanned / dewarped. Again, MVP will expose that challenge.

Approach A: Modify an existing algorithm

This guide from 2016 and the accompanying code offer what appears to be a fairly compelling de-warping algorithm, though it is in Python. This algorithm takes around 30 seconds to de-warp a page on a 2012 MacBook Pro.

Approach B: OpenCV

There are a few projects (such as OpenNoteScanner) which appear to use OpenCV to handle de-warping.

This article from 2014 shows an example in Python which could be modified to Java.
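
To make the OpenCV route a bit more concrete: as far as I can tell, scanner-style apps usually detect the four page corners and then apply a perspective transform (OpenCV exposes this as `getPerspectiveTransform` and `warpPerspective`). Below is a minimal sketch of just the math behind that step in plain Java, so it runs without OpenCV. The class and method names (`PerspectiveSketch`, `homography`, `apply`) are hypothetical, the corner coordinates are made up, and this only corrects flat perspective distortion. It does not handle page curvature, so it is a starting point rather than a full de-warping algorithm.

```java
public final class PerspectiveSketch {

    /**
     * Solves for the 8 homography coefficients that map each src[i] to dst[i].
     * Both arrays hold exactly 4 points as {x, y}. Assumption: corners are
     * already detected and listed in the same order in both arrays.
     */
    public static double[] homography(double[][] src, double[][] dst) {
        double[][] a = new double[8][]; // augmented matrix [A | b]
        for (int i = 0; i < 4; i++) {
            double x = src[i][0], y = src[i][1];
            double u = dst[i][0], v = dst[i][1];
            a[2 * i]     = new double[] {x, y, 1, 0, 0, 0, -x * u, -y * u, u};
            a[2 * i + 1] = new double[] {0, 0, 0, x, y, 1, -x * v, -y * v, v};
        }
        return solve(a);
    }

    /** Gaussian elimination with partial pivoting on an n x (n+1) augmented matrix. */
    private static double[] solve(double[][] a) {
        int n = a.length;
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            if (Math.abs(a[pivot][col]) < 1e-12) {
                throw new IllegalArgumentException("Degenerate corner configuration");
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= n; c++) a[r][c] -= f * a[col][c];
            }
        }
        // Back substitution.
        double[] h = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double sum = a[r][n];
            for (int c = r + 1; c < n; c++) sum -= a[r][c] * h[c];
            h[r] = sum / a[r][r];
        }
        return h;
    }

    /** Maps one point (x, y) through the homography h. */
    public static double[] apply(double[] h, double x, double y) {
        double w = h[6] * x + h[7] * y + 1.0;
        return new double[] {
            (h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w
        };
    }

    public static void main(String[] args) {
        // Made-up corners of a photographed page, mapped to an 800 x 1200 rectangle.
        double[][] src = {{120, 80}, {880, 140}, {940, 1300}, {60, 1240}};
        double[][] dst = {{0, 0}, {800, 0}, {800, 1200}, {0, 1200}};
        double[] h = homography(src, dst);
        double[] p = apply(h, 500, 700); // some point inside the page
        System.out.printf("(500, 700) maps to (%.1f, %.1f)%n", p[0], p[1]);
    }
}
```

In a real implementation the per-pixel resampling would be left to OpenCV's `warpPerspective` rather than done by hand. The sketch is mainly useful for judging how cheap this step is compared to the curvature-aware approach in Approach A.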

Approach C: TensorFlow

This blog post from 2019 talks about the use of TensorFlow / Machine Learning. Unfortunately, they note that geometric correction in the second step requires massive computational power and is not feasible to conduct solely on-device at the moment.

slifty commented 4 years ago

I spoke with @kfogel on this item and it is understood that (1) dewarping is a preprocessing step that is going to improve the outcome of OCR and (2) neither dewarping nor OCR is a perfectly solved problem.

To that end, we are going to follow the 80/20 rule and see what comes from an initial implementation with the understanding that there will be room for significant improvement, but that improvement should be explored after that first iteration.

kfogel commented 4 years ago

Just heard about another project that might have some useful references or code: https://gitlab.com/rstocker/scanner

@slifty, if there's some place (other than this issue) where you'd like me to put information about related projects, please let me know. We could create a separate document in the tree for that, or make a section in an existing document later, or whatever. I don't want these notes to be distracting, I just want to have a place to keep possibly-useful references. Even after we evaluate them, it's good to keep a record of what we evaluated, so that neither we nor others need to retrace those steps later.

kfogel commented 4 years ago

The thread "Ask HN: OCR framework for extracting formatted text" has a lot of links too.

slifty commented 4 years ago

Awesome, thank you for these @kfogel -- this is a fine place for them for now, and we can make another place for related projects later.

kfogel commented 3 years ago

One more: https://github.com/Ethereal-Developers-Inc/OpenScan:

"An open source app that enables users to scan hardcopies of documents or notes and convert it to a PDF file. No ads. No data collection. We respect your privacy."

(They don't say anything about OCR; not sure if that's included, or planned for the roadmap, or just not something they're doing.)

kfogel commented 3 years ago

One more: https://wiki.gnome.org/Apps/OCRFeeder