Word-Level OCR - Githubissues

ghost commented 6 years ago

Hi there. I have an idea for word-level training: it focuses on word-level recognition rather than the line-level and character-level recognition that Tesseract currently uses.

The idea is simple:

First, you train a new language model using a word list for a given language: generate an image for each word and train on those images.
Later on, when you want to recognize the text of a scanned image, Tesseract segments the image into lines, then segments those lines into words, and finally uses your trained model to recognize those words.

Now my question is: why would such an idea work, or not work?

@theraysmith @amitdo @Shreeshrii @stweil @kba

kba commented 6 years ago

This is called "word spotting". See https://sci-hub.tw/10.1016/j.patcog.2017.02.023 for a 2017 survey on word spotting and related technologies.

The issue tracker here is only for bugs in the software; the mailing list is the best place for questions like this.

amitdo commented 6 years ago

"segment those lines into words"

How do you suggest to segment the words?

"and finally use your trained model to recognize those words."

Again, which method to use here? Using a language model is only a small part of the recognition process.

ghost commented 6 years ago

@amitdo

For recognition, TPP-PHOCNet is the state-of-the-art word-spotting method, and it's open source! It achieves ~98% QbE (query-by-example) and QbS (query-by-string) rates, surpassing BLSTM by 14%!
For line segmentation, Tesseract will do, but if you want something fancy, there is ocroseg from Ocropus3.
The problem is word segmentation, and I was hoping you'd suggest a method.

Though I wonder: once a word detection/segmentation method is found, can Tesseract still be used to recognize the segmented words?

@kba Some recent and interesting word-spotting research: "Deep Learning for Word Spotting"; "Exploring Architectures for CNN-Based Word Spotting"; "Attribute CNNs for Word Spotting in Handwritten Documents".

amitdo commented 6 years ago

Improving OCR for an Under-Resourced Script Using Unsupervised Word-Spotting

https://www.cs.tau.ac.il/~wolf/papers/magicocr.pdf

ghost commented 6 years ago

@amitdo Are there any good methods available that can segment lines into words? Perhaps even on GitHub?

@kba Thanks for the survey, it was interesting.

amitdo commented 6 years ago

https://github.com/tesseract-ocr/tesseract/blob/master/src/textord/wordseg.cpp

This is used by the legacy engine.

ghost commented 6 years ago

@amitdo It was very hard to find a word segmentation tool, but I just found an interesting repo called Image2code; also have a look at their paper. What's interesting about their project is that they use a dynamic white-space threshold: for each line separately, their tool calculates the average space between the words in it.

That means their tool is suitable for both printed and handwritten word segmentation. They also released a project called Image2lines, which segments lines.
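A rough sketch of what a per-line dynamic white-space threshold could look like. This is not Image2code's actual code (which was not inspected here); it assumes a binarized line image given as rows of 0/1 pixels, and it assumes the threshold is simply the mean gap width within that line. The function names are made up for illustration.

```python
from statistics import mean

def ink_runs(line_img):
    """Return (start, end) column ranges (end exclusive) that contain ink."""
    width = len(line_img[0])
    ink = [any(row[x] for row in line_img) for x in range(width)]
    runs, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a run of inked columns begins
        elif not has_ink and start is not None:
            runs.append((start, x))        # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, width))
    return runs

def segment_words(line_img):
    """Merge ink runs into words using a threshold computed for this line only."""
    runs = ink_runs(line_img)
    if len(runs) < 2:
        return runs
    gaps = [b[0] - a[1] for a, b in zip(runs, runs[1:])]
    threshold = mean(gaps)                 # assumption: mean gap width of this line
    words = [runs[0]]
    for gap, run in zip(gaps, runs[1:]):
        if gap > threshold:
            words.append(run)              # wide gap: start a new word
        else:
            words[-1] = (words[-1][0], run[1])  # narrow gap: same word
    return words

# One-row toy "line image": '#' is ink, '.' is background.
toy_line = [[1 if c == "#" else 0 for c in "##.##....###.#"]]
print(segment_words(toy_line))
```

Because the threshold is recomputed per line, a tightly spaced line and a loosely spaced line are each split relative to their own spacing. A line whose gaps are all equal yields a single word, which is one obvious weakness of using a plain mean.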

amitdo, can you test Image2code word segmentation?

amitdo commented 6 years ago

A few points:

I don't know how many hours a month Ray spends developing Tesseract these days.
Although Tesseract is very popular software, we don't have a lot of outside contributors.
It's already quite hard to provide support for Tesseract.

My conclusion is that we should focus on small improvements to the existing CNN-LSTM engine.

That's my personal opinion.

ghost commented 6 years ago

@amitdo I do understand. My current idea is: 1) Create a huge .txt word list using Crunch. 2) Generate synthesized word images from that word list (words only, no white-space), perhaps with kraken. 3) Train a new language model using those word chunks.

To recognize a scanned document:

Segment pages into lines.
Segment lines into words.
Use the trained Word-level language model.

ghost commented 6 years ago

I was not talking about word-spotting, I was talking about word detection. I think I found what I was looking for:
https://github.com/gaxler/dataset_agnostic_segmentation
http://www.cs.tau.ac.il/~wolf/papers/dataset-agnostic-word.pdf

@amitdo @kba @stweil What do you think about my OCR pipeline below?

Training

Create a huge .txt word-list using Crunch.
Generate synthesized word images from that .txt word list (words only, no white-space) using kraken linegen.
Train a new language model using those word chunks, with Tesseract or Calamari.

Recognition

Segment pages into lines, using ocropus3 ocroseg.
Segment lines into words, using Agnostic Segmentation & heatmaps.
Use the trained Word-level language model to recognize the segmented word images.

amitdo commented 6 years ago

Nice find, but my previous answer still applies.

kba commented 6 years ago

@christophered

"Create a huge .txt word-list using Crunch."

Given how small a fraction of possible permutations of letters are words of natural languages, I would not recommend that. For brute-forcing passwords sure, but not for OCR.
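To put a rough number on that: a small sketch counting every string an exhaustive generator would emit for a given alphabet and maximum length, compared with a dictionary. The 100,000-word dictionary size is an illustrative assumption, not a measured figure for any language.

```python
def candidate_count(alphabet_size, max_len):
    """Number of distinct strings of length 1..max_len over the alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))

def dictionary_fraction(dictionary_size, alphabet_size, max_len):
    """Share of the exhaustive candidates that would be real words."""
    return dictionary_size / candidate_count(alphabet_size, max_len)

# 26 lowercase letters, strings of up to 6 characters.
total = candidate_count(26, 6)
print(total)  # 321272406

# Assumed dictionary of 100,000 words (hypothetical size, for scale only).
print(f"{dictionary_fraction(100_000, 26, 6):.6f}")
```

Even at six letters, well under 0.1% of the generated strings would be words under this assumption, and the gap widens quickly with length.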

"Generate synthesized words from that .txt wordlist, only words & no white-space, by kraken linegen."

Not sure whether training an engine on words only makes sense, because internally such engines still train on and predict sequences of glyphs. I'm also skeptical whether word segmentation followed by recognition of individual words beats one-step recognition of a line with implicit word segmentation. IMHO it would make more sense to use a word-level language model (or even a spell checker with a custom dict) in post-processing of the results.

But I haven't measured and wouldn't want to discourage you, best of luck and feel free to share your results. BTW, since you've been trying out a lot of software wrt OCR/segmentation etc., PRs to https://github.com/kba/awesome-ocr are very welcome :-)
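A minimal sketch of the "spell checker with a custom dict" post-processing idea, using only Python's standard `difflib`. The function names, the tiny dictionary, and the 0.7 similarity cutoff are illustrative assumptions; a real setup would need proper tokenization, punctuation handling, and a tuned cutoff.

```python
import difflib

def correct_token(token, dictionary, cutoff=0.7):
    """Replace a token by its closest dictionary word, if one is close enough."""
    lowered = token.lower()
    if lowered in dictionary:
        return token                       # already a known word
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    # Note: a replacement comes back in the dictionary's (lowercase) form.
    return matches[0] if matches else token

def correct_line(text, dictionary, cutoff=0.7):
    """Apply correct_token to each whitespace-separated token of an OCR line."""
    return " ".join(correct_token(t, dictionary, cutoff) for t in text.split())

# Hypothetical custom dictionary and a line with typical OCR confusions.
custom_dict = {"word", "level", "recognition", "tesseract"}
print(correct_line("w0rd levei recognition", custom_dict))
```

This keeps line-level recognition as it is and only touches the output text, which is why it is cheaper to try than retraining an engine on word images.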

ghost commented 6 years ago

@kba @stweil @amitdo In order to convert a framed image into line labels, I need the bounding box locations. How can I create/tag box locations from an image?

[Attached image: 111-compressed]

kba commented 6 years ago

Off the top of my head: ocropy encodes bboxes in PNG convertible to hOCR, kraken generates JSON, tesseract API with line iterator with some coding... All would be fairly straightforward to serialize into a list of x0,y0,x1,y1 coordinates.
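As a sketch of that serialization step for the hOCR case: hOCR stores each element's box in its `title` attribute as `bbox x0 y0 x1 y1`, so the line boxes can be pulled out with a couple of regular expressions. The sample markup below is hand-written for illustration, and a real script would be more robust with an HTML parser than with regexes.

```python
import re

# Any tag carrying class "ocr_line" (single or double quotes).
LINE_TAG_RE = re.compile(r"<[^>]*class=['\"]ocr_line['\"][^>]*>")
# The bbox property inside the title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

def line_bboxes(hocr):
    """Return a list of (x0, y0, x1, y1) tuples, one per ocr_line element."""
    boxes = []
    for tag in LINE_TAG_RE.findall(hocr):
        match = BBOX_RE.search(tag)
        if match:
            boxes.append(tuple(int(v) for v in match.groups()))
    return boxes

# Hand-written hOCR fragment (illustrative, not real engine output).
sample_hocr = """
<span class='ocr_line' id='line_1_1' title="bbox 36 92 618 141; baseline 0 -5">
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 141'>Hi</span>
</span>
<span class='ocr_line' id='line_1_2' title="bbox 36 150 600 199">...</span>
"""
print(line_bboxes(sample_hocr))
```

Word-level boxes could be collected the same way by matching `ocrx_word` instead of `ocr_line`.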

Kishlay-notabot commented 10 months ago

Even though this issue is closed, I'd like to mention an amazing tool for other people wandering in here: https://github.com/githubharald/WordDetector

There's another project that applies the above word detector tool in an OCR program [for Devanagari]: https://github.com/subhrajyotidasgupta/DevanagariHTR

I am working on a project that requires word detection followed by OCR, so I was researching this and was curious: does Tesseract do page segmentation well? Thanks.

tesseract-ocr / tesseract

Word-Level OCR #1886