tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Word-Level OCR #1886

Closed ghost closed 6 years ago

ghost commented 6 years ago

Hi there. I have an idea for word-level training: it focuses on word-level recognition rather than the line-level and character-level recognition that Tesseract currently uses.

The idea is simple:

Now my question is: why would or wouldn't such an idea work?

@theraysmith @amitdo @Shreeshrii @stweil @kba

kba commented 6 years ago

This is called "word spotting". See https://sci-hub.tw/10.1016/j.patcog.2017.02.023 for a 2017 survey on word spotting and related technologies.

The issue tracker here is only for bugs in the software; it is best to use the mailing list for questions like this.

amitdo commented 6 years ago

> segment those lines into words

How do you suggest segmenting the words?

> and finally use your trained model to recognize those words.

Again, which method to use here? Using a language model is only a small part of the recognition process.

ghost commented 6 years ago

@amitdo

Though I wonder: after finding a word detection/segmentation method, can Tesseract still be used to recognize the segmented words?

@kba Some recent and interesting word-spotting research: "Deep Learning for Word Spotting"; "Exploring Architectures for CNN-Based Word Spotting"; "Attribute CNNs for Word Spotting in Handwritten Documents".
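On the question of recognizing pre-segmented words: Tesseract's command line has a page segmentation mode that treats the whole image as a single word (`--psm 8`). A minimal sketch of calling it from Python follows. The helper names `build_word_ocr_command` and `recognize_word` are made up for illustration, and the sketch assumes a `tesseract` binary (version 4 or later) is on the PATH and that each input image is already cropped to one word.

```python
import subprocess


def build_word_ocr_command(image_path, lang="eng"):
    """Build a tesseract CLI call for an image cropped to one word.

    Hypothetical helper. "--psm 8" asks Tesseract to treat the image
    as a single word; "stdout" sends the text to standard output.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", "8"]


def recognize_word(image_path, lang="eng"):
    """Run tesseract on one word image and return the recognized text.

    Assumes the tesseract binary is installed; raises
    subprocess.CalledProcessError if the call fails.
    """
    result = subprocess.run(
        build_word_ocr_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Whether this beats line-level recognition on the same text is an open question in this thread; the sketch only shows that the engine can be pointed at single-word crops.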

amitdo commented 6 years ago

Improving OCR for an Under-Resourced Script Using Unsupervised Word-Spotting

https://www.cs.tau.ac.il/~wolf/papers/magicocr.pdf

ghost commented 6 years ago

@amitdo Are there any good methods available for segmenting lines into words, perhaps even on GitHub?

@kba Thanks for the survey, it was interesting.

amitdo commented 6 years ago

https://github.com/tesseract-ocr/tesseract/blob/master/src/textord/wordseg.cpp

This is used by the legacy engine.

ghost commented 6 years ago

@amitdo It was very hard to find a word segmentation tool, but I just found an interesting repo called Image2code; also have a look at their paper. What's interesting about their project is that they use a dynamic white-space threshold: for each line separately, their tool calculates the average space between the words within it.

That means that their tool is suitable for word segmentation of both printed and handwritten text. They also released a project called Image2lines, which segments lines.
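To make the per-line threshold idea concrete, here is a minimal sketch. It is not Image2code's actual code: the function name `split_line_into_words`, the boolean column-profile input, and the choice of the plain mean gap as the cut-off are all assumptions made for illustration.

```python
from statistics import mean


def split_line_into_words(ink_columns):
    """Split one text line into word spans with a per-line gap threshold.

    ink_columns: list of bools, True where a pixel column contains ink.
    Returns a list of (start, end) column spans, end exclusive.
    """
    # Collect runs of consecutive ink columns (glyphs or glyph groups).
    runs = []
    start = None
    for x, has_ink in enumerate(ink_columns):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(ink_columns)))
    if len(runs) < 2:
        return runs

    # The threshold is this line's own average gap (assumed rule).
    gaps = [b[0] - a[1] for a, b in zip(runs, runs[1:])]
    threshold = mean(gaps)

    # Gaps wider than the threshold start a new word; others are merged.
    words = [list(runs[0])]
    for gap, run in zip(gaps, runs[1:]):
        if gap > threshold:
            words.append(list(run))
        else:
            words[-1][1] = run[1]
    return [tuple(w) for w in words]
```

For the column profile `##.##....##.##` the gaps are 1, 4 and 1 columns, so only the 4-column gap exceeds the line average and the line splits into two words. One limitation of this simple rule: if every gap in a line has the same width, nothing exceeds the mean and the whole line comes back as one word.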

amitdo, can you test Image2code word segmentation?

amitdo commented 6 years ago

A few points:

My conclusion is that we should focus on small improvements to the existing CNN-LSTM engine.

That's my personal opinion.

ghost commented 6 years ago

@amitdo I do understand. My current idea is:

1) Create a huge .txt word list using Crunch.
2) Generate synthesized words from that word list (only words, no white space), perhaps with kraken.
3) Train a new language model using those word chunks.

To recognize a scanned document:

ghost commented 6 years ago

I was not talking about word spotting; I was talking about word detection. I think I found what I was looking for:
https://github.com/gaxler/dataset_agnostic_segmentation
http://www.cs.tau.ac.il/~wolf/papers/dataset-agnostic-word.pdf

@amitdo @kba @stweil what do you think about my OCR pipeline below?

Training

Recognition

amitdo commented 6 years ago

Nice find, but my previous answer still applies.

kba commented 6 years ago

@christophered

> Create a huge .txt word-list using Crunch.

Given how small a fraction of possible permutations of letters are words of natural languages, I would not recommend that. For brute-forcing passwords sure, but not for OCR.

> Generate synthesized words from that .txt wordlist, only words & no white-space, by kraken linegen.

Not sure whether training an engine on words only makes sense, because internally the engines train on and predict sequences of glyphs. I am also skeptical whether word segmentation followed by recognition of individual words beats one-step recognition of a line with implicit word segmentation. IMHO it would make more sense to use a word-level language model (or even a spell checker with a custom dictionary) in post-processing of the results.

But I haven't measured and wouldn't want to discourage you, so best of luck, and feel free to share your results. BTW, since you've been trying out a lot of software for OCR, segmentation, etc., PRs to https://github.com/kba/awesome-ocr are very welcome :-)
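The spell-checker-with-custom-dictionary suggestion above can be sketched with nothing but the Python standard library. This is an illustration only: the function name `correct_words`, the whitespace tokenization, and the 0.8 similarity cut-off are assumptions, and a real word-level language model would also use context.

```python
import difflib


def correct_words(ocr_text, dictionary, cutoff=0.8):
    """Replace each OCR token with its closest dictionary word, if any.

    Tokens already in the dictionary, and tokens with no match at or
    above the similarity cutoff, are left unchanged.
    """
    vocab = sorted(set(dictionary))
    known = set(vocab)
    corrected = []
    for token in ocr_text.split():
        if token in known:
            corrected.append(token)
            continue
        # difflib ranks candidates by SequenceMatcher similarity ratio.
        matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


print(correct_words("word segmnetation helps", ["word", "segmentation", "helps"]))
```

The sketch is case-sensitive and treats attached punctuation as part of the token, so real OCR output would need normalization before the lookup.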

ghost commented 6 years ago

@kba @stweil @amitdo To convert a framed image into line labels, I need the bounding box locations. How can I create or tag box locations from an image?

kba commented 6 years ago

Off the top of my head: ocropy encodes bboxes in PNGs that can be converted to hOCR, kraken generates JSON, and the Tesseract API has a line iterator you can use with some coding... All would be fairly straightforward to serialize into a list of x0,y0,x1,y1 coordinates.
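As a rough sketch of the serialization step for the hOCR route: hOCR stores each line's box in the `title` attribute as `bbox x0 y0 x1 y1`, so the standard library is enough to pull the line boxes out. The class and function names below (`LineBoxParser`, `hocr_line_boxes`) are made up for illustration, and the sketch assumes lines are marked with the `ocr_line` class.

```python
import re
from html.parser import HTMLParser

# hOCR keeps the box in the title attribute: "bbox x0 y0 x1 y1; ..."
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxParser(HTMLParser):
    """Collect (x0, y0, x1, y1) for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "ocr_line" in classes:
            match = BBOX.search(attr_map.get("title") or "")
            if match:
                self.boxes.append(tuple(int(v) for v in match.groups()))


def hocr_line_boxes(hocr):
    """Return line bounding boxes from an hOCR string, in document order."""
    parser = LineBoxParser()
    parser.feed(hocr)
    return parser.boxes
```

Word-level elements are skipped here because only `ocr_line` is matched; checking for a different class name would collect other levels in the same way.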

Kishlay-notabot commented 10 months ago

Even though this issue is closed, I'd like to mention an amazing tool for other people who wander in here: https://github.com/githubharald/WordDetector

There's another project that applies the above word detector in an OCR program (for Devanagari): https://github.com/subhrajyotidasgupta/DevanagariHTR

I am working on a project that requires both OCR and word detection before it, so I was researching this and was curious: does Tesseract do page segmentation well? Thanks.