ghost closed this issue 6 years ago.
This is called "word spotting". See https://sci-hub.tw/10.1016/j.patcog.2017.02.023 for a 2017 survey on word spotting and related technologies.
The issue tracker here is only for bugs in the software; it's best to use the mailing list for such questions.
> segment those lines into words
How do you suggest to segment the words?
> and finally use your trained model to recognize those words.
Again, which method to use here? Using a language model is only a small part of the recognition process.
@amitdo
TPP-PHOCNet is the state-of-the-art word-spotting method, and it's open sourced! It achieves ~98% QbE & QbS rates, surpassing BLSTM by 14%!

Though I wonder: after finding a word detection/segmentation method, can Tesseract still be used to recognize the segmented words?
@kba Some recent and interesting word-spotting research:
- Deep Learning for Word Spotting
- Exploring Architectures for CNN-Based Word Spotting
- Attribute CNNs for Word Spotting in Handwritten Documents
- Improving OCR for an Under-Resourced Script Using Unsupervised Word-Spotting
@amitdo are there any good methods available that can segment lines into words? Perhaps even on GitHub?
@kba Thanks for the survey, it was interesting.
https://github.com/tesseract-ocr/tesseract/blob/master/src/textord/wordseg.cpp
This is used by the legacy engine.
@amitdo It was very hard to find a word segmentation tool, but I just found an interesting repo called Image2code; also have a look at their paper. What's interesting about their project is that they use a dynamic white-space threshold, which means their tool calculates, for each line separately, the average space between the words within it. That means their tool is suitable for both printed and handwriting word segmentation. They even released a project called Image2lines, which segments lines.

@amitdo, can you test Image2code's word segmentation?
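The "dynamic white-space threshold" idea described above can be sketched in a few lines. This is a hypothetical illustration of the concept, not the actual Image2code implementation: for each line, measure the blank-column gaps between ink runs and split only at gaps wider than that line's own average gap.

```python
# Hypothetical sketch of per-line dynamic white-space thresholding
# (illustrative only; NOT the actual Image2code code).

def split_line_into_words(line, min_ink=1):
    """line: 2D list of 0/1 pixels for a single text line.
    Returns (start, end) column spans, one per detected word."""
    width = len(line[0])
    # Column ink profile: number of foreground pixels per column.
    profile = [sum(row[x] for row in line) for x in range(width)]

    # Collect runs of blank columns (gaps) between ink.
    gaps, start = [], None
    for x, ink in enumerate(profile):
        if ink < min_ink:
            if start is None:
                start = x
        else:
            if start is not None:
                gaps.append((start, x))
                start = None
    if not gaps:
        return [(0, width)]

    # Dynamic threshold: this particular line's average gap width.
    avg_gap = sum(e - s for s, e in gaps) / len(gaps)

    # Split only at gaps wider than the average (likely inter-word spaces);
    # narrower gaps are treated as intra-word letter spacing.
    spans, word_start = [], 0
    for s, e in gaps:
        if e - s > avg_gap:
            if s > word_start:
                spans.append((word_start, s))
            word_start = e
    if word_start < width:
        spans.append((word_start, width))
    return spans
```

Because the threshold is recomputed per line, tightly and loosely spaced lines (as in handwriting) each get a cutoff adapted to their own spacing, which is what makes the approach plausible for both print and handwriting.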
A few points:
My conclusion is that we should focus on small improvements to the existing CNN-LSTM engine.
That's my personal opinion.
@amitdo I do understand.
My current idea is:
1) Create a huge .txt word-list using Crunch.
2) Generate synthesized words from that word-list (only words, no white-space), perhaps with kraken.
3) Train a new language model using those word chunks.
To recognize a scanned document:
I was not talking about word-spotting; I was talking about word detection. I think I found what I was looking for: https://github.com/gaxler/dataset_agnostic_segmentation http://www.cs.tau.ac.il/~wolf/papers/dataset-agnostic-word.pdf
@amitdo @kba @stweil what do you think about my OCR pipeline below?
Training:
- Crunch.
- kraken linegen.
- Calamari.

Recognition:
- ocropus3 ocrseg.
- Agnostic Segmentation & heatmaps.
Nice find, but my previous answer still applies.
@christophered
> Create a huge .txt word-list using Crunch.
Given how small a fraction of possible permutations of letters are words of natural languages, I would not recommend that. For brute-forcing passwords sure, but not for OCR.
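To make the combinatorics concrete, here is a Crunch-style exhaustive generator sketched in Python (purely illustrative; Crunch itself is a separate CLI tool): it emits every string over a charset, so the output grows as |charset|^length, and almost none of those strings are natural-language words.

```python
# Illustrative Crunch-style wordlist generator: enumerate every string of a
# given length over a charset. For 26 lowercase letters and length 5 this is
# 26**5 = 11,881,376 candidates, while a natural language has only tens of
# thousands of real 5-letter-or-shorter words.
from itertools import product

def crunch(charset, length):
    """Yield every string of `length` characters drawn from `charset`."""
    for combo in product(charset, repeat=length):
        yield "".join(combo)
```

This is exactly why brute-force wordlists suit password cracking but make a poor training vocabulary for OCR: nearly all generated "words" will never occur in real documents.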
> Generate synthesized words from that .txt wordlist (only words & no white-space) by kraken linegen.
Not sure whether training an engine on words only makes sense, because internally they will train on/predict sequences of glyphs. I'm also skeptical whether word segmentation plus recognition of individual words beats one-step recognition of a line with implicit word segmentation. IMHO it would make more sense to use a word-level language model (or even a spell checker with a custom dict) in post-processing of the results.
But I haven't measured, and wouldn't want to discourage you; best of luck and feel free to share your results. BTW, since you've been trying out a lot of software wrt OCR/segmentation etc., PRs to https://github.com/kba/awesome-ocr are very welcome :-)
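The word-level post-processing suggested above can be sketched minimally: snap each recognized token to the nearest word in a custom dictionary when the edit distance is small. The dictionary and distance cutoff here are illustrative placeholders.

```python
# Minimal sketch of dictionary-based OCR post-correction: correct a token to
# the closest dictionary word if it is within a small edit distance.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(token, dictionary, max_dist=1):
    """Return the nearest dictionary word within max_dist, else the token."""
    if token in dictionary:
        return token
    best = min(dictionary, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else token
```

A real spell checker (hunspell, SymSpell) or a word-level language model would be far faster and smarter, but the principle is the same: exploit the fact that valid words are a tiny subset of character sequences.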
@kba @stweil @amitdo In order for me to convert a framed image into line labels, I need the bounding box locations. How can I create/tag box locations from an image?
Off the top of my head: ocropy encodes bboxes in PNG convertible to hOCR, kraken generates JSON, tesseract API with line iterator with some coding... All would be fairly straightforward to serialize into a list of x0,y0,x1,y1 coordinates.
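For the hOCR route, the serialization is straightforward because hOCR stores each element's box in its `title` attribute as `bbox x0 y0 x1 y1`. A rough sketch (regex-based for brevity; a real HTML parser would be more robust, and the attribute order assumed here matches typical Tesseract output):

```python
# Sketch: extract (x0, y0, x1, y1) word boxes from an hOCR document.
# hOCR encodes the box in the title attribute, e.g. title='bbox 10 20 30 40'.
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

def hocr_bboxes(hocr_text, clas="ocrx_word"):
    """Return (x0, y0, x1, y1) for every element of the given hOCR class."""
    boxes = []
    # Match each element of the requested class and read bbox from its title.
    for m in re.finditer(
            r"class=['\"]%s['\"][^>]*title=['\"]([^'\"]*)['\"]" % clas,
            hocr_text):
        b = BBOX_RE.search(m.group(1))
        if b:
            boxes.append(tuple(int(v) for v in b.groups()))
    return boxes
```

Swapping `clas` for `ocr_line` would give line boxes instead of word boxes; kraken's JSON output or the Tesseract API iterator can be flattened to the same x0,y0,x1,y1 list.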
Even though this issue is closed, I'd like to mention an amazing tool for other people wandering here: https://github.com/githubharald/WordDetector
There's another project which applies the above word detector to an OCR program for Devanagari: https://github.com/subhrajyotidasgupta/DevanagariHTR
I am working on a project which requires both OCR and word detection before it, so I was researching the topic and was curious: does Tesseract do page segmentation well? Thanks.
Hi there, currently I have an idea of word-level training: basically it focuses on word-level recognition rather than the line-level & character-level recognition that Tesseract currently uses. The idea is simple:

Create a word list containing the words of a certain language, generate an image for each word, and train on those images. To recognize a scanned document, segment it into lines, then segment those lines into words, and finally use your trained model to recognize those words.

Now my question is: why will or won't such an idea work?
@theraysmith @amitdo @Shreeshrii @stweil @kba
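The proposed recognition flow above can be summarized as a pipeline sketch. Every function here is a hypothetical stand-in (`segment_lines`, `segment_words`, `recognize_word` represent whatever line segmenter, word segmenter, and trained word recognizer get plugged in):

```python
# High-level sketch of the proposed word-level recognition pipeline.
# All three stage functions are hypothetical placeholders, passed in as
# arguments so any concrete segmenter/recognizer can be substituted.

def recognize_document(page_image, segment_lines, segment_words,
                       recognize_word):
    """Segment a page into lines, lines into words, recognize each word."""
    text_lines = []
    for line in segment_lines(page_image):
        words = [recognize_word(w) for w in segment_words(line)]
        text_lines.append(" ".join(words))
    return "\n".join(text_lines)
```

This makes the structural contrast with Tesseract's current approach explicit: here word segmentation happens before recognition, whereas the CNN-LSTM engine recognizes a whole line and infers word boundaries implicitly.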