tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
61.1k stars, 9.39k forks

Feature Request: Word miner & transcription #1066

Closed by ghost 7 years ago

ghost commented 7 years ago

Peace be upon you, @theraysmith. Here is a feature that I think you'll find interesting:

Computer Assisted Transcription:

[image: glyph]

theraysmith commented 7 years ago

Interesting! Thanks for the pointer. Recognita had this feature in the mid 1990s. The best ideas always return.

The difficulty is the same as for any character-level OCR: character segmentation. That said, it can probably learn the difference between "rn" and "m" from a few examples.
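To illustrate the segmentation problem, here is a minimal sketch using hand-drawn toy bitmaps. Everything in it is an assumption made for illustration: the `column_segments` helper, the `'#'`/`'.'` bitmap encoding, and the glyph shapes. It is not how Tesseract or glyph-miner segments characters. It splits a line wherever a column has no ink, which works for a spaced "rn" but yields a single blob when the two letters touch, just as it does for "m".

```python
from typing import List, Tuple


def column_segments(line: List[str]) -> List[Tuple[int, int]]:
    """Split a binary text line at columns that contain no ink ('#').

    Returns [start, end) column spans, one per run of inked columns.
    """
    width = len(line[0])
    has_ink = [any(row[x] == '#' for row in line) for x in range(width)]
    segments: List[Tuple[int, int]] = []
    start = None
    # A trailing False acts as a sentinel that closes the last run.
    for x, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            segments.append((start, x))
            start = None
    return segments


# Hypothetical toy glyphs: "r" then "n" with one blank column between them.
RN_SPACED = [
    "###.###",
    "#...#.#",
    "#...#.#",
]
# The same two glyphs printed so tightly that they touch.
RN_TOUCHING = [
    "######",
    "#..#.#",
    "#..#.#",
]
# A single "m".
M = [
    "#####",
    "#.#.#",
    "#.#.#",
]

print(column_segments(RN_SPACED))    # two segments
print(column_segments(RN_TOUCHING))  # one segment, like "m"
print(column_segments(M))            # one segment
```

With a purely geometric split, the touching "rn" and the "m" both come out as one unit, so telling them apart has to rely on the shape itself, which is where a few labeled examples could help.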

On Sat, Aug 5, 2017 at 4:28 AM, chris notifications@github.com wrote:

Peace be upon you, @theraysmith. Here is a feature that I think you'll find interesting:

Computer Assisted Transcription:

  • The software segments the pages into lines.
  • It then segments the lines into words.
  • Later on, the user transcribes words or glyphs. Once the user is satisfied, the software searches for all other instances of those words or glyphs and transcribes them automatically.
  • It can transcribe both glyphs and words, depending on the segmentation level you choose.

https://github.com/benedikt-budig/glyph-miner

[image: glyph] https://www.youtube.com/watch?v=T-p_kIdsn6k
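The workflow in the list above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not glyph-miner's or Tesseract's implementation: the page is a small text bitmap (`'#'` for ink, `'.'` for background), segmentation uses simple projection profiles, matching is a pixel-agreement score between equally sized crops, and all function names (`runs`, `segment_lines`, `segment_words`, `trim`, `similarity`, `propagate`) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Image = List[str]  # each row is a string; '#' is ink, '.' is background


def runs(flags: List[bool], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return [start, end) spans of True values; spans separated by fewer
    than min_gap False values are merged into one."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(flags + [False]):  # sentinel closes last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if spans and start - spans[-1][1] < min_gap:
                spans[-1] = (spans[-1][0], i)
            else:
                spans.append((start, i))
            start = None
    return spans


def segment_lines(page: Image) -> List[Image]:
    """Step 1: split the page at rows that contain no ink."""
    return [page[t:b] for t, b in runs(['#' in row for row in page])]


def segment_words(line: Image, min_gap: int = 2) -> List[Image]:
    """Step 2: split a line at blank column gaps of at least min_gap."""
    width = len(line[0])
    cols = [any(row[x] == '#' for row in line) for x in range(width)]
    return [[row[l:r] for row in line] for l, r in runs(cols, min_gap)]


def trim(img: Image) -> Image:
    """Drop blank rows above and below the ink."""
    inked = [r for r in range(len(img)) if '#' in img[r]]
    return img[inked[0]:inked[-1] + 1]


def similarity(a: Image, b: Image) -> float:
    """Fraction of agreeing pixels; 0.0 if the crops differ in size."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return 0.0
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / (len(a) * len(a[0]))


def propagate(words: List[Image], labels: Dict[int, str],
              threshold: float = 0.9) -> List[Optional[str]]:
    """Step 3: copy each user-supplied label to every sufficiently similar
    word; unmatched words stay None."""
    result: List[Optional[str]] = []
    for word in words:
        best_text, best_score = None, threshold
        for idx, text in labels.items():
            score = similarity(word, words[idx])
            if score >= best_score:
                best_text, best_score = text, score
        result.append(best_text)
    return result


# A made-up page: two lines, each holding three 3x3 "words".
PAGE = [
    "###..#.#..###",
    "#.#...#...#.#",
    "###..#.#..###",
    ".............",
    "#.#..###..#.#",
    ".#...#.#...#.",
    "#.#..###..#.#",
]

words = [trim(w) for line in segment_lines(PAGE) for w in segment_words(line)]
# The user transcribes only the first word; the label spreads to its matches.
print(propagate(words, {0: "ab"}))
```

Labeling word 0 propagates to the two other boxes with the same shape, while the crosses stay unlabeled until the user transcribes one of them. Real scans would need a tolerant matcher (for example normalized cross-correlation) rather than exact-size pixel agreement.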


-- Ray.