ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

How can I recognize Turkish characters with ocropy? #225

Open mustafaxfe opened 7 years ago

mustafaxfe commented 7 years ago

Hi, I want to recognize Turkish characters when running ocropy on a scanned image file, but it fails to recognize them. I have also edited chars.py so that default = ascii+xsymbols+german+french+portuguese+turkish, but nothing changed. What should I do or configure to recognize Turkish characters? Thanks in advance.

Steps to Reproduce (for bugs)

I ran these commands:

  1. ocropus-nlbin /home/mustafa/Downloads/IMG_20170612_185324.jpg -o book
  2. ocropus-gpageseg 'book/????.bin.png'
  3. ocropus-rpred -Q 4 'book/????/??????.bin.png'
  4. cat book/????/??????.txt > ocr.txt

But it can't recognize Turkish characters. Instead, it gives this:

    Basingli hava, elektrik motoru ya da igten yanmal motorlarla

But it should be:

    Basınçlı hava, elektrik motoru ya da içten yanmalı motorlarla
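To see exactly which characters are being lost, the two lines quoted above can be compared as character sets. This is a minimal sketch in Python 3 that uses only the two strings from this report; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import unicodedata

# The OCR output and the expected text, as quoted in this report.
recognized = u"Basingli hava, elektrik motoru ya da igten yanmal motorlarla"
expected = u"Basınçlı hava, elektrik motoru ya da içten yanmalı motorlarla"

# Characters present in the expected text that the OCR output never contains.
never_produced = sorted(set(expected) - set(recognized))
# Characters present in the OCR output that the expected text never contains.
spurious = sorted(set(recognized) - set(expected))

# Print code points and Unicode names so the output is readable on any terminal.
for label, chars in (("never produced", never_produced), ("spurious", spurious)):
    for c in chars:
        print("%s: U+%04X %s" % (label, ord(c), unicodedata.name(c)))
```

For this sample, the only characters never produced are ç and ı, and the only spurious one is g. That pattern is what one would expect from a model whose output alphabet does not contain those two letters.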

My Environment

  • Python version: Python 2.7.12
  • Git revision of ocropy: Master
  • Operating System and version: KDE Neon - 5.10.1

zuphilip commented 7 years ago

You would need a model trained on Turkish characters. The file chars.py influences the training step (ocropus-rtrain) but not the recognition. Some shared ocropus models can be found here https://github.com/tmbdev/ocropy/wiki/Models , however, there seems to be no model for Turkish characters so far.

mustafaxfe commented 7 years ago

Is it possible to train for Turkish characters? If yes, how can I train a model for Turkish characters?

zuphilip commented 7 years ago

You can either change chars.py before training or use the option ocropus-rtrain -c <FILES>, which constructs a codec from the input text of the indicated files.
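The idea behind building a codec from the input text can be sketched as follows. This is not ocropy's own code, only an illustration under the assumption that the codec is essentially the set of distinct characters found in the training transcriptions. The function name `charset_from_files` and the sample file are hypothetical, and the `.gt.txt` suffix is an assumed naming convention for ground-truth files.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def charset_from_files(paths):
    """Collect every distinct character found in the given UTF-8 text files."""
    chars = set()
    for path in paths:
        with io.open(path, encoding="utf-8") as f:
            for line in f:
                # Line terminators are not part of the transcription.
                chars.update(line.rstrip(u"\r\n"))
    return sorted(chars)


# Demo on a throwaway file; real use would pass the ground-truth text files.
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, "sample.gt.txt")  # hypothetical file name
with io.open(sample, "w", encoding="utf-8") as f:
    f.write(u"içten yanmalı\n")

charset = charset_from_files([sample])
print("%d distinct characters" % len(charset))
```

A character set derived this way contains ç and ı as soon as they occur in the training text, without any edit to chars.py.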

mustafaxfe commented 7 years ago

I have edited chars.py to add the Turkish characters and followed this tutorial ( https://github.com/cisocrgroup/OCR-Workshop/blob/master/presentations/m5-incunabula-practice.md) to train for Turkish. But I am not sure how long training takes. A day? A week? It has also been creating lots of files...
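If the many files are periodic model snapshots written during training, the newest one can be picked out by its iteration count. The sketch below assumes snapshot names that end in a zero-padded iteration number followed by `.pyrnn.gz`. That naming pattern is an assumption, not something stated in this thread, and the file names in the demo are made up.

```python
import re

# Assumed snapshot naming: <prefix>-<iteration>.pyrnn.gz
CHECKPOINT_RE = re.compile(r"-(\d+)\.pyrnn\.gz$")


def latest_checkpoint(filenames):
    """Return the name with the highest iteration number, or None if none match."""
    best_name, best_iter = None, -1
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match and int(match.group(1)) > best_iter:
            best_name, best_iter = name, int(match.group(1))
    return best_name


# Hypothetical directory listing.
names = [
    "turkish-00001000.pyrnn.gz",
    "turkish-00012000.pyrnn.gz",
    "turkish-00003000.pyrnn.gz",
    "notes.txt",
]
print(latest_checkpoint(names))
```

Testing a recent snapshot on a few held-out lines is one way to judge whether training has gone on long enough, rather than waiting for a fixed number of days.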

ghost commented 7 years ago

@mustafaxfe Can you post your chars.py file here?

mustafaxfe commented 7 years ago

My chars.py file:

    # -*- encoding: utf-8 -*-

    import re

    # common character sets

    digits = u"0123456789"
    letters = u"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    symbols = ur"""!"#$%&'()*+,-./:;<=>?@[]^_`{|}~"""
    ascii = digits+letters+symbols

    xsymbols = u"""€¢£»«›‹÷©®†‡°∙•◦‣¶§÷¡¿▪▫"""
    german = u"ÄäÖöÜüß"
    french = u"ÀàÂâÆæÇçÉéÈèÊêËëÎîÏïÔôŒœÙùÛûÜüŸÿ"
    turkish = u"ĞğŞşıſ"
    greek = u"ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω"
    portuguese = u"ÁÃÌÍÒÓÕÚáãìíòóõú"
    telugu = u" ఁంఃఅఆఇఈఉఊఋఌఎఏఐఒఓఔకఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహఽాిీుూృౄెేైొోౌ్ౘౙౠౡౢౣ౦౧౨౩౪౫౬౭౮౯"

    default = ascii+xsymbols+german+french+portuguese+turkish

    european = default+turkish+greek

    # List of regular expressions for normalizing Unicode text.
    # Cleans up common homographs. This is mostly used for
    # training text.

    # Note that the replacement of pretty much all quotes with
    # ASCII straight quotes and commas requires some
    # postprocessing to figure out which of those symbols
    # represent typographic quotes. See requote

    # TODO: We may want to try to preserve more shape; unfortunately,
    # there are lots of inconsistencies between fonts. Generally,
    # there seems to be left vs right leaning, and top-heavy vs bottom-heavy

    replacements = [
        (u'[_~#]',u"~"), # OCR control characters
        (u'"',u"''"), # typewriter double quote
        (u"`",u"'"), # grave accent
        (u'[“”]',u"''"), # fancy quotes
        (u"´",u"'"), # acute accent
        (u"[‘’]",u"'"), # left single quotation mark
        (u"[“”]",u"''"), # right double quotation mark
        (u"“",u"''"), # German quotes
        (u"„",u",,"), # German quotes
        (u"…",u"..."), # ellipsis
        (u"′",u"'"), # prime
        (u"″",u"''"), # double prime
        (u"‴",u"'''"), # triple prime
        (u"〃",u"''"), # ditto mark
        (u"µ",u"μ"), # replace micro unit with greek character
        (u"[–—]",u"-"), # variant length hyphens
        (u"ﬂ",u"fl"), # expand Unicode ligatures
        (u"ﬁ",u"fi"),
        (u"ﬀ",u"ff"),
        (u"ﬃ",u"ffi"),
        (u"ﬄ",u"ffl"),
    ]

    def requote(s):
        s = unicode(s)
        s = re.sub(ur"''",u'"',s)
        return s

    def requote_fancy(s,germanic=0):
        s = unicode(s)
        if germanic:
            # germanic quoting style reverses the shapes
            # straight double quotes
            s = re.sub(ur"\s+''",u"”",s)
            s = re.sub(u"''\s+",u"“",s)
            s = re.sub(ur"\s+,,",u"„",s)
            # straight single quotes
            s = re.sub(ur"\s+'",u"’",s)
            s = re.sub(ur"'\s+",u"‘",s)
            s = re.sub(ur"\s+,",u"‚",s)
        else:
            # straight double quotes
            s = re.sub(ur"\s+''",u"“",s)
            s = re.sub(ur"''\s+",u"”",s)
            s = re.sub(ur"\s+,,",u"„",s)
            # straight single quotes
            s = re.sub(ur"\s+'",u"‘",s)
            s = re.sub(ur"'\s+",u"’",s)
            s = re.sub(ur"\s+,",u"‚",s)
        return s
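A quick way to check whether the default set in the chars.py posted above really covers Turkish is to subtract it from the non-ASCII letters of the Turkish alphabet. This is a minimal sketch in Python 3. The character strings are copied from the posted file, with the symbol sets left out since they contain no letters. The name `turkish_special` is introduced here for illustration and is not part of ocropy.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Character sets copied from the posted chars.py (symbol sets omitted).
digits = u"0123456789"
letters = u"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
german = u"ÄäÖöÜüß"
french = u"ÀàÂâÆæÇçÉéÈèÊêËëÎîÏïÔôŒœÙùÛûÜüŸÿ"
portuguese = u"ÁÃÌÍÒÓÕÚáãìíòóõú"
turkish = u"ĞğŞşıſ"
default = digits + letters + german + french + portuguese + turkish

# Letters of the Turkish alphabet that fall outside plain ASCII
# (illustrative name, not defined in chars.py).
turkish_special = u"ÇçĞğıİÖöŞşÜü"

# Turkish letters that the default set does not contain.
missing = sorted(set(turkish_special) - set(default))
for c in missing:
    print("missing: U+%04X %s" % (ord(c), unicodedata.name(c)))
```

With the sets as posted, the only uncovered letter is the capital dotted İ (U+0130). Adding it to the turkish string before training, or building the codec from the training text instead, would close that gap.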