Open mustafaxfe opened 7 years ago
You would need a model trained on Turkish characters. The file chars.py influences the training step (ocropus-rtrain) but not recognition. Some shared ocropus models can be found at https://github.com/tmbdev/ocropy/wiki/Models ; however, there seems to be no model for Turkish characters so far.
Is it possible to train for Turkish characters? If so, how can I train a model for them?
You can either change chars.py before training or use the option ocropus-rtrain -c <FILES>, which constructs a codec from the input text of the indicated files.
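To make the idea of "constructing a codec from the input text" concrete, here is a minimal Python 3 sketch. It is not ocropy's implementation: the function names `build_codec` and `read_lines` are hypothetical, and the sample strings stand in for real ground-truth files. It only shows the general principle that the set of output classes is derived from the characters that actually occur in the training transcriptions.

```python
def build_codec(lines):
    """Return a sorted list of the distinct characters found in `lines`."""
    charset = set()
    for line in lines:
        # Line terminators are not characters the recognizer should learn.
        charset.update(line.rstrip("\n"))
    return sorted(charset)


def read_lines(paths):
    """Yield every line of the given UTF-8 text files (hypothetical helper)."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line


# In-memory stand-ins for ground-truth transcriptions (assumed sample data).
ground_truth = ["Şişli ve Beyoğlu\n", "İstanbul'da çay\n"]
codec = build_codec(ground_truth)

# Map each character to an integer class index.
char_to_class = {ch: index for index, ch in enumerate(codec)}

print(len(codec), "distinct characters")
print("ğ" in char_to_class, "q" in char_to_class)
```

With real data, `build_codec(read_lines(paths))` would do the same over files on disk. The practical consequence is that a character which never appears in the ground truth never becomes an output class, so the training text itself has to contain every Turkish letter you expect to recognize.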
I have edited chars.py to include the Turkish characters and followed this tutorial (https://github.com/cisocrgroup/OCR-Workshop/blob/master/presentations/m5-incunabula-practice.md) to train for Turkish. But I am unsure how long training takes: a day, or a week? Also, it has been creating lots of files...
@mustafaxfe can you send the chars.py file here?
My chars.py file:
# -*- encoding: utf-8 -*-

import re

# common character sets

digits = u"0123456789"
letters = u"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
symbols = ur"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
ascii = digits+letters+symbols

xsymbols = u"""€¢£»«›‹÷©®†‡°∙•◦‣¶§÷¡¿▪▫"""
german = u"ÄäÖöÜüß"
french = u"ÀàÂâÆæÇçÉéÈèÊêËëÎîÏïÔôŒœÙùÛûÜüŸÿ"
turkish = u"ĞğŞşıſ"
greek = u"ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω"
portuguese = u"ÁÃÌÍÒÓÕÚáãìíòóõú"
telugu = u" ఁంఃఅఆఇఈఉఊఋఌఎఏఐఒఓఔకఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహఽాిీుూృౄెేైొోౌ్ౘౙౠౡౢౣ౦౧౨౩౪౫౬౭౮౯"

default = ascii+xsymbols+german+french+portuguese+turkish

european = default+turkish+greek

# List of regular expressions for normalizing Unicode text.
# Cleans up common homographs. This is mostly used for
# training text.

# Note that the replacement of pretty much all quotes with
# ASCII straight quotes and commas requires some
# postprocessing to figure out which of those symbols
# represent typographic quotes. See `requote`

# TODO: We may want to try to preserve more shape; unfortunately,
# there are lots of inconsistencies between fonts. Generally,
# there seems to be left vs right leaning, and top-heavy vs bottom-heavy

replacements = [
    (u'[_~#]',u"~"), # OCR control characters
    (u'"',u"''"), # typewriter double quote
    (u"`",u"'"), # grave accent
    (u'[“”]',u"''"), # fancy quotes
    (u"´",u"'"), # acute accent
    (u"[‘’]",u"'"), # left single quotation mark
    (u"[“”]",u"''"), # right double quotation mark
    (u"“",u"''"), # German quotes
    (u"„",u",,"), # German quotes
    (u"…",u"..."), # ellipsis
    (u"′",u"'"), # prime
    (u"″",u"''"), # double prime
    (u"‴",u"'''"), # triple prime
    (u"〃",u"''"), # ditto mark
    (u"µ",u"μ"), # replace micro unit with greek character
    (u"[–—]",u"-"), # variant length hyphens
    (u"ﬂ",u"fl"), # expand Unicode ligatures
    (u"ﬁ",u"fi"),
    (u"ﬀ",u"ff"),
    (u"ﬃ",u"ffi"),
    (u"ﬄ",u"ffl"),
]

def requote(s):
    s = unicode(s)
    s = re.sub(ur"''",u'"',s)
    return s

def requote_fancy(s,germanic=0):
    s = unicode(s)
    if germanic:
        # germanic quoting style reverses the shapes
        # straight double quotes
        s = re.sub(ur"\s+''",u"”",s)
        s = re.sub(u"''\s+",u"“",s)
        s = re.sub(ur"\s+,,",u"„",s)
        # straight single quotes
        s = re.sub(ur"\s+'",u"’",s)
        s = re.sub(ur"'\s+",u"‘",s)
        s = re.sub(ur"\s+,",u"‚",s)
    else:
        # straight double quotes
        s = re.sub(ur"\s+''",u"“",s)
        s = re.sub(ur"''\s+",u"”",s)
        s = re.sub(ur"\s+,,",u"„",s)
        # straight single quotes
        s = re.sub(ur"\s+'",u"‘",s)
        s = re.sub(ur"'\s+",u"’",s)
        s = re.sub(ur"\s+,",u"‚",s)
    return s
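
One quick sanity check on a character set like the one above is to compare it against the full alphabet of the target language. The following Python 3 sketch copies the relevant strings from the chars.py listing and tests them against the 29-letter Turkish alphabet. The helper name `missing_characters` is hypothetical, and the check only says whether a letter is present in the character set, not whether a trained model has learned it.

```python
# Character sets copied from the chars.py listing above (Python 3 strings).
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
german = "ÄäÖöÜüß"
french = "ÀàÂâÆæÇçÉéÈèÊêËëÎîÏïÔôŒœÙùÛûÜüŸÿ"
portuguese = "ÁÃÌÍÒÓÕÚáãìíòóõú"
turkish = "ĞğŞşıſ"

# The 29-letter Turkish alphabet, upper case then lower case.
TURKISH_ALPHABET = (
    "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
    "abcçdefgğhıijklmnoöprsştuüvyz"
)


def missing_characters(required, charset):
    """Return the characters of `required` that do not occur in `charset`."""
    available = set(charset)
    return [ch for ch in required if ch not in available]


# Only the letter-bearing parts of `default` matter for this check.
charset = letters + german + french + portuguese + turkish
missing = missing_characters(TURKISH_ALPHABET, charset)
print(missing)  # → ['İ']
```

With the strings as listed, the capital dotted İ (U+0130) is the only Turkish letter not covered; Ç, Ö and Ü come in through the French and German sets. If that letter matters for your material, it would need to be added to the `turkish` string before training.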
Hi, I want to recognize Turkish characters when running ocropy on a scanned image file. It skips over the Turkish characters. I have also edited chars.py as follows:
default = ascii+xsymbols+german+french+portuguese+turkish
But nothing changed. What should I do or configure to recognize Turkish characters? Thanks in advance.
Steps to Reproduce (for bugs)
I ran these commands:
cat book/????/??????.txt > ocr.txt
But it can't recognize Turkish characters; instead, it gives this
But it should be