ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

How to train an ocropus model #293

gjerome5 closed this issue 6 years ago

gjerome5 commented 6 years ago

I want to create a new trained model like 'en-default.pyrnn.gz'. How can I do this? I have my training set as PNG files. Please help me create a trained model.

zuphilip commented 6 years ago

Start here: https://github.com/digiah/oldOCR/blob/master/ocropy_getting_started.pdf

kaushikacharya commented 6 years ago

http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html This article is also helpful.

zuphilip commented 6 years ago

I don't understand your question. Please describe more clearly what you did and what didn't work.

gjerome5 commented 6 years ago

I am working on cursive handwritten text recognition. For printed text, ocropus works fine, but for cursive handwriting the default model (en-default.pyrnn.gz) does not perform well, so we need to build a model for handwritten text. Do you have any ideas on this?

zuphilip commented 6 years ago

Create ground truth from your italic text and then train on it; see https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth and the other links at https://github.com/tmbdev/ocropy/wiki

kaushikacharya commented 6 years ago

The IAM database comes with ground truth for each text line.

Example of a line image and its corresponding ground truth:

Line image: https://imgur.com/cVPg0Qo
Ground truth text: A MOVE to stop Mr. Gaitskell from
Text filename: a01-000u-00.gt.txt
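Before training, it can help to confirm that every line image actually has a ground-truth file next to it. The sketch below is not part of ocropy. It assumes the pairing convention implied by the example above (the image name up to its first dot, plus `.gt.txt`), and the helper names `first_dot_base` and `find_unpaired_lines` are made up for illustration.

```python
import os
import tempfile


def first_dot_base(path):
    """Strip everything after the first dot in the file name (assumed convention)."""
    directory, name = os.path.split(path)
    return os.path.join(directory, name.split(".", 1)[0])


def find_unpaired_lines(folder):
    """Return PNG file names in `folder` that lack a non-empty .gt.txt partner."""
    missing = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".png"):
            continue
        gt_path = first_dot_base(os.path.join(folder, name)) + ".gt.txt"
        # An absent or empty ground-truth file gives the trainer nothing to learn from.
        if not os.path.isfile(gt_path) or os.path.getsize(gt_path) == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Small demo in a throwaway folder; the images are empty placeholder files.
    with tempfile.TemporaryDirectory() as folder:
        open(os.path.join(folder, "a01-000u-00.png"), "wb").close()
        open(os.path.join(folder, "a01-000u-01.png"), "wb").close()
        with open(os.path.join(folder, "a01-000u-00.gt.txt"), "w") as handle:
            handle.write("A MOVE to stop Mr. Gaitskell from\n")
        print(find_unpaired_lines(folder))  # ['a01-000u-01.png']
```

Running `find_unpaired_lines` on the real training folder before starting a long training run is a cheap way to catch lines that would otherwise be trained without a transcription.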

This is what I tried in order to train on the IAM database:

python ocropus-rtrain --load models/en-default.pyrnn.gz -o ../train_models/IAM_full/my_models ../data/IAM_database/traindata/*.png --ntrain 500000

I loaded the default English model and trained over that. This was suggested in the section "Training with the default model" in http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html

Here are my observations:

  1. Even after training for 10 lakh (1,000,000) iterations, the results were not good.
  2. The training process seems to be quite slow. (Or perhaps I didn't choose the options that would make it fast.)
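To make "not good" measurable, one common metric is the character error rate: the edit distance between predicted and ground-truth text, divided by the ground-truth length. The following is a minimal standalone sketch of that metric, not ocropy's own evaluation code (ocropy ships a separate `ocropus-errs` script for this purpose). The function names and the misrecognized sample prediction are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_error_rate(pairs):
    """pairs: iterable of (ground_truth, prediction) strings, one pair per text line."""
    errors = 0
    total = 0
    for truth, predicted in pairs:
        errors += edit_distance(truth, predicted)
        total += len(truth)
    return errors / float(total) if total else 0.0


if __name__ == "__main__":
    # Ground truth from the IAM example above; the prediction is invented for the demo.
    pairs = [("A MOVE to stop Mr. Gaitskell from",
              "A MOVE to stop Mr. Gaitskel from")]
    print(character_error_rate(pairs))  # one error over 33 characters
```

Tracking this number on a held-out set of lines at several checkpoints shows whether further iterations are still helping or the model has plateaued.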

zuphilip commented 6 years ago

1.) Ocropy is tailored for printed documents. For handwritten text, see also: https://github.com/tmbdev/ocropy/wiki/FAQ#can-ocropus-be-used-for-handwritten-text-recognition

2.) Training in ocropy is not that fast, but you can also look at the C++ implementation https://github.com/tmbdev/clstm