How can I improve ocropus accuracy?

IgorMunizS commented 6 years ago

Hi, I'm facing 2 problems:

1 - I need to use Ocropy to extract text from documents in Portuguese. So far, I have added the required characters in char.py and I am training the network, starting from a previously trained model, following this guide: https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth.

2 - I know about the document quality requirements (300 dpi), but some of my images are bad scans. I've tried the same images with other APIs (like Google Vision) and got better results, but I like ocropy. I'm wondering if there are any preprocessing techniques that can improve the results.

So, what can I do? What is the best way to generate data for training the ocropy network? Edit: Does ocropy training support multithreading? Thanks!

Python version: Python 2.7.14 :: Anaconda, Inc.
Git revision of ocropy:
commit e9b6121de2637e54495125c6a97a4ef75d872a2e
Merge: 43381c4 289a58f
Author: Konstantin Baierer kba@users.noreply.github.com
Date: Mon Feb 19 19:24:12 2018 +0100

Merge pull request #236 from lehzwo/master

ocropus-gpageseg: Enable usage of masks to specify column separators/ ignore areas of scan
Operating System and version: Linux ubuntu-virtual 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

zuphilip commented 6 years ago

What do your confusions look like currently, i.e. the output of ocropus-econf? In general, it is hard to say what can improve the accuracy. Can you share 2-3 of your documents here?

IgorMunizS commented 6 years ago

The confusion in test data is:

errors 233
missing 0
total 4894
err 4.761 %
errnomiss 4.761 %
28 ÇÆ çã
15 8 S
14 Ä á
13 Æ ã
12 Ë í
11 Ï ó
7 È é
7 0A ÇÃ
5 ÇÔ çõ
5 , .
0.0476093175317
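For reference, the summary figures can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic, not part of ocropy: the counts are copied from the output above, and the reading of each confusion row as "count, OCR output, ground truth" is an assumption about the column order.

```python
# -*- coding: utf-8 -*-
# Counts copied from the ocropus-econf summary above.
errors = 233
total = 4894

# Character error rate = errors / total characters.
err = errors / float(total)
print("err = %.3f %%" % (100 * err))

# Listed confusions as (count, first column, second column).
# Assumption: the second column is the ground-truth text.
confusions = [
    (28, u"ÇÆ", u"çã"),
    (15, u"8", u"S"),
    (14, u"Ä", u"á"),
    (13, u"Æ", u"ã"),
    (12, u"Ë", u"í"),
    (11, u"Ï", u"ó"),
    (7, u"È", u"é"),
    (7, u"0A", u"ÇÃ"),
    (5, u"ÇÔ", u"çõ"),
    (5, u",", u"."),
]

# How many of the errors the listed rows account for.
listed = sum(count for count, _, _ in confusions)

# How many of those involve a non-ASCII (accented) ground-truth character.
accented = sum(
    count for count, _, truth in confusions
    if any(ord(ch) > 127 for ch in truth)
)

print("listed confusions: %d of %d errors" % (listed, errors))
print("involving accented ground truth: %d of %d" % (accented, listed))
```

Under that reading, most of the listed confusions involve accented characters rather than general recognition quality.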

I left my model training all night on Portuguese texts and images generated by ocropus-linegen. The training error is decreasing, but the test error is worse than that of the default model (en version). The last 4 test errors are: 0.04298535663675012, 0.050070854983467174, 0.0547945205479452, 0.05550307038261691. Training is currently at 19000 iterations. I'll see if I can share some files and come back here. Thanks for your reply!
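When the training error falls while the test error rises, one common practice is to keep the saved model with the lowest test error instead of the latest one. The sketch below only illustrates that selection on the four values quoted above; it assumes they are listed oldest first, and it does not know which saved model file each value belongs to.

```python
# Last four test errors quoted above (assumed to be oldest first).
test_errors = [
    0.04298535663675012,
    0.050070854983467174,
    0.0547945205479452,
    0.05550307038261691,
]

# Position of the evaluation with the lowest test error.
best = min(range(len(test_errors)), key=lambda i: test_errors[i])

# True when each evaluation is worse than the one before it.
rising = all(a < b for a, b in zip(test_errors, test_errors[1:]))

print("best evaluation: #%d (error %.4f)" % (best, test_errors[best]))
print("test error rising steadily: %s" % rising)
```

Here the earliest of the four evaluations is the best one, and the test error has risen at every step since.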

Edit: Files: output-0 output-1 These are good files. For now, I'm trying to get better results with Portuguese characters and am not worrying about the scan quality.
