ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

How can I improve ocropus accuracy? #296

Open IgorMunizS opened 6 years ago

IgorMunizS commented 6 years ago

Hi, I'm facing 2 problems:

1 - I need to use ocropy to extract text from documents in Portuguese. So far, I have added the required characters to char.py, and I am training the network (starting from a previously trained model) following this guide: https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth.

2 - I know about the document quality requirements (300 dpi), but some of my images are bad scans. I've tried the same images with other APIs (like Google Vision) and got better results, but I like ocropy. I'm wondering whether there are preprocessing techniques that could improve the results.

So, what can I do? What is the best way to generate training data for the ocropy network? Edit: Does ocropy training support multithreading? Thanks!

zuphilip commented 6 years ago

What do your confusions look like currently, i.e. the output of ocropus-econf? In general, it is hard to say what will improve the accuracy. Can you share 2-3 of your documents here?

IgorMunizS commented 6 years ago

The confusion on the test data is:

    errors 233
    missing 0
    total 4894
    err 4.761 %
    errnomiss 4.761 %
    28 ÇÆ çã
    15 8 S
    14 Ä á
    13 Æ ã
    12 Ë í
    11 Ï ó
    7 È é
    7 0A ÇÃ
    5 ÇÔ çõ
    5 , .
    0.0476093175317
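As a sanity check on the figures above, here is a minimal Python sketch that recomputes the error rate from the reported counts and tallies how many of the listed confusions involve a non-ASCII (accented) character. The counts and pairs are copied from the output above. The `errnomiss` formula and the column names `left`/`right` are assumptions for illustration, not taken from the ocropus-econf source.

```python
# Summary counts copied from the ocropus-econf output quoted above.
errors = 233
missing = 0
total = 4894

# Error rate as a fraction of all ground-truth characters.
err = errors / total
# Assumption: "errnomiss" excludes errors from missing lines.
err_nomiss = (errors - missing) / total
print(f"err {100 * err:.3f} %  errnomiss {100 * err_nomiss:.3f} %")

# Listed confusions: (count, left column, right column), copied verbatim.
confusions = [
    (28, "ÇÆ", "çã"),
    (15, "8", "S"),
    (14, "Ä", "á"),
    (13, "Æ", "ã"),
    (12, "Ë", "í"),
    (11, "Ï", "ó"),
    (7, "È", "é"),
    (7, "0A", "ÇÃ"),
    (5, "ÇÔ", "çõ"),
    (5, ",", "."),
]


def involves_non_ascii(left, right):
    """True if either side of the confusion contains a non-ASCII character."""
    return any(ord(ch) > 127 for ch in left + right)


listed = sum(count for count, _, _ in confusions)
accented = sum(
    count for count, left, right in confusions if involves_non_ascii(left, right)
)
print(f"listed confusions: {listed} of {errors} errors")
print(f"of those, involving accented characters: {accented}")
```

If most of the listed confusions involve accented characters, that would suggest looking at how those characters are represented in the training data before looking at image quality.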

I left my model training all night on Portuguese texts and images generated by ocropus-linegen. The training error is decreasing, but the test error is worse than that of the default model (en version). Last 4 test errors: 0.04298535663675012, 0.050070854983467174, 0.0547945205479452, 0.05550307038261691. It's currently at 19000 iterations. I'll see if I can share some files and come back here. Thanks for your reply!

Edit: Files: output-0, output-1. These are good files. For now, I'm trying to get better results with Portuguese characters and am not worrying about the scan quality.
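The four test errors quoted above rise with each evaluation while the training error falls. One common way to handle this, sketched below in Python, is to keep the checkpoint with the lowest test error rather than the latest one. The error values are copied from the comment. The helper names are hypothetical, and reading the trend as overfitting (or as a mismatch between the generated lines and the test data) is an assumption, not something established in this thread.

```python
# Test errors copied from the comment above, oldest first.
test_errors = [
    0.04298535663675012,
    0.050070854983467174,
    0.0547945205479452,
    0.05550307038261691,
]


def best_checkpoint(errors):
    """Return the index of the evaluation with the lowest test error."""
    return min(range(len(errors)), key=errors.__getitem__)


def is_strictly_rising(errors):
    """True if every test error is larger than the one before it."""
    return all(a < b for a, b in zip(errors, errors[1:]))


best = best_checkpoint(test_errors)
print(f"best evaluation: index {best}, test error {test_errors[best]:.4f}")

if is_strictly_rising(test_errors):
    # Hypothetical interpretation: further training is not helping on the
    # test set, so the earlier checkpoint is the better candidate to keep.
    print("test error is rising; consider keeping the earlier checkpoint")
```

Only four evaluations are listed, so the trend is suggestive rather than conclusive. Evaluating more saved checkpoints the same way would show whether an earlier one does better still.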