How to get the best quality out of Ocropus

peeter-t2 commented 7 years ago

Just a question, but as I can't find a contact address, I'll post it as an issue. Thanks for the help! I'm trying to use Ocropus to process files in German Fraktur. Initial results look a bit worse than the test example. What can I do differently to improve the result?

Expected Behavior

It is really great to find the German Fraktur fonts attached to the package. We're trying to process some 18th- and 19th-century writings in German. I tried the ersch.png test as described in README.md and got fairly good results, as expected:

bie er wwol für Neri geschrieben, hatte, sind manche ganz mittelmäßige. -=- Im J. 1öi8S weihte er dem Papste den vierten Gand seiner vier - und fünfstimmigen Messen, ges (https://github.com/peeter-t2/test_files/blob/master/ersch.html)

I then tried a test page of my own, with considerably poorer results. I am wondering where I could improve this.

Current Behavior

I tried a test with the file https://github.com/peeter-t2/test_files/blob/master/ous1.png cleaned up from https://github.com/peeter-t2/test_files/blob/master/canvas.pdf.

The results were worse:

E kommt äaker von Izten, Nun gilt es kest eusammenstehn, licht rasten unc nict rasfen. Reict euc äie kianc https://github.com/peeter-t2/test_files/blob/master/ous1.html

I followed the commands here: https://github.com/peeter-t2/test_files/blob/master/processing.txt

Possible Solution

I'm looking for ways of improving the output. What would be the options I should try?

Can I maybe improve the black-and-white file quality somehow?
Do I need better pdf input files?
Can I run Ocropus differently for better results?
Do I need to train it on another corpus?

So essentially, do you know what makes the ersch example attached for testing work better than the file I gave, and what are the ways to improve the match? Thanks!

Steps to Reproduce (for bugs)

Follow the steps in https://github.com/peeter-t2/test_files/blob/master/processing.txt with the files in the repository.

Your Environment

Python version: running Python 2.7.6
Git revision of ocropy: from Mon Jan 23 15:21:16 2017 +0100
Operating System and version: Ubuntu 14.04

zuphilip commented 7 years ago

There are different things one can try here, and I cannot say exactly which option is best for you. You may try out some alternatives and see what works best.
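To compare alternatives more objectively than by eye, one option is to transcribe a few lines by hand and measure how far each OCR output is from that transcription. The following is only a minimal sketch of a character error rate based on edit distance; the function names are made up for illustration and are not part of OCRopus, and the "ground truth" string in the usage note is a guess at what the page says.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(ocr_text, truth_text):
    """Edit distance divided by the length of the hand-made transcription."""
    if not truth_text:
        raise ValueError("the ground-truth text must not be empty")
    return edit_distance(ocr_text, truth_text) / float(len(truth_text))
```

For example, `char_error_rate("unc nict", "und nicht")` compares a fragment of the OCR output quoted above with a guessed transcription; a lower value for one setting than for another suggests that setting is working better on that line.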

a. Binarization

There are other ways to binarize the image and it may help...

b. Gray Images

One more, somewhat hidden, option is to use the grayscale images with OCRopus, which might lead to better results (AFAIK you still need the binarized versions in the intermediate steps). Try:

ocropus-nlbin temp/canvas.png
ocropus-gpageseg --gray temp/canvas.bin.png
ocropus-rpred temp/canvas/*.nrm.png

c. Training and Model for Font

To me this does not look like a Fraktur font, at least not a typical one:

[Two image crops compared: one from your picture, one from ersch.png]

I tried the usual font model as well as the Fraktur font model, and neither result seems very good. However, it seems that some letters in particular are giving problems, e.g. d or c. Thus, I guess that it might really help to train on this font a little to obtain a better model that you can then use for recognition.

peeter-t2 commented 7 years ago

Thanks a lot for your very quick response! Indeed, on a closer look, it was not proper Fraktur at all, with just some letters being similar. Maybe there are not many texts like this after all.

I did a new test with proper Fraktur https://github.com/peeter-t2/test_files/blob/master/out06.png to https://github.com/peeter-t2/test_files/blob/master/out6.html and got really superb results! Thanks a lot for developing the package!

I will try adjusting the parameters some more, and have a look at the greyscale options, thanks for the tip! The best results so far were with a 45% threshold for making it black and white. Any higher than that and you get too much noise from the paper, while lower than that there isn't enough black for each letter. I'm not sure if there is a good way to exclude those long stripes of black due to the scans, but they can also just be excluded as nonsense words from the dataset.
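For reference, the thresholding and the stripe problem can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the image is a list of rows of 0-255 gray values, pixels at or below the cutoff count as black (so a higher threshold gives more black, matching what I saw), and a "stripe" is any column that is black in almost every row. The function names and the 0.9 fraction are made up for the example, and this is not how OCRopus binarizes internally.

```python
def binarize(gray, threshold=0.45, max_value=255):
    """Global threshold: 1 = black ink, 0 = white paper.
    Pixels at or below threshold * max_value are treated as black."""
    cutoff = threshold * max_value
    return [[1 if px <= cutoff else 0 for px in row] for row in gray]


def stripe_columns(binary, min_fraction=0.9):
    """Indices of columns that are black in at least min_fraction of rows,
    which is a rough signature of a vertical scan stripe."""
    if not binary:
        return []
    height = len(binary)
    width = len(binary[0])
    return [x for x in range(width)
            if sum(row[x] for row in binary) >= min_fraction * height]


def blank_columns(binary, columns):
    """Return a copy of the image with the given columns set to white."""
    cols = set(columns)
    return [[0 if x in cols else px for x, px in enumerate(row)]
            for row in binary]


# Tiny hypothetical 4x4 page: column 3 is a dark stripe, column 1 has some ink.
page = [
    [250,  30, 240, 10],
    [245, 200, 235, 20],
    [250,  40, 120, 15],
    [240, 220, 230,  5],
]
binary = binarize(page, threshold=0.45)
cleaned = blank_columns(binary, stripe_columns(binary))
```

Raising the threshold to 0.5 in this toy page turns the mid-gray pixel (120) black as well, which is the "noise from the paper" effect in miniature. A stripe that is not perfectly vertical would need a deskew step first, which this sketch does not attempt.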

Thanks for the thorough response, I'll try the parameter settings, but the results are indeed really good!

ocropus-archive / DUP-ocropy