Closed by peeter-t2 7 years ago
There are several things you can try here, and I cannot say which option will work best for your material. Try a few alternatives and compare the results.
a. Binarization
There are other ways to binarize the image and it may help...
b. Gray Images
One more, somewhat hidden, option is to use the grayscale images with OCRopus, which might lead to better results (AFAIK you still need the binarized versions in the intermediate steps). Try:
ocropus-nlbin temp/canvas.png
ocropus-gpageseg --gray temp/canvas.bin.png
ocropus-rpred temp/canvas/*.nrm.png
c. Training and Model for Font
To me this does not look like a Fraktur font, or at least not a typical one:
[Image comparison: a sample from your picture next to a sample from ersch.png]
I tried the usual font model as well as the Fraktur font model, and neither result looks very good. However, it seems that some letters in particular are causing problems, such as d or c. So I guess it might really help to train on this font a little to obtain a better model that you can then use for recognition.
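As a rough illustration of option (a), here is a minimal sketch of one common alternative to a hand-picked threshold, Otsu's method, which picks the cut-off that best separates dark and light pixels. This is only an example technique and not what ocropus-nlbin does internally. The function names `otsu_threshold` and `binarize` are hypothetical helpers, and the image is assumed to be a list of rows of 8-bit gray values (0 = black, 255 = white).

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for an iterable of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of the gray values of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet, nothing to separate
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map pixels <= threshold to 0 (ink) and all others to 255 (paper)."""
    return [[0 if p <= threshold else 255 for p in row] for row in image]


# Tiny made-up sample: one dark row (ink) and one light row (paper).
sample = [[30, 40], [200, 210]]
t = otsu_threshold(p for row in sample for p in row)
binary = binarize(sample, t)
```

On a real scan you would load the page with an image library first, and whether this beats a fixed threshold depends on how uneven the paper background is.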
Thanks a lot for your very quick response! On a closer look, it was indeed not proper Fraktur at all, with just some letters being similar. Maybe there are not many texts like this after all.
I ran a new test with proper Fraktur (https://github.com/peeter-t2/test_files/blob/master/out06.png, output at https://github.com/peeter-t2/test_files/blob/master/out6.html) and got really superb results! Thanks a lot for developing the package!
I will try adjusting the parameters some more and have a look at the greyscale options, thanks for the tip! The best results so far were with a 45% threshold for converting to black and white. Any higher than that and you get too much noise from the paper, while anything lower didn't leave enough black for each letter. I'm not sure if there is a good way to exclude those long stripes of black caused by the scans, but they can also just be excluded as nonsense words from the dataset.
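To make the thresholding and the stripe problem concrete, here is a minimal sketch under stated assumptions: the 45% is interpreted as a fraction of the full 0-255 gray range, and the scan stripes are assumed to be vertical, so that a column consisting almost entirely of ink is treated as a stripe. The helper names (`binarize_fixed`, `stripe_columns`, `blank_columns`) and the 0.9 ink ratio are hypothetical, not part of OCRopus.

```python
def binarize_fixed(image, fraction=0.45, max_value=255):
    """Mark pixels darker than fraction * max_value as ink (1), others as paper (0)."""
    cutoff = fraction * max_value
    return [[1 if p < cutoff else 0 for p in row] for row in image]


def stripe_columns(binary, min_ink_ratio=0.9):
    """Return indices of columns where at least min_ink_ratio of the pixels are ink."""
    height = len(binary)
    if height == 0:
        return []
    width = len(binary[0])
    return [
        x for x in range(width)
        if sum(row[x] for row in binary) / height >= min_ink_ratio
    ]


def blank_columns(binary, columns):
    """Return a copy of the image with the given columns set to paper (0)."""
    cols = set(columns)
    return [[0 if x in cols else v for x, v in enumerate(row)] for row in binary]


# Made-up 4x3 gray image: column 0 is a solid dark stripe,
# column 1 holds a single dark "letter" pixel, column 2 is paper.
page = [
    [10, 200, 220],
    [10, 50, 220],
    [10, 200, 220],
    [10, 200, 220],
]
binary_page = binarize_fixed(page)
stripes = stripe_columns(binary_page)
cleaned = blank_columns(binary_page, stripes)
```

The ratio would need tuning on real pages, since a tall rule or a decorated initial could also fill most of a column; filtering nonsense words afterwards remains the simpler fallback.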
Thanks for the thorough response, I'll try the parameter settings, but the results are indeed really good!
Just a question, but as I can't find a contact address, I'll post it as an issue. Thanks for the help! I'm trying to use Ocropus to process files in German Fraktur. Initial results look a bit worse than the test example. What can I do differently to improve the result?
Expected Behavior
It is really great to find the German Fraktur models included with the package. We're trying to process some 18th and 19th century writings in German. I tried out the test of ersch.png as described in README.md and got fairly good results, as expected.
I then tried a test page of my own, with considerably poorer results. I am wondering where I could improve this.
Current Behavior
I tried a test with the file https://github.com/peeter-t2/test_files/blob/master/ous1.png cleaned up from https://github.com/peeter-t2/test_files/blob/master/canvas.pdf.
The results were less good:
I followed the commands here: https://github.com/peeter-t2/test_files/blob/master/processing.txt
Possible Solution
I'm looking for ways of improving the output. What would be the options I should try?
So essentially, do you know what makes the ersch example provided for testing work better than the file I gave, and what are the ways to improve the match? Thanks!
Steps to Reproduce (for bugs)
Follow the steps in https://github.com/peeter-t2/test_files/blob/master/processing.txt with the files in the repository.
Your Environment