tmbdev / clstm

A small C++ implementation of LSTM networks, focused on OCR.
Apache License 2.0

Arabic 800,000-epoch model can't go below error rate 0.5 #133

Open ghost opened 7 years ago

ghost commented 7 years ago

I have been training an Arabic language model from scratch for days now, reaching 800,000+ epochs, but the error rate won't go below 0.5, which is very bad. I used artificial training data that I created myself; here are its specifications: Arabic, no diacritics, 300 dpi, black and white, 100% correct transcriptions, about 2100 lines. The CLSTM settings are hidden=100 and lrate=1e-4.

Can anybody help? @tmbdev @mittagessen

amitdo commented 7 years ago

Try to find out where that 0.5 comes from.

Maybe the errors are mostly with dot, comma, and spaces.

zuphilip commented 7 years ago

I would suggest using ocropus-econf *.gt.txt to see the most common confusions; see https://github.com/tmbdev/ocropy/wiki/Compute-errors-and-confusions.
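If you want a quick look at the confusions without the ocropy tooling, something like the following sketch can tally them from paired ground-truth/recognition strings. It is an assumption on my part that this approximates what ocropus-econf reports — it uses Python's difflib alignment, not ocropy's exact edit-distance code, so the numbers may differ slightly:

```python
from collections import Counter
from difflib import SequenceMatcher

def char_confusions(gt, ocr):
    """Count character-level confusions (gt_segment, ocr_segment) between
    a ground-truth line and the recognizer's output for that line."""
    counts = Counter()
    sm = SequenceMatcher(None, gt, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        counts[(gt[i1:i2], ocr[j1:j2])] += 1
    return counts

def error_rate(gt, ocr):
    """Approximate character error rate: total length of non-matching
    regions divided by the ground-truth length."""
    sm = SequenceMatcher(None, gt, ocr, autojunk=False)
    errors = sum(max(i2 - i1, j2 - j1)
                 for tag, i1, i2, j1, j2 in sm.get_opcodes()
                 if tag != "equal")
    return errors / max(len(gt), 1)
```

Running this over all line pairs and printing `counts.most_common(20)` should show whether the 0.5 is dominated by punctuation and spaces, as suggested above, or by something systematic.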

mittagessen commented 7 years ago

The error is almost certainly caused by incorrect ordering of the training data (the error will usually hover around 0.6). The code points have to be in display order (i.e. left-to-right) instead of reading order (right-to-left). If you've created the lines using kraken/ketos, run linegen with the --reorder option to fix this. It doesn't default to this option because the training interface is intended to deal with the reordering for you once it is finished.
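To make the display-order vs. reading-order point concrete, here is a minimal sketch, assuming each line is purely RTL (Arabic with no embedded Latin text or digits): in that special case, converting from reading (logical) order to display order is just reversing the codepoints. Mixed-direction text needs the full Unicode BiDi algorithm (UAX #9), e.g. via the python-bidi package, not this shortcut:

```python
import unicodedata

def is_pure_rtl(line):
    """True if every strong-direction character in the line is RTL
    (bidirectional class 'R' or 'AL'); spaces and punctuation are neutral."""
    strong = [unicodedata.bidirectional(c) for c in line]
    strong = [d for d in strong if d in ("L", "R", "AL")]
    return all(d in ("R", "AL") for d in strong)

def to_display_order(line):
    """Convert a purely RTL line from reading order to display order by
    reversing its codepoints. NOT valid for mixed-direction text."""
    assert is_pure_rtl(line), "naive reversal is only valid for pure RTL text"
    return line[::-1]
```

A quick check of whether your ground-truth files were generated in reading or display order is to apply this to one line and see which version matches the rendered line image left to right.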

ghost commented 7 years ago

Thanks for your reply. I am taking all your suggestions and will work on tracing the error. Please keep the issue open.