lars76 / chinese-subtitle-ocr

Optical character recognition for Chinese subtitles using SSD and CNN
MIT License

Is the architecture compatible with Western fonts #2

Closed IDerr closed 6 years ago

IDerr commented 6 years ago

Hi, congratulations on this project. I wanted to ask whether this is a good way to OCR images with Latin fonts, or whether it is more specific to Chinese fonts.

Thanks for your work

lars76 commented 6 years ago

Hi,

Chinese is a special case, because the language has a much larger "alphabet" and only about 2 "letters" per word. Languages that use Latin characters have a smaller alphabet, but longer words and also spaces between words. So Latin-based languages put more strain on the quality of the language model. This is why sequence models and losses such as RNN/LSTM/GRU/CTC are quite popular for them.

Thus, I don't think that just changing the training data will suffice. Fortunately, there is a lot of research on Latin-based OCR. You can look here for some example code: https://github.com/keras-team/keras/blob/master/examples/image_ocr.py
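
To make the CTC part a bit more concrete, here is a minimal sketch of greedy (best-path) CTC decoding, which is how a sequence model's per-frame outputs are usually collapsed into a word of variable length. This is not code from this repository or from the linked Keras example; the function name `ctc_greedy_decode`, the choice of index 0 as the blank symbol, and the toy alphabet are all assumptions made for illustration.

```python
from typing import List, Sequence

BLANK = 0  # assumption: class index 0 is the CTC "blank" symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Collapse per-frame class scores into text (best-path decoding).

    frame_scores[t][k] is the score of class k at time step t.
    Class k >= 1 corresponds to alphabet[k - 1]; class 0 is the blank.
    """
    # Pick the highest-scoring class for every time step.
    best_path = [
        max(range(len(scores)), key=lambda k: scores[k]) for scores in frame_scores
    ]

    decoded: List[str] = []
    previous = BLANK
    for index in best_path:
        # Emit a character only if it is not blank and not a repeat of the
        # previous frame; a blank between two equal characters keeps both.
        if index != BLANK and index != previous:
            decoded.append(alphabet[index - 1])
        previous = index
    return "".join(decoded)


# Toy example: classes are blank, "a", "b", " " (hypothetical alphabet).
frames = [
    [0.10, 0.80, 0.05, 0.05],  # "a"
    [0.10, 0.70, 0.10, 0.10],  # "a" again, merged with the previous frame
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "b"
]
print(ctc_greedy_decode(frames, "ab "))  # prints "ab"
```

The point of the sketch is that the network predicts one class per image column, so long Latin words and spaces come out as a sequence rather than as one fixed-size classification per character box.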

IDerr commented 6 years ago

Thanks a lot :)