In principle, Tesseract is probably as accurate as (or slightly more accurate than) ocropy/clstm.
Tesseract has official trained models for ~100 languages; ocropy has official models for English and German only. Unlike ocropus, Tesseract works on Windows.
BLSTM is implemented and used.
2D-LSTM is also implemented in the library. I think (not sure) it's not used by the released traineddata. Using 2D-LSTM means a much longer time to train a model, and for OCRing printed text the accuracy will not necessarily be better than with a 1D-BLSTM.
BTW, ocropy doesn't have 2D-LSTM support.
Please see https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs
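For orientation, the spec language on that page covers both 1-D and 2-D LSTM layers. A minimal sketch of the relevant layer codes, paraphrased from the linked page (check the page itself for the authoritative syntax):

```
# 1-D LSTM layers:
#   L(f|r|b)(x|y)[s]<n>  -- n outputs; f=forward, r=reversed, b=bidirectional,
#                           running along x or y; the optional s summarizes that
#                           dimension, keeping only the final step
Lbx256    # example: a bidirectional 1-D LSTM along x with 256 outputs

# 2-D LSTM layer:
#   L2xy<n>  -- full 2-D LSTM, bidirectional in both x and y (quad-directional)
L2xy64    # example: a quad-directional 2-D LSTM with 64 outputs
```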
"But the main problem is that most of the decisions that are being taken focus mostly on English (Latin languages)"
This is not really the situation with the LSTM engine.
The difference in accuracy between Latin-script-based languages and Arabic is due to …
Also, the OCR stage depends on the layout analysis stage, which is weaker for Arabic.
Shree, Indic scripts are even more complex...
I checked Arabic today with the default traineddata; it has about 80% accuracy. What are you looking for?
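For context, a minimal sketch of how such a figure could be checked, assuming Tesseract 4.x with the `ara` traineddata installed, plus the `pytesseract` wrapper and Pillow; the file names and the ground-truth transcription are placeholders:

```python
import difflib

import pytesseract
from PIL import Image

# Run the LSTM engine (--oem 1) on a scanned Arabic page (placeholder file name).
text = pytesseract.image_to_string(Image.open("arabic_page.tif"),
                                   lang="ara", config="--oem 1")

# Rough character-level similarity against a hand-made transcription (placeholder file);
# SequenceMatcher.ratio() is only a crude proxy for character accuracy.
truth = open("arabic_page.gt.txt", encoding="utf-8").read()
accuracy = difflib.SequenceMatcher(None, truth, text).ratio()
print(f"approximate character accuracy: {accuracy:.0%}")
```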
@amitdo Thanks for clearing things up; improved pre-processing may make 1D-LSTM outperform the more complex MDLSTM. You were right. I see, the main issue is not the OCR engine directly, but layout analysis/segmentation/classification. Perhaps I should focus on a combination of Tesseract LSTM and a computer-assisted transcription method, somewhat similar to: https://sites.google.com/site/paradiitproject/project-definition
@Shreeshrii So Tesseract 4.x is capable of producing more sophisticated and complex structures.
@roozgar I was looking for a method that achieves an 85%+ recognition rate for the Arabic language. Tesseract 3.x used Cube for Arabic, which made me lose hope, but thanks to the developers of Tesseract 4.0 for introducing the new LSTM engine; the hope is back and the community is excited. I am looking forward to testing this version after reading that you got 80% recognition. @roozgar, can you share your training process, the tif/box files, and the traineddata?
As I said, I got that result with the official traineddata; I haven't started my own training yet...
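For anyone who wants to try, the usual Tesseract 4.0 LSTM fine-tuning flow is roughly the sketch below; the language, paths, and iteration count are placeholders, and the TrainingTesseract 4.00 wiki page is the authoritative reference:

```
# Extract the LSTM model from an existing traineddata, to fine-tune from:
combine_tessdata -e ./tessdata/ara.traineddata ./out/ara.lstm

# Render training text into line data (.lstmf files; tif/box pairs are produced along the way):
tesstrain.sh --lang ara --linedata_only \
    --langdata_dir ./langdata --tessdata_dir ./tessdata --output_dir ./train

# Fine-tune, then evaluate a checkpoint:
lstmtraining --continue_from ./out/ara.lstm \
    --traineddata ./train/ara/ara.traineddata \
    --train_listfile ./train/ara.training_files.txt \
    --model_output ./out/ara --max_iterations 10000

lstmeval --model ./out/ara_checkpoint \
    --traineddata ./train/ara/ara.traineddata \
    --eval_listfile ./train/ara.training_files.txt
```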
@roozgar what operating system are you using?
Please see Ray's comment with accuracy figures in https://github.com/tesseract-ocr/tesseract/issues/40
I have found Hindi to have much greater accuracy with the LSTM engine.
@shree Ubuntu 16 LTS
This is what is used for most of the languages: https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs#full-example-a-multi-layer-lstm-capable-of-high-quality-ocr
I think it is 2D-LSTM.
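If it helps, the full example on that page is a stack roughly like the sketch below (layer sizes quoted from memory, so verify against the page). Strictly speaking, only the `Lfys` layer runs in the y direction, and it summarizes the height away, so the stack reads 2-D context once and then works as 1-D LSTMs along x; it is not a full quad-directional `L2xy` MDLSTM:

```
[1,0,0,1        # input: batch 1, variable height and width, 1 channel (greyscale)
 Ct5,5,16       # 5x5 convolution with tanh, 16 outputs
 Mp3,3          # 3x3 max-pooling
 Lfys64         # forward y-LSTM that summarizes the height dimension to 1
 Lfx128 Lrx128  # forward and reversed 1-D LSTMs along x (together a BLSTM)
 Lfx256         # one more forward x-LSTM
 O1c105]        # 1-D output layer, softmax + CTC, 105 classes
```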
@amitdo Thanks. I had been told that the 4.x version of Tesseract would be the next big leap; now I believe it.
1/ Please, how can I use a BLSTM to segment a page into text lines? 2/ Can you give me a BLSTM architecture model related to online document page segmentation of text? Thank you :)
I understand that much of the new Tesseract 4.0 uses a customized implementation of Ocropus, relying basically on the new LSTM recognition engine.
But the main problem is that most of the decisions that are being taken focus mostly on English (Latin languages), which is already able to reach 95%+ recognition rates easily. My concern is allowing the other languages, such as Arabic, to reach the PRECISION CEILING.
Methods such as BLSTM (bidirectional LSTM) and the two-dimensional 2D-LSTM, which is called MDLSTM, can achieve, without explicit segmentation of words, character-level accuracies of 92% and 96%! And I repeat, without explicit segmentation.
So my question is: are there plans to implement and ascend the current LSTM to an MDLSTM (multi-dimensional LSTM)? This would radically make ALL THE LANGUAGES ABLE TO PASS THAT PRECISION CEILING.
I am planning to engage in testing Tesseract 4.0 LSTM on the Arabic language, and I want to post results in the future; I hope there will be recognition improvement while testing. Thank you, Ray, for your hard work, and to all contributors: you are appreciated.
More information about BLSTM and MDLSTM: https://www.nist.gov/sites/default/files/documents/itl/iad/mig/OpenHaRT2013_WorkshopPres_A2IA.pdf http://www.a2ialab.com/lib/exe/fetch.php?media=presentations:icdar2015_chinese_slides.pdf https://goo.gl/0wUNfm