wanghaisheng / awesome-ocr

A curated list of promising OCR resources
http://wanghaisheng.github.io/ocr-arxiv-daily/
MIT License
1.67k stars 351 forks

Natural-scene text detection with Keras and TensorFlow, and variable-length Chinese OCR recognition with CTC #67

Closed wanghaisheng closed 6 years ago

wanghaisheng commented 7 years ago

https://github.com/chineseocr/chinese-ocr looks really good.
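For context on the "CTC" in the issue title: a recognizer trained with CTC emits one label per time step, including a special blank, and the variable-length text is recovered by collapsing repeats and dropping blanks. Below is a minimal sketch of that greedy decoding step. It is not code from the linked repository; the function name `ctc_greedy_decode`, the four-entry alphabet, and the frame scores are all made up for illustration.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    frame_scores: list of per-time-step score lists, one score per class.
    alphabet: maps class index -> character; index `blank` is the CTC blank.
    """
    decoded = []
    previous = blank
    for scores in frame_scores:
        # Most likely class for this time step.
        label = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the label is not blank and differs from the
        # previous frame's label (repeated frames are one character).
        if label != blank and label != previous:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


# Hypothetical alphabet: index 0 is the blank symbol.
ALPHABET = ["", "中", "文", "字"]

frames = [
    [0.10, 0.80, 0.05, 0.05],  # 中
    [0.10, 0.80, 0.05, 0.05],  # 中 again, collapsed into the previous one
    [0.90, 0.03, 0.03, 0.04],  # blank separates two identical characters
    [0.10, 0.80, 0.05, 0.05],  # 中, emitted again because a blank came first
    [0.10, 0.10, 0.70, 0.10],  # 文
]
print(ctc_greedy_decode(frames, ALPHABET))  # prints 中中文
```

Five input frames yield three characters here, which is why the output length does not have to be fixed in advance.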

xiaomaxiao commented 6 years ago

I gave it a run. Environment: Windows 10, TITAN X. CTPN: 125 ms. CRNN at 32x800: 160 ms.

wanghaisheng commented 6 years ago

@xiaomaxiao Did you ever find the LSTM training data for Tesseract?

xiaomaxiao commented 6 years ago

@wanghaisheng Tesseract has not published its LSTM training data.

wanghaisheng commented 6 years ago

@xiaomaxiao https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017

https://github.com/tesseract-ocr/tessdata/issues/72

https://github.com/tesseract-ocr/tessdata/

These language data files only work with Tesseract 4. They are based on the sources in tesseract-ocr/langdata on GitHub.

Get language data files for Tesseract 3.04 or 3.05 from the 3.04 tree.

More information and a complete list of all languages is available in the Tesseract wiki.

xiaomaxiao commented 6 years ago

@wanghaisheng https://github.com/tesseract-ocr/langdata/issues/94

Langdata has not been updated for 4.0

You can use current files for finetuning, not for training from scratch.

wanghaisheng commented 6 years ago

@xiaomaxiao How well does chinese-ocr perform?

xiaomaxiao commented 6 years ago

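@wanghaisheng CTPN generalizes well and detects most text, but retraining it on scanned documents would work better. The CRNN part is the relatively time-consuming one.

Have you tested EAST, TextBoxes, or the like?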

wanghaisheng commented 6 years ago

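@xiaomaxiao For scanned documents we now do our own line segmentation. What I wanted to ask is how good the recognition is. How does it compare with Tesseract?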

xiaomaxiao commented 6 years ago

@wanghaisheng CRNN is better than Tesseract.

How do you do the line segmentation? Could you share it?

wanghaisheng commented 6 years ago

@xiaomaxiao Sorry, I can't share the line segmentation for now.
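The line-segmentation method discussed above was not shared, so the following is not that method. It is only a generic baseline that is often tried first on clean, deskewed scans: sum the ink pixels in each row (a horizontal projection profile) and cut wherever the profile drops to empty. The function name `split_lines`, its parameters, and the toy page are assumptions made for this sketch.

```python
def split_lines(binary_image, min_ink=1, min_height=1):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    binary_image: list of rows, each a list of 0 (background) or 1 (ink).
    min_ink: a row counts as text when it has at least this many ink pixels.
    min_height: runs of text rows shorter than this are discarded as noise.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in binary_image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y  # a text run begins
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))  # a text run ends
            start = None
    # Close a run that reaches the bottom edge of the page.
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))
    return lines


# Toy 6x4 "page": two text lines separated by blank rows.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(split_lines(page))  # prints [(1, 3), (5, 6)]
```

This kind of profile cut assumes binarized, roughly horizontal text. Skewed pages, multi-column layouts, or touching lines would need deskewing or a learned detector such as the CTPN mentioned earlier in the thread.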