Open englianhu opened 3 years ago
I believe the Chinese language model needs to be trained again. I know you are getting this reply after a long time. Have you tried to fix it?
Updated Data Files (September 15, 2017)

We have three sets of .traineddata files on GitHub in three separate repositories. These are compatible with Tesseract 4.0x+ and 5.0.0.Alpha.

| | Trained models | Speed | Accuracy | Supports legacy | Retrainable |
| -- | -- | -- | -- | -- | -- |
| [tessdata](https://github.com/tesseract-ocr/tessdata) | Legacy + LSTM (integerized tessdata-best) | Faster than tessdata-best | Slightly less accurate than tessdata-best | Yes | No |
| [tessdata-best](https://github.com/tesseract-ocr/tessdata_best) | LSTM only (based on [langdata](https://github.com/tesseract-ocr/langdata)) | Slowest | Most accurate | No | Yes |
| [tessdata-fast](https://github.com/tesseract-ocr/tessdata_fast) | Integerized LSTM of a smaller network than tessdata-best | Fastest | Least accurate | No | No |

Most users will want `tessdata_fast` and that is what will be shipped as part of Linux distributions.

`tessdata_best` is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files which can be used for certain retraining scenarios for advanced users.

The third set in `tessdata` is the only one that supports the legacy recognizer. The 4.00 files from November 2016 have both legacy and older LSTM models. The current set of files in `tessdata` have the legacy models and newer LSTM models (integer versions of 4.00.00 alpha models in tessdata_best).

Note: When using the new models in the `tessdata_best` and `tessdata_fast` repositories, only the new LSTM-based OCR engine is supported. The legacy engine is not supported with these files, so Tesseract’s oem modes ‘0’ and ‘2’ won’t work with them.
I am trying to download a few different OCR models to analyse https://gd-pub.jinshujufiles.com/di/20180308130431_f4fead, but how do I download them?
## https://github.com/tesseract-ocr/tessdata
if(is.na(match('chi_sim.traineddata', tesseract_info()$available)))
tesseract_download('tesseract-ocr/tessdata/chi_sim.traineddata')
Downloaded: 0.10 MB
Error: Download failed: HTTP 404
if(is.na(match('chi_sim_vert.traineddata', tesseract_info()$available)))
tesseract_download('tesseract-ocr/tessdata/chi_sim_vert.traineddata')
Downloaded: 0.10 MB
Error: Download failed: HTTP 404
## https://github.com/tesseract-ocr/tessdata_best
if(is.na(match('chi_sim.traineddata', tesseract_info()$available)))
tesseract_download('tesseract-ocr/tessdata_best/chi_sim.traineddata')
Downloaded: 0.10 MB
Error: Download failed: HTTP 404
if(is.na(match('chi_sim_vert.traineddata', tesseract_info()$available)))
tesseract_download('tesseract-ocr/tessdata_best/chi_sim_vert.traineddata')
Downloaded: 0.10 MB
Error: Download failed: HTTP 404
## https://github.com/tesseract-ocr/tessdata_fast
if(is.na(match('chi_sim.traineddata', tesseract_info()$available)))
tesseract_download('tesseract-ocr/tessdata_fast/chi_sim.traineddata')
Downloaded: 0.10 MB
Error: Download failed: HTTP 404
if(is.na(match('chi_sim_vert.traineddata', tesseract_info()$available)))
tesseract_download('tesseract-ocr/tessdata_fast/chi_sim_vert.traineddata')
Downloaded: 0.10 MB
Error: Download failed: HTTP 404
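The 404 responses above are probably caused by passing a repository path such as 'tesseract-ocr/tessdata/chi_sim.traineddata' where `tesseract_download()` expects a bare language code such as 'chi_sim'. Likewise, `tesseract_info()$available` lists language codes, not file names, so the `match()` check never finds 'chi_sim.traineddata'. Below is a minimal sketch of two possible workarounds, not a confirmed fix. It assumes the R tesseract package is installed. The helpers `traineddata_url()` and `fetch_traineddata()` are hypothetical names introduced here, the branch name `main` and the local directory `tessdata_best` are assumptions, and `image.png` is a placeholder.

```r
# Hypothetical helper: build the raw-file URL of a .traineddata file in one
# of the three tesseract-ocr repositories. The branch "main" is an assumption.
traineddata_url <- function(lang,
                            repo = c("tessdata_fast", "tessdata_best", "tessdata"),
                            branch = "main") {
  repo <- match.arg(repo)
  sprintf("https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata",
          repo, branch, lang)
}

# Hypothetical helper: download one language file into `dir` unless it is
# already there, and return the local path.
fetch_traineddata <- function(lang, repo = "tessdata_best", dir = "tessdata_best") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dir, paste0(lang, ".traineddata"))
  if (!file.exists(dest)) {
    # mode = "wb" keeps the binary file intact on Windows
    download.file(traineddata_url(lang, repo), dest, mode = "wb")
  }
  dest
}

# Run only in an interactive session: this part needs network access and
# the tesseract package.
if (interactive()) {
  library(tesseract)

  # Option 1: let the package fetch its default model by language code.
  if (!("chi_sim" %in% tesseract_info()$available)) {
    tesseract_download("chi_sim")
  }

  # Option 2: fetch the files from a specific repository by hand and point
  # the engine at that directory.
  for (lang in c("chi_sim", "chi_sim_vert")) {
    fetch_traineddata(lang, repo = "tessdata_best", dir = "tessdata_best")
  }
  engine <- tesseract(language = "chi_sim", datapath = "tessdata_best")
  text <- ocr("image.png", engine = engine)  # "image.png" is a placeholder
}
```

Keeping each repository's files in its own directory makes it possible to compare the `tessdata`, `tessdata_best` and `tessdata_fast` models on the same image by changing only `repo` and `dir`.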
I tried to `ocr` an image in `chi_sim`, but the quality is not very good and some characters are not recognized. Is there any way to improve accuracy?

Originally posted by @englianhu in https://github.com/tesseract-ocr/tessdata/issues/146#issuecomment-738143925