tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Encoding of string failed! Chinese #3848

Open Gnakkk opened 2 years ago

Gnakkk commented 2 years ago

Environment

Tesseract Version: v5.1.0.20220510

Current Behavior:

Extracting tessdata components from chi_sim.traineddata
Wrote chi_sim.lstm
Version:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
17:lstm:size=12152851, offset=2158
18:lstm-punc-dawg:size=282, offset=12155009
19:lstm-word-dawg:size=590634, offset=12155291
20:lstm-number-dawg:size=82, offset=12745925
21:lstm-unicharset:size=258834, offset=12746007
22:lstm-recoder:size=72494, offset=13004841
23:version:size=84, offset=13077335
Loaded file D:\Download\tess\tess_trainrance\chi_sim.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from D:\Download\tess\tess_trainrance\chi_sim.lstm
Encoding of string failed! Failure bytes: e8 af b6 e8 ae a4 e8 ae a4 e8 ae a4 7e 7e 7e e2 80 a6 e2 80 a6 e3 80 8d
Can't encode transcription: '銆屽憸鍛溾€︹€?楦e憸鍛滆璁よ璁~~鈥︹€︺€? in language ''
Encoding of string failed! Failure bytes: e8 af b6 e8 ae a4 e8 ae a4 e8 ae a4 7e 7e 7e e2 80 a6 e2 80 a6 e3 80 8d
Can't encode transcription: '銆屽憸鍛溾€︹€?楦e憸鍛滆璁よ璁~~鈥︹€︺€? in language ''
Encoding of string failed! Failure bytes: e8 af b6 e8 ae a4 e8 ae a4 e8 ae a4 7e 7e 7e e2 80 a6 e2 80 a6 e3 80 8d
Can't encode transcription: '銆屽憸鍛溾€︹€?楦e憸鍛滆璁よ璁~~鈥︹€︺€? in language ''
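
The "Failure bytes" in the log are plain UTF-8, so they can be decoded to see exactly which characters the encoder stopped on, independent of how the console rendered the transcription. A minimal Python sketch (the helper name decode_failure_bytes is made up for illustration):

```python
def decode_failure_bytes(hex_dump: str) -> str:
    """Turn a space-separated hex dump from the log back into text.

    Assumes the bytes are valid UTF-8, as they are in the log above.
    """
    return bytes.fromhex(hex_dump).decode("utf-8")


# Failure bytes copied from the log above.
failure = "e8 af b6 e8 ae a4 e8 ae a4 e8 ae a4 7e 7e 7e e2 80 a6 e2 80 a6 e3 80 8d"
text = decode_failure_bytes(failure)

# List each character with its code point; the dump starts at the first
# character the encoder could not handle.
for ch in text:
    print(f"U+{ord(ch):04X} {ch}")
```

For this log the decoded string begins with '诶' (U+8BF6), which matches the character reported as missing below.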

Expected Behavior:

The image I am training on contains the Chinese character '诶', and training reports the error "Encoding of string failed!". After investigating and testing, I found the cause: the chi_sim.traineddata from tessdata_best does not include '诶'. What should I do? How can I add '诶' to chi_sim.traineddata?
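
One way to confirm which characters are missing is to compare the ground-truth text against the model's unicharset. The sketch below is only an illustration under some assumptions: the unicharset has been extracted to a text file (for example with combine_tessdata -u), its first line is the entry count, and each following line starts with the character followed by a space. The function names and the sample entries are hypothetical, and the per-code-point check is approximate because a unicharset entry may span several code points.

```python
def load_unichars(lines):
    """Collect the leading token of every entry line in a unicharset listing."""
    it = iter(lines)
    next(it, None)  # skip the first line (assumed to hold the entry count)
    chars = set()
    for line in it:
        token = line.rstrip("\n").split(" ", 1)[0]
        if token:
            chars.add(token)
    return chars


def missing_chars(text, unichars):
    """Return the non-whitespace characters of text that have no entry."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in unichars})


# Tiny in-memory stand-in for an extracted unicharset (made-up properties).
# With a real file: load_unichars(open(path, encoding="utf-8"))
sample = ["3\n", "NULL 0 Common 0\n", "认 5 Han 1\n", "~ 10 Common 2\n"]
print(missing_chars("诶认认认~~~", load_unichars(sample)))
```

Any character this reports would need to be added to the character set before the corresponding ground-truth line can be encoded.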

Suggested Fix:

nobblanger commented 2 years ago

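https://blog.csdn.net/watt/article/details/124099032 You can refer to this blog post. Its author unpacked the Chinese model from tessdata_best and found that its character set is incomplete. You can follow the steps there and train from scratch to add the Chinese characters you need.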

Warning: LSTMTrainer deserialized an LSTMRecognizer! One more question: I also get this warning during training, but training proceeds normally. What does this warning mean, what effect does it have, and how can it be avoided?

EurekaChen commented 1 year ago

If the text contains characters from other ranges, such as emoji, the same error still occurs:

Encoding of string failed! Failure bytes: f0 9f 80 94
Can't encode transcription: '🀔 ' in language ''

How can this be solved?