tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

English traineddata file does not contain the '±' character? #48

Open Furtifk opened 1 year ago

Furtifk commented 1 year ago

English traineddata file does not contain the '±' character?

Environment
Tesseract Version: 5.00
Downloaded from: https://github.com/UB-Mannheim/tesseract/wiki
Platform: Windows 10 64-bit

I am trying to OCR using the English traineddata file from https://tesseract-ocr.github.io/tessdoc/Data-Files. I notice the character is not listed in the English unicharset either: https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset

Are there any plans to add it? Are there any language files that can successfully OCR this character?

Many thanks to whoever can assist here. I am attaching the file I used to test this character: https://github.com/tesseract-ocr/langdata_lstm/files/9870674/Special.Symbols.pdf
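For anyone who wants to repeat the unicharset check without reading the file by eye, here is a minimal sketch. It assumes the usual unicharset layout (the first line is the entry count, and each following line starts with the symbol followed by space-separated properties). The function names and the sample text are hypothetical and only for illustration; the sample is not the real eng.unicharset.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in unicharset-formatted text."""
    symbols = set()
    # Assumption: the first line holds the number of entries, so skip it.
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        # Assumption: the symbol is the first space-separated field.
        symbols.add(line.split(" ", 1)[0])
    return symbols


def unicharset_contains(text, char):
    """Check whether a single symbol appears as an entry."""
    return char in unicharset_symbols(text)


# Made-up sample in the assumed layout, with simplified property fields.
SAMPLE = "3\nNULL 0 Common 0\na 3 Latin 1\n+ 0 Common 2\n"

print(unicharset_contains(SAMPLE, "+"))
print(unicharset_contains(SAMPLE, "±"))
```

With a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to `unicharset_contains` in the same way.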

amitdo commented 1 year ago

> Are there any plans to add it?

The best/fast models were uploaded 5 years ago. AFAIK, no one is working on updating them.

Furtifk commented 1 year ago

Thanks for the information and the fast reply. Do you know of any workaround I could use to OCR this character?

Many thanks ahead of time ^^

stweil commented 1 year ago

The official script/Latin model includes ±. You could also try any of my models from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/, for example https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021_09/tessdata_fast/frak2021-09.traineddata.

Furtifk commented 1 year ago

> The official script/Latin model includes ±. You could also try any of my models from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/, for example https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021_09/tessdata_fast/frak2021-09.traineddata.

Thanks a lot. I will try this and let you know here if it does indeed work for us going forward.

Furtifk commented 1 year ago

After further testing, it appears that both lat.traineddata (https://tesseract-ocr.github.io/tessdoc/Data-Files) and your own model struggle to recognize this character in my example. Is the lat.traineddata file linked above the Latin model you meant? If not, where can I find and download the right one to try it?

Many thanks!

stweil commented 1 year ago

lat.traineddata is a different model. script/Latin is in https://github.com/tesseract-ocr/tessdata_fast/tree/main/script. Or simply re-run the installer and select it there for installation.
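A minimal sketch of calling the script/Latin model from Python follows. It assumes the `tesseract` executable is on PATH and that Latin.traineddata sits in a `script` subfolder of the tessdata directory. The helper names, file names, and paths are hypothetical; adjust them to the local installation.

```python
import subprocess


def build_tesseract_command(image_path, output_base, tessdata_dir,
                            lang="script/Latin"):
    """Build the tesseract CLI call as an argument list."""
    # Assumption: <tessdata_dir>/script/Latin.traineddata exists,
    # which is what "-l script/Latin" would refer to.
    return [
        "tesseract", image_path, output_base,
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
    ]


def run_ocr(image_path, output_base, tessdata_dir, lang="script/Latin"):
    """Run tesseract and return the recognized text."""
    cmd = build_tesseract_command(image_path, output_base, tessdata_dir, lang)
    subprocess.run(cmd, check=True)
    # tesseract writes plain-text output to <output_base>.txt by default.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For example, `run_ocr("page.png", "out", r"C:\path\to\tessdata")` would run the model on a hypothetical `page.png`, after which the result can be checked with `"±" in text`.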

Furtifk commented 1 year ago

Thanks for the link. I have tried the Latin.traineddata model, but I'm still not having much luck getting this character from the test file or from our internal files. I'm guessing there's not much else that can be done here? Thanks for the help and suggestions nonetheless.