robertknight / tesseract-wasm

JS/WebAssembly build of the Tesseract OCR engine for use in browsers and Node
https://robertknight.github.io/tesseract-wasm/
BSD 2-Clause "Simplified" License

Support for Non-Latin Characters #90

Open k3ntar0 opened 1 year ago

k3ntar0 commented 1 year ago

This project is wonderful! How can we make it compatible with characters used in non-Latin scripts, for example, Japanese characters? Is tessdata available for them?

robertknight commented 1 year ago

This library loads the same model data as the C++ Tesseract, so you can load the file for your language from https://github.com/tesseract-ocr/tessdata_best.

> How can we make it compatible with characters used in non-Latin scripts, for example, Japanese characters?

The code in this project is in theory script-independent, in the sense that it is mostly concerned with getting data into Tesseract as pixels and out as bounding boxes and Unicode text. If you load the right model, non-Latin languages may already work. However, I have not done any testing of this myself and there may be some extra work required. This is an area where I could use some help from interested users of the library.
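
For anyone who wants to try this, here is a minimal, untested sketch of what loading a Japanese model might look like. It is an illustration under stated assumptions, not a confirmed recipe: the helper names `tessdataBestUrl` and `recognizeWith` are hypothetical, the raw-file URL pattern and the `main` branch name are assumptions about how the tessdata_best repository is served, and the `OcrLike` interface only mirrors the `loadModel`, `loadImage` and `getText` methods described in this project's README.

```typescript
// Structural stand-in for the OCR client so this sketch type-checks without
// the package or DOM typings. The method names follow the tesseract-wasm
// README; the exact signatures are an assumption.
interface OcrLike<Image> {
  loadModel(model: string): Promise<void>;
  loadImage(image: Image): Promise<void>;
  getText(): Promise<string>;
}

// Hypothetical helper: build a download URL for a tessdata_best model.
// Assumes files are reachable as raw files on the `main` branch and are
// named `<lang>.traineddata` (e.g. `jpn`, `jpn_vert`, `chi_sim`).
function tessdataBestUrl(lang: string): string {
  if (!/^[a-z]+(_[a-z]+)*$/.test(lang)) {
    throw new Error(`Unexpected language code: ${lang}`);
  }
  return `https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main/${lang}.traineddata`;
}

// Hypothetical helper: load the model for `lang`, then OCR a single image.
async function recognizeWith<Image>(
  client: OcrLike<Image>,
  lang: string,
  image: Image
): Promise<string> {
  await client.loadModel(tessdataBestUrl(lang));
  await client.loadImage(image);
  return client.getText();
}
```

In a browser this might be called as `await recognizeWith(new OCRClient(), "jpn", imageBitmap)`, with `OCRClient` imported from `tesseract-wasm`. Whether the recognized Japanese text comes out correctly is exactly the part that has not been tested.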