naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Error: Network error while fetching #923

Closed jonil3400 closed 5 months ago

jonil3400 commented 6 months ago

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)
"tesseract.js": "^5.1.0"

Describe the bug
Running createWorker with the tgl language results in an error.

Uncaught Error: Error: Network error while fetching https://cdn.jsdelivr.net/npm/@tesseract.js-data/TGL/4.0.0_best_int/TGL.traineddata.gz. Response code: 404
    at createWorker.js:247:1
    at worker.onmessage (onMessage.js:3:1)

To Reproduce
await createWorker(["eng", "TGL"]);

Expected behavior
The TGL language can be used.

Device Version: Windows 11, Chrome, Node 18.15

Balearica commented 6 months ago

The language data used by Tesseract.js by default is stored in this repo. We do not actively manage or edit the default language data; we inherit it from the main Tesseract project.

Looking in this repo, it appears that no tgl (Tagalog) data exists for the LSTM model (the default). Your options for recognizing Tagalog are therefore the following.

  1. You could use the Legacy model (oem value 0), which does support this language.
    1. To do this, change the call to the following: await createWorker(["eng", "tgl"], 0)
  2. You could search online to see whether anybody has produced an LSTM Tagalog .traineddata file, or train one yourself, and then use that.
    1. You can make Tesseract.js use custom language data by setting the langPath argument.
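
For illustration, the two options might be sketched roughly as follows. The helper name tagalogWorkerArgs and its parameters are hypothetical; it only builds the argument list for createWorker(langs, oem, options). The legacyCore and legacyLang options are an assumption about what v5 needs in order to load the Legacy code and data, so check them against the documentation for your version. The langPath value shown in the usage note is a placeholder, not a real location.

```javascript
// Hypothetical helper: builds the arguments for createWorker(langs, oem, options).
// This is a sketch of the two options above, not an official Tesseract.js API.
function tagalogWorkerArgs({ useLegacy = true, langPath } = {}) {
  const langs = ["eng", "tgl"];

  if (useLegacy) {
    // Option 1: oem value 0 selects the Legacy model.
    // Assumption: v5 also needs these two flags to load the Legacy core and data.
    return [langs, 0, { legacyCore: true, legacyLang: true }];
  }

  // Option 2: LSTM model (oem value 1) with custom language data.
  if (!langPath) {
    throw new Error("langPath is required when using custom LSTM data");
  }
  // langPath applies to every requested language, so the directory it points to
  // must contain data for both eng and tgl.
  return [langs, 1, { langPath }];
}
```

Usage, assuming createWorker has been imported from tesseract.js:

const worker = await createWorker(...tagalogWorkerArgs());
const custom = await createWorker(...tagalogWorkerArgs({ useLegacy: false, langPath: "https://example.com/lang-data" }));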