Open Balearica opened 4 months ago
Converting *.traineddata.gz files to pure JavaScript could solve this issue. *.js files could be saved and synced anywhere.
@ivysrono There are 3 types of file loaded from the CDN by default for the browser (workerPath, langPath, corePath) and 1 type of file loaded from the CDN for Node.js (langPath). For all of these files, the JSDelivr CDN is simply the default value; you do not need to use it. All of these files can be hosted on your own site (for browser) or on the local file system (for Node.js), and workerPath, langPath, and corePath can be changed to point to those locations. This is explained in the following document.
https://github.com/naptha/tesseract.js/blob/master/docs/local-installation.md
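As a rough illustration of the self-hosting setup described above, the sketch below derives the three path options from a single base URL. The helper name selfHostedPaths and the directory layout (worker.min.js, core, lang under one folder) are assumptions made for this example, not something Tesseract.js prescribes; substitute the locations where you actually placed the files.

```javascript
// Hypothetical helper: derive the three Tesseract.js path options from one
// base URL. The sub-paths below are an assumed layout, not a requirement.
function selfHostedPaths(baseUrl) {
  const base = baseUrl.replace(/\/+$/, ''); // drop any trailing slashes
  return {
    workerPath: `${base}/worker.min.js`,
    corePath: `${base}/core`,
    langPath: `${base}/lang`,
  };
}

// Usage in a browser page where Tesseract is already loaded:
//   const worker = await Tesseract.createWorker('eng', 1, selfHostedPaths('/tesseract'));
```

Keeping the three paths in one place makes it harder to update one of them and forget the others when the files move.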
Some users run Tesseract.js in the browser without a site of their own, for example from a userscript: https://greasyfork.org/scripts/482236/code
If you do not want to host these files yourself, you can set workerPath/langPath/corePath to an alternative CDN. For example, you could try unpkg.
I agree that having a default CDN that does not support China is not ideal, and would be open to adding some fallback for the default CDN in the future. However, if individual developers want to be sure that mainland China is supported in their applications, I believe that the existing options in Tesseract.js do allow them to do that.
Does unpkg support langPath?
Here is a working example that uses unpkg for all 3 resources: corePath/langPath/workerPath.
const lang = 'eng';
const langPath = `https://unpkg.com/@tesseract.js-data/${lang}/4.0.0_best_int`;
// A worker is created once and used every time a user uploads a new file.
const worker = await Tesseract.createWorker(lang, 1, {
  corePath: 'https://unpkg.com/tesseract.js-core@v5',
  workerPath: 'https://unpkg.com/tesseract.js@v5/dist/worker.min.js',
  langPath: langPath,
  logger: function (m) { console.log(m); }
});
This loads the LSTM-only data, so it will only work with oem set to 1 (the default). To use the Legacy model, you would replace 4.0.0_best_int with 4.0.0. That data is significantly larger, so do not do that unless you are actually using the Legacy model.
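To make the choice between the two data sets explicit, here is a minimal sketch that builds the unpkg langPath from the URL pattern in the example above. The helper name unpkgLangPath and its legacy flag are made up for this illustration; only the two directory names (4.0.0_best_int and 4.0.0) come from the example.

```javascript
// Hypothetical helper: build the unpkg langPath for a language, following the
// URL pattern used in the example above.
// legacy = false -> LSTM-only data (works with oem 1, the default)
// legacy = true  -> larger data set that also supports the Legacy model
function unpkgLangPath(lang, legacy = false) {
  const dataDir = legacy ? '4.0.0' : '4.0.0_best_int';
  return `https://unpkg.com/@tesseract.js-data/${lang}/${dataDir}`;
}
```

Defaulting to the LSTM-only data keeps the download small unless the caller explicitly opts in to the Legacy model.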
Thank you very much, I will try.
@Balearica - just wanted to say thank you for saving me some time with that comment. Cheers!
Chinese users are reporting that JSDelivr, the CDN used by default for corePath/langPath/workerPath, does not work in China. I checked the JSDelivr issues, and the maintainers appear to have given up on supporting China.
Therefore, the only option would be to add a fallback CDN. I do not want to switch CDNs entirely unless there is an option that will be unequivocally better than JSDelivr for all users, globally. This is because other services that people report currently work in China (specifically, GitHub Pages and unpkg) have previously caused us to receive complaints due to outages. Additionally, it's worth noting that unless a CDN specifically claims to have a relationship with the Chinese government, the fact that it currently works in China does not guarantee it will work in the future. JSDelivr claimed to support China when we started using it.