naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0
34.09k stars 2.15k forks

JSDelivr CDN not accessible in China #899

Open Balearica opened 4 months ago

Balearica commented 4 months ago

Chinese users are reporting that JSDelivr--the CDN used by default for corePath/langPath/workerPath--does not work in China.

I checked the JSDelivr issues, and the maintainers appear to have given up on supporting China.

> We had an ICP license until it was revoked for no stated reason. Getting a new one is basically impossible for us.

https://github.com/jsdelivr/jsdelivr/issues/18407#issuecomment-1153811957

> - We won't transfer our domain to a Chinese domain registrar. This breaks most options to get an ICP license.
> - We don't want to create a second domain for Chinese traffic, like cdn.jsdelivr-cn.com. It will completely miss the point of a single unified and global service. Anyone can just mirror jsDelivr to do exactly that. This breaks most other options.
> - We don't have the money or resources to hire Chinese law firms to establish Chinese corporations to run our free service.
> - We plan to update our website in the new redesign to note the revoked ICP license.
> - We are willing to block any content we need to comply with the local law but we don't know how to get the infringing URLs.

https://github.com/jsdelivr/jsdelivr/issues/18407#issuecomment-1154097518

Therefore, the only option would be to add a fallback CDN. I do not want to switch CDNs entirely unless there is an option that will be unequivocally better than JSDelivr for all users, globally. This is because other services that people report currently work in China have previously caused us to receive complaints due to outages (specifically, GitHub Pages and unpkg).

Additionally, it's worth noting that unless any CDN specifically claims to have a relationship with the Chinese government, the fact that it currently works in China does not guarantee it will work in the future. JSDelivr claimed to support China when we started using it.

ivysrono commented 4 months ago

Converting the *.traineddata.gz files to pure JavaScript could solve this issue.

*.js files can be saved and synced anywhere.

Balearica commented 4 months ago

@ivysrono There are 3 types of files loaded from the CDN by default in the browser (workerPath, langPath, corePath) and 1 type of file loaded from the CDN in Node.js (langPath). For all of these files, the JSDelivr CDN is simply the default value; you do not need to use it. All of these files can be hosted on your own site (for browser) or on the local file system (for Node.js), and workerPath, langPath, and corePath can be changed to point to those locations. This is explained in the following document.

https://github.com/naptha/tesseract.js/blob/master/docs/local-installation.md
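As a rough illustration of the self-hosting option, here is a minimal sketch of a helper that builds the three path options from a single base URL. The helper name `selfHostedPaths` and the directory names (`core`, `lang-data`) are assumptions made for this example, not something Tesseract.js requires; use whatever layout you actually deploy.

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Builds createWorker options for files served from your own site.
// The directory names under `baseUrl` are assumptions for illustration.
function selfHostedPaths(baseUrl) {
  const root = baseUrl.replace(/\/+$/, ''); // drop any trailing slashes
  return {
    workerPath: `${root}/worker.min.js`, // copy of the tesseract.js worker script
    corePath: `${root}/core`,            // directory holding the tesseract.js-core files
    langPath: `${root}/lang-data`,       // directory holding the *.traineddata.gz files
  };
}

// Usage (browser, inside a module or async function):
// const worker = await Tesseract.createWorker('eng', 1, selfHostedPaths('/tesseract'));
```

Keeping the three paths behind one function makes it easy to move the files later without touching the rest of the application.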

ivysrono commented 4 months ago

Users may run Tesseract.js in a browser without having a site of their own, for example in a userscript: https://greasyfork.org/scripts/482236/code

Balearica commented 4 months ago

If you do not want to host these files yourself, you can set workerPath/langPath/corePath to an alternative CDN. For example, you could try unpkg.

I agree that having a default CDN that does not support China is not ideal, and would be open to adding some fallback for the default CDN in the future. However, if individual developers want to be sure that mainland China is supported in their applications, I believe that the existing options in Tesseract.js do allow them to do that.
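Until a built-in fallback exists, an application could approximate one itself by probing a list of CDNs and using the first that responds. The following is only a sketch under stated assumptions: `probeUrl` and `pickCdnBase` are hypothetical names, the 3-second timeout is arbitrary, the runtime is assumed to provide global `fetch` and `AbortController`, and each candidate is assumed to serve npm packages under the same `<base>/<package>@<version>/...` layout.

```javascript
// Hypothetical helpers, not part of Tesseract.js.

// Resolves to true if a HEAD request to `url` succeeds within `timeoutMs`.
async function probeUrl(url, timeoutMs = 3000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { method: 'HEAD', signal: controller.signal });
    return res.ok;
  } catch (err) {
    return false; // network error, blocked host, or timeout
  } finally {
    clearTimeout(timer);
  }
}

// Tries each candidate base URL in order and resolves to the first one whose
// worker script is reachable, or to null if none respond.
async function pickCdnBase(candidates, probe = probeUrl) {
  for (const base of candidates) {
    if (await probe(`${base}/tesseract.js@v5/dist/worker.min.js`)) {
      return base;
    }
  }
  return null;
}

// Usage (inside a module or async function):
// const base = await pickCdnBase(['https://cdn.jsdelivr.net/npm', 'https://unpkg.com']);
// if (base) {
//   const worker = await Tesseract.createWorker('eng', 1, {
//     workerPath: `${base}/tesseract.js@v5/dist/worker.min.js`,
//     corePath: `${base}/tesseract.js-core@v5`,
//   });
// }
```

The probe adds one extra request before the worker starts, and a CDN that answers the probe can still fail later, so this reduces the problem rather than eliminating it.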

ivysrono commented 4 months ago

Does unpkg support langPath?

Balearica commented 4 months ago

> unpkg support langPath?

Here is a working example that uses unpkg for all 3 resources: corePath/langPath/workerPath.

  const lang = 'eng';
  // LSTM-only language data for `lang`, served from unpkg.
  const langPath = `https://unpkg.com/@tesseract.js-data/${lang}/4.0.0_best_int`;

  // A worker is created once and used every time a user uploads a new file.
  // The second argument is the OCR engine mode (oem); 1 selects LSTM only.
  const worker = await Tesseract.createWorker(lang, 1, {
    corePath: 'https://unpkg.com/tesseract.js-core@v5',
    workerPath: 'https://unpkg.com/tesseract.js@v5/dist/worker.min.js',
    langPath: langPath,
    logger: function(m){console.log(m);}
  });

This loads the LSTM-only data, so it will only work with oem set to 1 (the default). To use the Legacy model, replace 4.0.0_best_int with 4.0.0. That data is significantly larger, so do not switch unless you are actually using the Legacy model.
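The choice between the two data directories can be captured in a small helper. This is a sketch only: `unpkgLangPath` is a hypothetical name, it follows the unpkg URL layout from the example above, and it assumes Tesseract's usual engine mode numbering, where 0 is Legacy only and 2 is Legacy plus LSTM.

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Picks the unpkg data directory that matches the requested engine mode (oem).
//   oem 0 or 2 (modes that use the Legacy model) -> '4.0.0' (larger download)
//   anything else, including the default oem 1   -> '4.0.0_best_int' (LSTM only)
function unpkgLangPath(lang, oem = 1) {
  const needsLegacy = oem === 0 || oem === 2;
  const dataDir = needsLegacy ? '4.0.0' : '4.0.0_best_int';
  return `https://unpkg.com/@tesseract.js-data/${lang}/${dataDir}`;
}

// Usage:
// const langPath = unpkgLangPath('eng');    // LSTM-only data
// const legacyPath = unpkgLangPath('eng', 0); // data that includes the Legacy model
```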

ivysrono commented 4 months ago

Thank you very much, I will try.

MarketingPip commented 3 months ago

@Balearica - just wanted to say thank you for saving me some time with that comment. Cheers 👍