naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0
34.08k stars 2.15k forks source link

createWorker throws exception with option.langPath set in electron #925

Open HQidea opened 1 month ago

HQidea commented 1 month ago

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo) "tesseract.js": "^5.1.0", Describe the bug When using tesseract.js in the electron environment, setting option.langPath to a local file path, the createWorker function results in an error. The following code snippet and error message demonstrate the issue:

const worker: Promise<Tesseract.Worker> = createWorker("chi_sim", 1, {
  langPath: path.join(__dirname, "../../resources/ocr")
});

error:

Error: TypeError: Only absolute URLs are supported
    at Worker.<anonymous> (/path/to/node_modules/.pnpm/tesseract.js@5.1.0/node_modules/tesseract.js/src/createWorker.js:248:15)
    at Worker.emit (node:events:518:28)
    at Worker.emit (node:domain:488:12)
    at MessagePort.<anonymous> (node:internal/worker:262:53)
    at [nodejs.internal.kHybridDispatch] (node:internal/event_target:826:20)
    at exports.emitMessage (node:internal/per_context/messageport:23:28)

Upon examining the code, I found that the condition below appears to be arbitrary since in the electron environment, it should be able to load a language file from a local path as well.

https://github.com/naptha/tesseract.js/blob/83df363b848968402ad38f66f6919e775b483f61/src/worker-script/index.js#L131-L136

Expected behavior It should be able to load a language file from a local path in the electron environment.

Balearica commented 1 month ago

I would agree that the Node.js version should work similarly regardless of whether or not it is being run in Electron, so this condition should be patched if that is not the current behavior.

As an aside, is there a compelling reason to use the Node.js version of Tesseract.js rather than the browser version in an Electron application? My understanding was that Electron supports both, however it was considered a security best practice to avoid giving code access to the filesystem with Node.js integration where possible. However, I have minimal experience with Electron so honestly do not understand the tradeoffs well. The browser version of Tesseract.js can definitely be run using entirely offline resources, as seen in this example repo.

HQidea commented 1 month ago

@Balearica Thx for reply. In Electron, there are two types of processes: main process and render process. As I understand it, what you mentioned and the example you provided is using Tesseract.js in the render process. Upon inspecting the code, I found that the environment is 'webworker' and it utilizes fetch to load language files.

However, in my scenario, I need to run Tesseract.js in the main process after receiving an IPC event from the render process, where the environment is detected as 'electron', and it uses 'node-fetch' to load language files, which does not support local file paths.

Balearica commented 1 month ago

@HQidea Can you make a minimal example repo that can be cloned to replicate this issue? I ran the snippet provided in the OP within the main.js file in the example repo I posted (rather than the index-offline.html file), however everything still ran as expected. Therefore, there is presumably additional context required to replicate this.

HQidea commented 1 month ago

@Balearica this is the repo that can reproduce the error.

I also tried to replicate the error using the repo you provided. However the results are inconsistent. Sometimes the environment in the worker thread is tested as 'electron', and sometimes as 'node'. When the environment is tested as 'node', everything goes well. I'm not sure why.

Balearica commented 1 month ago

@HQidea I was able to reproduce this issue using your repo, and was able to confirm that simply commenting out the special isElectron case within getEnvironment resolved the issue.

https://github.com/naptha/tesseract.js/blob/83df363b848968402ad38f66f6919e775b483f61/src/utils/getEnvironment.js#L8-L9

I am hesitant to edit something without fully understanding what it does, so I reviewed the various issues/PRs regarding why a special case for Electron was introduced to begin with. These are linked below.

  1. 376 (related commit b8aba2edddf41207fc0d04c067c8ba69e8a7ec36)

  2. 474

  3. 489 (related issues #449, #465)

  4. 497 (related PR #498)

After reviewing these discussions, I believe Electron was added as a special case because people wanted to use Node.js in an Electron render process. You do not need special logic to use the Node.js version of Tesseract.js in the main Electron process, and you don't need a special logic to use the browser version of Tesseract.js in the render Electron processes. The only "special case" introduced by Electron is the theoretical ability to run Node.js code in what appears to be a browser context (window.document is defined).

While my understanding of Electron is limited, my initial thought is that this special case is not something we need to support. Using Node.js in a renderer process appears to be strongly advised against in the Electron docs. Additionally, the browser version of Tesseract.js works perfectly well in Electron renderer processes, and is just as fast. If this understanding is correct, and there is not a compelling reason to (attempt to) support using the Node.js version in a renderer process, then we could just cut out the Electron-specific stuff entirely, which should solve your issue.

HQidea commented 1 month ago

@Balearica I totally agree with you on not using Node.js in a renderer process.

Balearica commented 1 week ago

It's unclear to me whether this could break any existing code, so my inclination is to combine this with any other planned breaking changes and release as v6.