naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Worker stuck on "loading language traineddata" #901

Closed laurent22 closed 1 week ago

laurent22 commented 3 months ago

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)

5.0.4

Describe the bug

This is the same issue as https://github.com/naptha/tesseract.js/issues/414, which should have been addressed by the errorHandler property, but apparently not in all cases. I'm using Tesseract.js with Electron, and it gets stuck at the message { workerId: "Worker-0-ac418", status: "loading language traineddata", progress: 0 }

I set the errorHandler property but it's never triggered.

Using the "lazy fox" default image.

The fix mentioned in the other issue, setting cacheMethod: 'none', works here too, but I'd rather keep the cache enabled, since downloading 10 MB every time wouldn't make sense.

Edit:

I've just discovered that Tesseract.js has a second way to log, Tesseract.setLogging, so I enabled it with Tesseract.setLogging(true), but it didn't help. It just prints [Worker-0-e9fc5]: Start Job-1-4ae93, action=loadLanguage followed by the dreaded loading language traineddata message.
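
Since neither errorHandler nor the logger reports anything when the worker hangs, one way to at least detect the stall is to race worker creation against a timer. The following is a minimal sketch, not a fix for the underlying bug: withTimeout is a hypothetical helper (it is not part of Tesseract.js), the 30-second limit is an arbitrary assumption, and the commented usage assumes the Tesseract.js v5 createWorker(langs, oem, options) call with the logger, errorHandler and cacheMethod options discussed in this issue.

```javascript
// Hypothetical helper (not part of Tesseract.js): reject if `promise` has not
// settled within `ms` milliseconds, so a silent hang becomes a visible error.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch, assuming Tesseract.js v5 is installed (run inside an async function):
//
// const { createWorker } = require('tesseract.js');
// const worker = await withTimeout(
//   createWorker('eng', 1, {
//     logger: (m) => console.log(m),
//     errorHandler: (err) => console.error(err),
//     // cacheMethod: 'none', // the workaround from this issue
//   }),
//   30000, // arbitrary limit, chosen for illustration only
//   'createWorker',
// );
```

Note that the timeout only rejects the promise; it has no handle on the stuck worker, so the worker itself is not terminated.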

Device Version:

Balearica commented 3 months ago

Was this a one-time thing that was resolved once you deleted/refreshed the cache data, or can it be replicated? If it can be replicated, please provide a reproducible example.

laurent22 commented 3 months ago

I couldn't find where it stores the cache, and setting langPath didn't seem to have any effect. Where can I find the cache data? For now I have disabled the cache, but if I enable it again I think the problem will come back, and then I can share the cached files so that the bug can be replicated.

Balearica commented 3 months ago

Files are cached at ${cachePath}/${lang}.traineddata, where cachePath is determined by the cachePath argument (. by default). For the browser version of Tesseract.js the file is cached in IndexedDB; for the Node.js version it is cached on the local file system.
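
For the Node.js case, a rough sketch of how to locate and inspect the cached file, assuming the path format described above. cacheKey and describeCachedFile are hypothetical helper names, not Tesseract.js functions, and a relative cachePath is resolved against the process's current working directory, which in an Electron app may not be the project folder.

```javascript
const fs = require('fs');
const path = require('path');

// Assumption taken from the description above:
// the cache entry is `${cachePath}/${lang}.traineddata`, with cachePath '.' by default.
function cacheKey(lang, cachePath = '.') {
  return `${cachePath}/${lang}.traineddata`;
}

// Hypothetical helper: report where the cached file would live on disk
// and whether it exists, without modifying anything.
function describeCachedFile(lang, cachePath = '.') {
  const file = path.resolve(cacheKey(lang, cachePath));
  if (!fs.existsSync(file)) {
    return { file, exists: false, bytes: 0 };
  }
  return { file, exists: true, bytes: fs.statSync(file).size };
}

console.log(describeCachedFile('eng'));
```

An empty or unexpectedly small file could hint at an incomplete download, although nothing in this thread confirms that as the cause.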

For the browser case, the following snippet downloads eng.traineddata from IndexedDB. It must be run from the devtools console on a website that has previously saved eng.traineddata to the cache.

(async () => {
    // Open a connection to the database
    const openRequest = indexedDB.open('keyval-store');

    const db = await new Promise((resolve, reject) => {
        openRequest.onerror = () => reject(openRequest.error);
        openRequest.onsuccess = () => resolve(openRequest.result);
    });

    // Start a read-only transaction and get the object store
    const transaction = db.transaction(['keyval'], 'readonly');
    const store = transaction.objectStore('keyval');

    // Use the key to look up the cached file contents
    const getRequest = store.get('./eng.traineddata');

    const data = await new Promise((resolve, reject) => {
        getRequest.onerror = () => reject(getRequest.error);
        getRequest.onsuccess = () => resolve(getRequest.result);
    });

    // A missing key yields undefined; stop instead of downloading a bogus file
    if (data === undefined) {
        console.error('./eng.traineddata was not found in the cache');
        return;
    }

    // Wrap the cached data in a Blob so it can be downloaded
    const blob = new Blob([data], {type: 'application/octet-stream'});

    // Create a URL for the blob
    const url = URL.createObjectURL(blob);

    // Create a temporary anchor element to trigger download
    const a = document.createElement('a');
    a.href = url;
    a.download = 'eng.traineddata';
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);

    // Revoke the blob URL after download
    URL.revokeObjectURL(url);
})();
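
To check whether the hang goes away once the cached data is refreshed, the entry can also be deleted so that the next load downloads it again. This is a sketch under the same assumptions as the snippet above (the keyval-store database, the keyval object store, and the ${cachePath}/${lang}.traineddata key). deleteCachedLanguage is a hypothetical name, and its third parameter exists only so that a different IndexedDB implementation can be passed in.

```javascript
// Sketch: remove one cached language file from IndexedDB.
// Assumes the database name, store name and key format used in the snippet above.
function deleteCachedLanguage(lang, cachePath = '.', idb = globalThis.indexedDB) {
  return new Promise((resolve, reject) => {
    const openRequest = idb.open('keyval-store');
    openRequest.onerror = () => reject(openRequest.error);
    openRequest.onsuccess = () => {
      const db = openRequest.result;
      try {
        // A read-write transaction is required for delete()
        const transaction = db.transaction(['keyval'], 'readwrite');
        transaction.objectStore('keyval').delete(`${cachePath}/${lang}.traineddata`);
        transaction.oncomplete = () => { db.close(); resolve(); };
        transaction.onerror = () => { db.close(); reject(transaction.error); };
        transaction.onabort = () => { db.close(); reject(transaction.error); };
      } catch (err) {
        // For example, the 'keyval' store does not exist on this origin
        db.close();
        reject(err);
      }
    };
  });
}

// Usage, from the devtools console of the affected page:
// deleteCachedLanguage('eng').then(() => console.log('cache entry removed'));
```

Deleting a key that is not present completes without an error, so a successful run does not by itself prove the file was cached.
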

Balearica commented 1 month ago

@laurent22 To follow up, were you ever able to reproduce this issue reliably and/or figure out what you think the root cause is?