Closed DR7777 closed 7 months ago
Thanks for reporting. I was able to replicate. A minimal reproducible example is below.
const worker = await Tesseract.createWorker("eng");
for (let i = 0; i < 5; i++) {
  // Not awaited: all five jobs are sent to the same worker at once.
  worker.recognize("https://raw.githubusercontent.com/naptha/tesseract.js/master/benchmarks/data/meditations.jpg").then((ret) => {
    console.log(ret.data.text);
  });
}
This appears to occur when calling worker.recognize multiple times without waiting for the previous recognition jobs to finish. In my example code above, changing worker.recognize to await worker.recognize makes each call wait for the previous one to complete before the next one starts, which resolves the issue. Therefore, the quick fix to your problem is to always wait until worker.recognize is done running before calling worker.recognize again.
Regarding why this does not happen with Tesseract.recognize: that function creates a new worker every time it is called, so the same worker never gets two jobs. However, this causes other issues. Creating a new worker for every job is inefficient, and running Tesseract.recognize in parallel can spawn an uncontrollably large number of workers, which can cause crashes due to resource constraints.
If you want to run jobs in parallel, the best way to handle this is using schedulers. Schedulers allow for using a defined number of workers in parallel to process jobs. Schedulers are explained here, and an example using schedulers is here.
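As a rough illustration of the scheduler approach, here is a minimal sketch. It assumes the Tesseract.js v5 scheduler calls createScheduler, addWorker, addJob and terminate; the helper name recognizeWithScheduler and the workerCount parameter are made up for this example, and the tesseract.js module is passed in as an argument rather than imported.

```javascript
// Sketch: run recognition jobs on a fixed-size pool of workers.
// `recognizeWithScheduler` and `workerCount` are hypothetical names;
// `tesseract` is the imported tesseract.js module, passed in by the caller.
async function recognizeWithScheduler(tesseract, urls, workerCount = 2) {
  const scheduler = tesseract.createScheduler();
  try {
    // Create a bounded number of workers up front.
    for (let i = 0; i < workerCount; i++) {
      const worker = await tesseract.createWorker("eng");
      scheduler.addWorker(worker);
    }
    // Queue every job at once; the scheduler hands each job to an idle worker.
    const results = await Promise.all(
      urls.map((url) => scheduler.addJob("recognize", url))
    );
    return results.map((ret) => ret.data.text);
  } finally {
    // Terminates every worker that was added to the scheduler.
    await scheduler.terminate();
  }
}
```

It could be called as recognizeWithScheduler(Tesseract, urls, 4). The point of the design is that the number of workers is chosen once and stays fixed, no matter how many URLs are queued.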
Thanks for getting back so quick!
Actually I was awaiting the result. I just tried again like this and unfortunately I just keep waiting and waiting for the result of the second image...
const worker = await Tesseract.createWorker("eng");
const promises = urls.map(async (url) => {
  try {
    const result = await worker.recognize(url);
    console.log("result", result);
    progress.current += 1;
    options.onProgress && options.onProgress(progress);
    return {
      text: result.data.text,
      boxes: result.data.words.map((word) => ({
        text: word.text,
        box: word.bbox,
      })),
    };
  } catch (error) {
    console.error("Error processing URL:", url, error);
    return null;
  }
});
const results = await Promise.all(promises);
I also tried this method and it works fine and is quite fast as well. But I assume starting a new worker and terminating it is quite inefficient? Even if I terminate it every time?
On my machine it works quite fast though:
const promises = urls.map(async (url) => {
  try {
    const worker = await Tesseract.createWorker("eng");
    const result = await worker.recognize(url);
    console.log("result", result);
    progress.current += 1;
    options.onProgress && options.onProgress(progress);
    await worker.terminate();
    return {
      text: result.data.text,
      boxes: result.data.words.map((word) => ({
        text: word.text,
        box: word.bbox,
      })),
    };
  } catch (error) {
    console.error("Error processing URL:", url, error);
    return null;
  }
});
const results = await Promise.all(promises);
> Actually I was awaiting the result.
When you use map to call an async function, map does not wait for one iteration to finish before starting the next. Therefore, your code is still calling worker.recognize without waiting for the previous job to complete. This can be observed in the following example.
const wait1Second = () => new Promise((resolve) => {
  setTimeout(() => {
    console.log('1 second has passed');
    resolve(true);
  }, 1000);
});

// console.timeEnd is run after ~5 seconds
console.time();
for (let i = 0; i < 5; i++) {
  await wait1Second();
}
console.timeEnd();

// console.timeEnd runs almost immediately
console.time();
[1, 2, 3, 4, 5].map(async () => {
  await wait1Second();
});
console.timeEnd();
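One way to apply this to a single shared worker is to replace map with a plain loop, so each job finishes before the next one starts. The sketch below is only an illustration: recognizeSequentially is a hypothetical helper name, and worker is assumed to be the object returned by Tesseract.createWorker.

```javascript
// Sketch: process URLs one at a time on a single worker.
// `recognizeSequentially` is a hypothetical helper name; `worker` is assumed
// to be the object returned by Tesseract.createWorker.
async function recognizeSequentially(worker, urls) {
  const results = [];
  for (const url of urls) {
    try {
      // The loop does not advance until this job has resolved.
      const result = await worker.recognize(url);
      results.push({
        text: result.data.text,
        boxes: result.data.words.map((word) => ({
          text: word.text,
          box: word.bbox,
        })),
      });
    } catch (error) {
      console.error("Error processing URL:", url, error);
      results.push(null);
    }
  }
  return results;
}
```

This keeps one worker and never sends it two jobs at once, at the cost of no parallelism.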
> I also tried this method and it works fine and is quite fast as well.
This is fast because your application is running recognition on different workers in parallel: if you have 5 URLs, it creates 5 workers that run at the same time. As noted in my comment above, this is a problematic way to implement parallel processing because (among other things) the number of Tesseract.js workers it creates is unbounded: you can end up creating 12 workers at the same time and crashing your application.
If you want to run multiple workers at the same time (which is the most efficient way to do things), you should implement a scheduler using the resources I linked in my previous comment.
The root cause of this bug is that Tesseract.js workers only store one promise for each action at a time. This can be seen in the code linked below. If there is an unresolved promise with action recognize, and a new promise is created with action recognize, the new promise will overwrite the old promise.
https://github.com/naptha/tesseract.js/blob/master/src/createWorker.js#L51-L71
This causes the first recognize job that finishes to resolve the last promise created with action recognize. The remaining jobs still run but do not return anything to the user, as the promise they are trying to resolve has already been resolved.
This should be easily fixable by making the identifier for all promises unique, which can be achieved by appending jobId. I will do this in the next version of Tesseract.js. However, even after this is fixed, I still do not recommend sending multiple recognize jobs to the same worker at the same time. As described above, using a scheduler is the most performant way to handle executing multiple jobs in parallel. Additionally, given that jobs have side effects and there is no queue system built into Tesseract.js workers, it may still be possible for sending a bunch of jobs to a single worker at the same time to cause issues.
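The overwrite and the proposed fix can be illustrated with a small stand-alone model. This is not the Tesseract.js source: createModelWorker, startJob, demo and the key format are all hypothetical, and timers stand in for real recognition jobs.

```javascript
// Simplified model of a worker that tracks pending promises in a keyed table.
// This is NOT the Tesseract.js source; every name here is hypothetical.
function createModelWorker(makeKey) {
  const pending = {}; // key -> resolve function of the waiting promise
  let nextJobId = 0;

  function startJob(action, payload, delayMs) {
    const jobId = nextJobId++;
    return new Promise((resolve) => {
      // If another job already used this key, its resolver is overwritten here.
      pending[makeKey(action, jobId)] = resolve;
      // Simulate the job finishing later and reporting its result.
      setTimeout(() => pending[makeKey(action, jobId)](payload), delayMs);
    });
  }

  return { startJob };
}

// Start two overlapping jobs and record which promises settle, and with what.
async function demo(makeKey) {
  const worker = createModelWorker(makeKey);
  const settled = [];
  worker.startJob("recognize", "result A", 10).then((v) => settled.push("promise A got " + v));
  worker.startJob("recognize", "result B", 20).then((v) => settled.push("promise B got " + v));
  await new Promise((r) => setTimeout(r, 60));
  return settled;
}

const keyByAction = (action) => action;                            // one slot per action
const keyByActionAndJob = (action, jobId) => `${action}-${jobId}`; // unique per job
```

With keyByAction, demo resolves to ["promise B got result A"]: job A finishes first but resolves the last stored promise, and promise A never settles. With keyByActionAndJob, it resolves to ["promise A got result A", "promise B got result B"].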
The fix described above has been implemented in the master branch and will be included in the next patch release (v5.0.5).
Thanks a lot for implementing it that quick! Appreciate it!
I am using schedulers now.
Hello!
I have just set up the OCR function and it seems to work if I use this (old deprecated):
const result = await Tesseract.recognize(url, "eng");
But does not work if I use this (new):
const result = await worker.recognize(url);
When I try to run it on a document, the "new one" just randomly stops at certain pages / images and never finishes, without throwing an error.
See full code below. As I understood from the migration guide this is the only line I had to change?!
Thanks so much for maintaining this library!