naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

http://tesseract.projectnaptha.com/

Apache License 2.0

worker.recognize fails when run before previous job finishes #875

Closed: DR7777 closed this issue 7 months ago

DR7777 commented 8 months ago

Hello!

I have just set up the OCR function, and it works if I use this (old, deprecated): const result = await Tesseract.recognize(url, "eng");

But it does not work if I use this (new): const result = await worker.recognize(url);

When I run it on a document, the new version randomly stops at certain pages / images and never finishes, without throwing an error.

See full code below. As I understood from the migration guide this is the only line I had to change?!

import Tesseract from "tesseract.js";

type OcrOnImagesOptions = {
  onProgress?: (progress: { current: number; total: number }) => void;
  onStart?: (progress: { current: 0; total: number }) => void;
};

const OcrOnImages = async (
  urls: string[],
  options?: OcrOnImagesOptions 
): Promise<
  Record<
    string,
    { text: string; boxes: Array<{ text: string; box: Tesseract.Box }> }
  >
> => {
  options?.onStart?.({ current: 0, total: urls.length });
  const progress = { total: urls.length, current: 0 };

  const worker = await Tesseract.createWorker("eng");

  const promises = urls.map(async (url) => {
    try {
      // THIS DOES NOT WORK
      const result = await worker.recognize(url);
      console.log("result", result);

      // THIS WORKS PERFECTLY
      // const result = await Tesseract.recognize(url, "eng");
      // console.log("result1", result);

      progress.current += 1;
      options?.onProgress?.(progress);

      return {
        text: result.data.text,
        boxes: result.data.words.map((word) => ({
          text: word.text,
          box: word.bbox,
        })),
      };
    } catch (error) {
      console.error("Error processing URL:", url, error);
      return null;
    }
  });

  const results = await Promise.all(promises);

  return results.reduce((acc, data, index) => {
    return { ...acc, [index + 1]: data };
  }, {});
};

export default OcrOnImages;

Thanks so much for maintaining this library!

Balearica commented 8 months ago

Thanks for reporting. I was able to replicate. A minimal reproducible example is below.

const worker = await Tesseract.createWorker("eng");

for (let i=0; i<5; i++) {
  worker.recognize("https://raw.githubusercontent.com/naptha/tesseract.js/master/benchmarks/data/meditations.jpg").then(ret => {
    console.log(ret.data.text);
  })
}

This appears to occur when calling worker.recognize multiple times without waiting for the previous recognition jobs to finish. In my example code above, changing worker.recognize to await worker.recognize makes each call complete before the next one starts, which resolves the issue. Therefore, the quick fix to your problem is to always wait until worker.recognize has finished before calling worker.recognize again.

Regarding why this does not happen with Tesseract.recognize--that function creates a new worker every time it is called, so the same worker never gets two jobs. However, this causes other issues. Creating a new worker for every job is inefficient, and running Tesseract.recognize in parallel can spawn an uncontrollably large number of workers, which can cause crashes due to resource constraints.

If you want to run jobs in parallel, the best way to handle this is using schedulers. Schedulers allow for using a defined number of workers in parallel to process jobs. Schedulers are explained here, and an example using schedulers is here.
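As a rough sketch of the scheduler approach (not taken from the linked resources), the function below queues one recognize job per URL and collects the text. `SchedulerLike` and `ocrWithScheduler` are hypothetical names introduced here; the interface only mirrors the `addJob("recognize", image)` and `terminate()` calls of a Tesseract.js scheduler, so the sketch can be read and run without the library installed.

```typescript
// Hypothetical minimal shape of a scheduler: only the two calls this sketch uses.
interface SchedulerLike {
  addJob(action: "recognize", image: string): Promise<{ data: { text: string } }>;
  terminate(): Promise<unknown>;
}

// Queue one recognize job per URL. The scheduler hands each job to one of its
// workers as it becomes free, so no worker is given two jobs at once.
async function ocrWithScheduler(
  scheduler: SchedulerLike,
  urls: string[]
): Promise<string[]> {
  try {
    const results = await Promise.all(
      urls.map((url) => scheduler.addJob("recognize", url))
    );
    return results.map((result) => result.data.text);
  } finally {
    // Shut down the scheduler (and its workers) whether or not the jobs succeeded.
    await scheduler.terminate();
  }
}
```

With Tesseract.js itself, the scheduler would presumably be built with `Tesseract.createScheduler()` and given a fixed number of workers through `scheduler.addWorker(await Tesseract.createWorker("eng"))` before being passed in; the number of workers added is what caps the parallelism.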

DR7777 commented 8 months ago

Thanks for getting back so quick!

Actually I was awaiting the result. I just tried again like this and unfortunately I just keep waiting and waiting for the result of the second image...

const worker = await Tesseract.createWorker("eng");

  const promises = urls.map(async (url) => {
    try {
      const result = await worker.recognize(url);
      console.log("result", result);

      progress.current += 1;
      options?.onProgress?.(progress);

      return {
        text: result.data.text,
        boxes: result.data.words.map((word) => ({
          text: word.text,
          box: word.bbox,
        })),
      };
    } catch (error) {
      console.error("Error processing URL:", url, error);
      return null; 
    }
  });

  const results = await Promise.all(promises);

I also tried this method and it works fine and is quite fast as well. But I assume starting a new worker for every image is quite inefficient, even if I terminate it every time?

On my machine it works quite fast though:

  const promises = urls.map(async (url) => {
    try {
      const worker = await Tesseract.createWorker("eng");
      const result = await worker.recognize(url);
      console.log("result", result);

      progress.current += 1;
      options?.onProgress?.(progress);

      await worker.terminate();

      return {
        text: result.data.text,
        boxes: result.data.words.map((word) => ({
          text: word.text,
          box: word.bbox,
        })),
      };
    } catch (error) {
      console.error("Error processing URL:", url, error);
      return null; 
    }
  });

  const results = await Promise.all(promises);

Balearica commented 8 months ago

> Actually I was awaiting the result.

When you use map to call an async function, map does not wait for one iteration to finish before starting the next. Therefore, your code is still calling worker.recognize without waiting for the previous job to complete. This can be observed in the following example.

const wait1Second = () => new Promise((resolve) => {
  setTimeout(() => {
    console.log('1 second has passed');
    resolve(true);
  }, 1000);
});

// console.timeEnd is run after ~5 seconds
console.time();
for (let i = 0; i < 5; i++) {
  await wait1Second();
}
console.timeEnd();

// console.timeEnd runs almost immediately
console.time();
[1, 2, 3, 4, 5].map(async () => {
  await wait1Second();
});
console.timeEnd();
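Applied to the code in this issue, a minimal sketch of the sequential version replaces map with a for...of loop. `recognizeSequentially` and its `recognize` parameter are hypothetical names; `recognize` stands in for a call such as `(url) => worker.recognize(url)`.

```typescript
// Hypothetical stand-in type for a call such as (url) => worker.recognize(url).
type Recognize = (url: string) => Promise<{ data: { text: string } }>;

// Process URLs one at a time: each job starts only after the previous one
// has settled, so a single worker never has two jobs in flight.
async function recognizeSequentially(
  urls: string[],
  recognize: Recognize
): Promise<Array<string | null>> {
  const texts: Array<string | null> = [];
  for (const url of urls) {
    try {
      const result = await recognize(url);
      texts.push(result.data.text);
    } catch (error) {
      // Mirror the original code: log the failure and record null for this URL.
      console.error("Error processing URL:", url, error);
      texts.push(null);
    }
  }
  return texts;
}
```

This trades speed for safety: total time is roughly the sum of the individual jobs, which is why a scheduler is the better fit when throughput matters.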

> I also tried this method and it works fine and is quite fast aswell.

The reason this is fast is that your application is running recognition on different workers in parallel--if you have 5 URLs then it creates 5 workers that run at the same time. As noted in my comment above, this is a problematic way to implement parallel processing because (among other things) the number of Tesseract.js workers it creates is not bounded--you can end up creating 12 workers at the same time and crashing your application.

If you want to run multiple workers at the same time--which is the most efficient way to do things--you should implement a scheduler using the resources I linked in my previous comment.

Balearica commented 7 months ago

The root cause of this bug is that Tesseract.js workers only store one promise for each action at a time. This can be seen in the code linked below. If there is an unresolved promise with action recognize, and a new promise is created with action recognize, the new promise will overwrite the old promise.

https://github.com/naptha/tesseract.js/blob/master/src/createWorker.js#L51-L71

This causes the first recognize job that finishes to resolve the last promise created with action recognize. The remaining jobs still run but do not return anything to the user, as the promise they are trying to resolve has already been resolved.

This should be easily fixable by making the identifier for all promises unique, which can be achieved by appending jobId. I will do this in the next version of Tesseract.js. However, even after this is fixed, I still do not recommend sending multiple recognize jobs to the same worker at the same time. As described above, using a scheduler is the most performant way to handle executing multiple jobs in parallel. Additionally, given that jobs have side effects and there is no queue system built into Tesseract.js workers, it may still be possible for sending a bunch of jobs to a single worker at the same time to cause issues.
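To make the overwrite concrete, here is a simplified simulation. It is not the actual code from createWorker.js; `PendingJobs`, `start`, and `finish` are hypothetical names. Resolvers are stored in a map: keying by action alone loses the earlier resolver, while keying by action plus job id keeps both.

```typescript
type Resolver = (value: string) => void;

// Simplified, hypothetical model of a worker's table of pending promises.
class PendingJobs {
  private resolvers = new Map<string, Resolver>();
  private uniqueKeys: boolean;

  // uniqueKeys = false models the bug (the key is the action only);
  // uniqueKeys = true models the fix (the key is the action plus the job id).
  constructor(uniqueKeys: boolean) {
    this.uniqueKeys = uniqueKeys;
  }

  private key(action: string, jobId: string): string {
    return this.uniqueKeys ? `${action}-${jobId}` : action;
  }

  // Register a job and return the promise the caller would await.
  start(action: string, jobId: string): Promise<string> {
    return new Promise<string>((resolve) => {
      this.resolvers.set(this.key(action, jobId), resolve);
    });
  }

  // Called when a job completes: resolve whichever promise is stored under its key.
  finish(action: string, jobId: string, result: string): void {
    this.resolvers.get(this.key(action, jobId))?.(result);
  }
}
```

With `new PendingJobs(false)`, starting jobs "1" and "2" and then finishing job "1" resolves the promise returned for job "2" with job 1's result, and the promise for job "1" never resolves, which matches the hang reported above. With `new PendingJobs(true)`, each promise receives its own result.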

Balearica commented 7 months ago

The fix described above has been implemented in the master branch and will be included in the next patch release (v5.0.5).

DR7777 commented 6 months ago

Thanks a lot for implementing it that quick! Appreciate it!

I am using schedulers now.