naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Large images cause excessive memory usage #900

Open Balearica opened 8 months ago

Balearica commented 8 months ago

Overview

Tesseract.js currently accepts any valid image and does not downsize large images. Additionally, while the memory allocated for the WebAssembly "heap" can grow if needed, it cannot shrink. Taken together, these behaviors can cause problems for applications that run recognition on arbitrary user input. A single excessively large image can cause the allocated memory to expand, and the worker will keep using that larger amount of memory for the rest of its lifespan. This is especially problematic when schedulers are used with 4+ workers.
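The grow-only behavior can be seen with the standard `WebAssembly.Memory` API directly. This is only a minimal sketch of the underlying mechanism, not how Tesseract.js manages its heap internally; the page counts are arbitrary values chosen for illustration.

```javascript
// WebAssembly linear memory is sized in 64 KiB pages.
const PAGE_BYTES = 64 * 1024;

// Arbitrary illustrative sizes: start at 1 page, allow growth up to 4.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 4 });
const before = memory.buffer.byteLength;

// grow() adds pages and returns the previous size in pages.
const previousPages = memory.grow(2);
const after = memory.buffer.byteLength;

// The API has grow() but no counterpart for releasing pages.
const hasShrink = typeof memory.shrink === 'function';

console.log({ before, after, previousPages, hasShrink, PAGE_BYTES });
```

Once the memory has grown, the only way to give it back is to discard the whole instance, which for Tesseract.js means terminating the worker.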

Solutions

Individual Projects

Individual projects can mitigate this by checking the size of images before sending them to Tesseract.js. If an image is excessively large, it can be rejected or downsized.
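A size check along these lines could look like the sketch below. The `planResize` helper and the `MAX_PIXELS` threshold are hypothetical (the issue does not specify a limit), and how the dimensions are read, for example with an image library, is left to the caller.

```javascript
// Hypothetical limit; tune it to the memory budget of your application.
const MAX_PIXELS = 12_000_000;

/**
 * Decide whether an image should be downsized before recognition.
 * Returns the target dimensions, preserving the aspect ratio.
 */
function planResize(width, height, maxPixels = MAX_PIXELS) {
  if (!Number.isInteger(width) || !Number.isInteger(height) ||
      width <= 0 || height <= 0) {
    throw new RangeError('width and height must be positive integers');
  }

  const pixels = width * height;
  if (pixels <= maxPixels) {
    // Small enough: pass the image to Tesseract.js unchanged.
    return { resize: false, width, height };
  }

  // Scale both sides by the same factor so the pixel count fits the limit.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    resize: true,
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}

console.log(planResize(8000, 6000)); // { resize: true, width: 4000, height: 3000 }
console.log(planResize(1024, 768));  // { resize: false, width: 1024, height: 768 }
```

An application that prefers to reject oversized uploads can use the same check and return an error whenever `resize` is `true`.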

Additionally, if Tesseract.js runs on Node.js for hours on end within server code, the workers should be killed and recreated every so often. While workers are reusable, and should not be created and killed for every image recognized, there are disadvantages to using them forever. As noted above, memory use can only expand over time, so a single large image will permanently increase the memory footprint of a worker. In addition, workers "learn" over time by default, editing their internal dictionaries based on words recognized in documents. This is useful within a single document or a group of similar documents, but is not necessarily desirable when recognizing hundreds of unrelated documents. Recreating the worker resets the dictionary.
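One way to recycle workers is a thin wrapper that terminates the worker after a fixed number of jobs and lazily creates a fresh one. The `RecyclingWorker` class and the `maxJobs` default below are hypothetical, not part of Tesseract.js. The wrapper only assumes that the factory returns an object with `recognize()` and `terminate()` methods, as Tesseract.js workers do.

```javascript
class RecyclingWorker {
  // createWorkerFn: async factory returning an object with
  // recognize(image) and terminate(). maxJobs is an arbitrary default.
  constructor(createWorkerFn, maxJobs = 50) {
    this.createWorkerFn = createWorkerFn;
    this.maxJobs = maxJobs;
    this.worker = null;
    this.jobs = 0;
    this.queue = Promise.resolve(); // runs jobs one at a time
  }

  recognize(image) {
    const result = this.queue.then(() => this.runJob(image));
    this.queue = result.catch(() => {}); // keep the queue alive on errors
    return result;
  }

  async runJob(image) {
    if (this.worker === null) {
      this.worker = await this.createWorkerFn();
      this.jobs = 0;
    }
    try {
      return await this.worker.recognize(image);
    } finally {
      this.jobs += 1;
      // Recycling frees the grown heap and resets the learned dictionary.
      if (this.jobs >= this.maxJobs) await this.release();
    }
  }

  async release() {
    const old = this.worker;
    this.worker = null;
    if (old !== null) await old.terminate();
  }

  // Terminate after any queued jobs have finished.
  terminate() {
    const done = this.queue.then(() => this.release());
    this.queue = done.catch(() => {});
    return done;
  }
}
```

With a recent Tesseract.js version, the factory could be `() => createWorker('eng')`, with `createWorker` imported from `tesseract.js`. Counting jobs is just one possible policy. A timer, or a threshold on the size of the images processed, would work in the same way.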

Tesseract.js

Eventually, Tesseract.js should automatically downsize images that are over a certain size. This size should be configurable by the user.

rohitsahu-bstack commented 4 months ago

> if Tesseract.js is being run on Node.js for hours on end within server code, the workers should be killed and recreated every so often.

@Balearica I feel that this should be included in the documentation. It would save a lot of time for developers who try to reuse workers on a server and then run into memory consumption issues.

Balearica commented 4 months ago

@rohitsahu-bstack Good suggestion, I added a new section explaining this case. https://github.com/naptha/tesseract.js/blob/master/docs/workers_vs_schedulers.md#reusing-workers-in-nodejs-server-code