Status: Open. Opened by Balearica 8 months ago.
"If Tesseract.js is being run on Node.js for hours on end within server code, the workers should be killed and recreated every so often."
@Balearica I feel this should be included in the documentation. It would save a lot of time for developers who try to reuse workers on a server and then run into memory consumption issues.
@rohitsahu-bstack Good suggestion. I added a new section explaining this case: https://github.com/naptha/tesseract.js/blob/master/docs/workers_vs_schedulers.md#reusing-workers-in-nodejs-server-code
Overview
Tesseract.js currently accepts any valid image, and does not downsize large images. Additionally, while the memory allocated for the WebAssembly "heap" can increase if needed, it cannot decrease. These behaviors, taken together, can cause issues for applications that run recognition on arbitrary user inputs. A single excessively large image can cause the allocated memory to expand, and for the rest of the worker's lifespan, that worker will keep using a large amount of memory. This is especially problematic in cases where schedulers are used with 4+ workers, since each worker has its own heap.
Solutions
Individual Projects
Individual projects can mitigate this by checking the size of images before sending them to Tesseract.js. If an image is excessively large, it can be rejected or downsized.
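As a rough illustration of such a check, the sketch below reads the pixel dimensions from a PNG header and rejects images over a limit, without decoding the image. The names `readPngSize`, `isAcceptableImage`, and `MAX_PIXELS` are hypothetical and not part of Tesseract.js, and the limit is an arbitrary assumption. It only handles PNG; other formats would need their own header parsing or an image library.

```javascript
// PNG files start with this fixed 8-byte signature.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Hypothetical limit (an assumption, not a Tesseract.js setting).
// Tune it to the memory budget of your server.
const MAX_PIXELS = 4000 * 4000;

// Read width and height from the IHDR chunk, which must be the first
// chunk after the signature. Returns null if the buffer is not a PNG.
function readPngSize(buffer) {
  if (buffer.length < 24) return null;
  if (!buffer.subarray(0, 8).equals(PNG_SIGNATURE)) return null;
  if (buffer.toString('ascii', 12, 16) !== 'IHDR') return null;
  return {
    width: buffer.readUInt32BE(16),  // big-endian, bytes 16..19
    height: buffer.readUInt32BE(20), // big-endian, bytes 20..23
  };
}

// True only if the buffer is a PNG whose pixel count is within the limit.
function isAcceptableImage(buffer, maxPixels = MAX_PIXELS) {
  const size = readPngSize(buffer);
  if (size === null) return false;
  return size.width * size.height <= maxPixels;
}
```

A server would call `isAcceptableImage(uploadBuffer)` before passing the buffer to a worker, and either reject the upload or downsize it with an image library of its choice when the check fails.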
Additionally, if Tesseract.js runs on Node.js for hours on end within server code, the workers should be terminated and recreated periodically. Workers are reusable and should not be created and terminated for every image recognized, but there are disadvantages to keeping them alive indefinitely. As noted above, memory use can only grow over time, so a single large image permanently increases the memory footprint of a worker. Workers also "learn" over time by default, editing their internal dictionaries based on words recognized in documents. This is useful within a single document or a group of similar documents, but is not necessarily desirable when recognizing hundreds of unrelated documents. Recreating the worker resets the dictionary.
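One possible way to recycle a worker is sketched below. The wrapper `createRecyclingRecognizer` and its `maxJobs` threshold are hypothetical and not part of Tesseract.js. It assumes only that the factory returns (a promise for) an object with `recognize()` and `terminate()` methods, as Tesseract.js workers do. Jobs are serialized so a worker is never terminated in the middle of a recognition.

```javascript
// Wraps a worker factory so that the worker is terminated and recreated
// after every `maxJobs` recognitions. `maxJobs = 100` is an arbitrary
// assumption; pick a value that suits your workload.
function createRecyclingRecognizer(createWorkerFn, maxJobs = 100) {
  let worker = null;             // current worker, created lazily
  let jobsDone = 0;              // jobs handled by the current worker
  let queue = Promise.resolve(); // runs jobs strictly one at a time

  async function runJob(image) {
    if (worker === null) {
      worker = await createWorkerFn();
      jobsDone = 0;
    }
    try {
      return await worker.recognize(image);
    } finally {
      jobsDone += 1;
      if (jobsDone >= maxJobs) {
        // Drop the reference first so the next job creates a fresh worker,
        // with a fresh heap and dictionary.
        const old = worker;
        worker = null;
        await old.terminate();
      }
    }
  }

  return function recognize(image) {
    const result = queue.then(() => runJob(image));
    queue = result.catch(() => {}); // keep the queue usable after a failed job
    return result;
  };
}
```

With a recent Tesseract.js release, where `createWorker('eng')` returns a promise for a worker, the factory could be `() => createWorker('eng')`, and each request would then call `await recognize(image)`. The same idea extends to schedulers by recycling each worker in the pool, at the cost of more bookkeeping.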
Tesseract.js
Eventually, Tesseract.js should automatically downsize images that are over a certain size. This size should be configurable by the user.