tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

How to use Tesseract in a multi-threaded environment? #4281

Open kinghelong opened 2 months ago

kinghelong commented 2 months ago

Current Behavior

#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <string>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#pragma comment(lib, "tesseract54.lib")

std::mutex io_mutex;

void performOCR(const std::string& imagePath, int threadId) {
    tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
    if (api->Init(NULL, "chi_sim")) {
        std::lock_guard<std::mutex> lock(io_mutex);
        std::cerr << "Could not initialize tesseract for thread " << threadId << std::endl;
        delete api;
        return;
    }

    Pix* image = pixRead(imagePath.c_str());
    if (!image) {
        std::lock_guard<std::mutex> lock(io_mutex);
        std::cerr << "Could not open input image for thread " << threadId << std::endl;
        delete api;
        return;
    }

    api->SetImage(image);
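    char* outText = api->GetUTF8Text();  // returns nullptr if recognition fails

    {
        std::lock_guard<std::mutex> lock(io_mutex);
        // Streaming a null char* is undefined behavior, so substitute a placeholder.
        std::cout << "Thread " << threadId << " OCR output: " << std::endl
                  << (outText ? outText : "(no text)") << std::endl;
    }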

    delete[] outText;
    pixDestroy(&image);
    api->End();
    delete api;
}

int main() {
    const int numThreads = 5; 
    const std::string imagePath = "H:\\1.png";

    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back(performOCR, imagePath, i);
    }

    for (auto& th : threads) {
        th.join();
    }

    return 0;
}

This is sample code. I am integrating Tesseract OCR into a multithreaded application to perform real-time text recognition on dynamically changing screens. However, I'm encountering several issues related to multithreading:

Exception Handling: Intermittently, the application crashes with access violations or segmentation faults when attempting to interact with Tesseract API functions from multiple threads simultaneously.

Thread Synchronization: Despite using mutexes to synchronize access to Tesseract API calls, I observe occasional data corruption or deadlock situations, particularly when multiple threads concurrently attempt to initialize or interact with Tesseract instances.

Resource Management: There are concerns regarding memory management and resource leaks when multiple OCR tasks are spawned and terminated rapidly in response to screen changes. This includes potential issues with cleanup of Tesseract resources after OCR tasks complete.

Performance Impact: The performance of Tesseract OCR appears to degrade under heavy multithreaded load, leading to increased latency in text recognition or failure to accurately capture screen content changes.

Debugging Output: Debugging the application reveals sporadic errors related to memory access violations or invalid API state transitions, especially when multiple OCR tasks are active concurrently.

I have attempted to implement thread-safe practices such as mutexes and careful resource allocation, but these issues persist. I am seeking guidance on best practices for integrating Tesseract OCR effectively in a multithreaded environment, ensuring stable performance and reliable text recognition across dynamic screen updates.
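On the resource-management concern, one option is to let RAII wrappers release C-style handles so that every early return cleans up. The following is a minimal sketch, not Tesseract or Leptonica code: `FakeImage`, `fakeImageRead`, and `fakeImageDestroy` are hypothetical stand-ins that only mimic the shape of `pixRead` and `pixDestroy(&pix)`, so the block compiles without either library.

```cpp
#include <atomic>
#include <memory>
#include <string>

// Hypothetical stand-in for a C-style image handle (not a Leptonica type).
struct FakeImage {
    std::string path;
};

// Counts handles that are currently open, so leaks are observable.
std::atomic<int> g_liveImages{0};

// Hypothetical reader: returns nullptr on failure, like a C-style API would.
FakeImage* fakeImageRead(const std::string& path) {
    if (path.empty()) return nullptr;
    ++g_liveImages;
    return new FakeImage{path};
}

// Hypothetical destroy function taking a pointer-to-pointer, mirroring
// the pixDestroy(&pix) calling convention used in the sample above.
void fakeImageDestroy(FakeImage** img) {
    if (img && *img) {
        delete *img;
        *img = nullptr;
        --g_liveImages;
    }
}

// Deleter that adapts the destroy function for std::unique_ptr.
struct FakeImageDeleter {
    void operator()(FakeImage* img) const { fakeImageDestroy(&img); }
};
using ImagePtr = std::unique_ptr<FakeImage, FakeImageDeleter>;

// Every return path releases the handle because ImagePtr owns it.
bool processImage(const std::string& path) {
    ImagePtr image(fakeImageRead(path));
    if (!image) return false;                  // nothing was opened
    if (image->path.size() < 5) return false;  // early return, handle still freed
    return true;
}
```

With real handles the same pattern would presumably wrap `Pix*` and the `char*` returned by `GetUTF8Text`, each with a deleter that calls the matching release function.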

Expected Behavior

In the multithreaded application integrating Tesseract OCR, the following expected behaviors are anticipated:

Thread Safety: Tesseract OCR operations should be robustly thread-safe, allowing multiple threads to concurrently capture screen content, process bitmap data, and perform text recognition without encountering crashes or resource conflicts.

Real-Time Text Recognition: The application should accurately extract text from dynamically changing screen content in real-time, leveraging Tesseract's capabilities to handle varied fonts, sizes, and languages commonly encountered in screen-based applications.

Performance Optimization: Efficient utilization of system resources to ensure minimal latency in OCR processing, even under heavy concurrent workload scenarios. This includes optimizing memory usage and processing efficiency to maintain responsive performance.

Error Handling: Effective error detection and recovery mechanisms should be in place to gracefully handle exceptional conditions such as image data corruption, API initialization failures, or temporary unavailability of OCR resources.

Scalability: The application should scale seamlessly with the number of concurrent OCR tasks, supporting parallel processing of screen regions and ensuring that OCR results are consistently accurate and reliable.

Resource Management: Proper cleanup and release of resources after OCR tasks complete, ensuring that memory leaks or resource exhaustion issues are minimized, even during rapid task creation and termination cycles.

By achieving these expected behaviors, the integration of Tesseract OCR into a multithreaded environment should enable robust, responsive, and reliable text recognition capabilities across diverse screen-based applications.

Suggested Fix

To address the challenges observed with Tesseract OCR in a multithreaded environment, the following approaches are recommended:

Thread-Safe Initialization: Ensure that Tesseract API initialization (TessBaseAPI::Init) and resource allocation are performed in a thread-safe manner. Consider using mutex locks or synchronization mechanisms to prevent concurrent access issues during initialization.

Scoped API Usage: Utilize Tesseract API functions (SetImage, GetUTF8Text, etc.) within scoped regions to limit their visibility and prevent simultaneous access from multiple threads. This helps in managing concurrent OCR tasks more effectively.

Resource Isolation: Implement strategies to isolate OCR resources per thread or task. For example, allocate separate instances of TessBaseAPI or other necessary objects for each thread to avoid contention over shared resources.

Error Handling and Recovery: Enhance error handling routines to gracefully manage exceptions and recover from OCR failures. Implement retry mechanisms or fallback strategies to retry OCR operations upon transient errors or resource unavailability.

Performance Optimization: Optimize OCR processing by reducing unnecessary resource allocations and minimizing data copying between threads. Utilize efficient memory management techniques and leverage asynchronous processing where applicable to enhance overall system performance.

Testing and Validation: Conduct rigorous testing in diverse multithreaded scenarios to validate the reliability and stability of Tesseract OCR integration. Use stress testing to simulate high concurrent loads and identify potential bottlenecks or performance degradation points.

By implementing these suggested fixes, the application should enhance its robustness and performance when utilizing Tesseract OCR in a multithreaded environment, ensuring smooth operation and accurate text recognition across varying workload conditions.

tesseract -v

No response

Operating System

Windows 11

Other Operating System

No response

uname -a

No response

Compiler

Visual C++ 2022 00482-10000-00261-AA603 C++14

CPU

AMD Ryzen r5 5600g

Virtualization / Containers

No response

Other Information

No response

stweil commented 2 months ago

Please fix the sample code in your report. It should be possible to understand and use it without wasting time on guessing.

Did you know that the Tesseract development is entirely driven by a small number of volunteers? Feel free to fix any issue when you think it's necessary.

amitdo commented 1 month ago

Regarding performance, you should disable OpenMP, either at compile time or at runtime.

amitdo commented 1 month ago

https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html#v301

Thread-safety! Moved all critical global and static variables to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads.) with the minor exception that some control parameters are still global and affect all threads.
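The release note says that separate instances can run in parallel, which suggests giving each worker thread its own engine and reusing it across images instead of sharing one instance behind a mutex. The following is a minimal sketch of that layout. `FakeOcrEngine` and `recognizeAll` are hypothetical names, and the engine is a stand-in for `tesseract::TessBaseAPI` so that the block runs without Tesseract installed.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an OCR engine. All of its state is per-instance.
class FakeOcrEngine {
public:
    std::string Recognize(const std::string& imagePath) {
        return "text from " + imagePath;
    }
};

// Processes all paths with numWorkers threads. Each thread constructs one
// engine, reuses it for every image it picks up, and writes only to the
// result slot of the index it claimed, so the results need no mutex.
std::vector<std::string> recognizeAll(const std::vector<std::string>& paths,
                                      unsigned numWorkers) {
    std::vector<std::string> results(paths.size());
    std::atomic<std::size_t> next{0};  // next unclaimed index

    auto work = [&paths, &results, &next] {
        FakeOcrEngine engine;  // owned by this thread only
        for (std::size_t i = next.fetch_add(1); i < paths.size();
             i = next.fetch_add(1)) {
            results[i] = engine.Recognize(paths[i]);
        }
    };

    std::vector<std::thread> workers;
    workers.reserve(numWorkers);
    for (unsigned t = 0; t < numWorkers; ++t) workers.emplace_back(work);
    for (auto& w : workers) w.join();
    return results;
}
```

Creating the engine once per thread rather than once per image matters with the real library, since initialization loads language data. The release note's caveat still applies: parameters that remain global affect every instance.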

amitdo commented 1 month ago

https://github.com/tesseract-ocr/tesseract/blob/577e8a8b93a94ded139d66e41ee08d345b3c67ab/src/tesseract.cpp#L675-L678

https://github.com/tesseract-ocr/tesseract/blob/215b023c43f67a52fe4c9f783988503529f5c6dd/src/dict/dict.cpp#L172-L177

Currently, the API does not expose this static variable.