madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

How to kill tesseract process #278

Closed Samll-Kosmos closed 4 years ago

Samll-Kosmos commented 4 years ago

Hi,

I'm currently using Gunicorn as the WSGI server for a web service. In the service I use pytesseract to process and extract information from some documents. I have configured Gunicorn with a timeout. The problem is that when the timeout is reached and the connection is closed, the tesseract process keeps running in the background. Is there a way to manually kill this tesseract process?

Many thanks

bozhodimitrov commented 4 years ago

@Samll-Kosmos Hi, pytesseract itself supports a timeout argument. Please check out the documentation for examples and more info. When the timeout is reached, pytesseract should kill the related tesseract process. Keep in mind that this method is not graceful, and you should not rely on getting a result once the timeout is reached.
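For readers following along, here is a minimal sketch of what using the timeout argument might look like. The helper name `ocr_with_timeout` and its `ocr` parameter are hypothetical and only there for illustration. The sketch assumes that pytesseract signals a timeout by raising `RuntimeError`, as its README describes.

```python
def ocr_with_timeout(image, seconds, ocr=None):
    """Return the OCR text, or None if the call did not finish in `seconds`.

    `ocr` is a hypothetical injection point for testing; by default the
    helper calls pytesseract.image_to_string.
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract.
        import pytesseract
        ocr = pytesseract.image_to_string
    try:
        # `timeout` is in seconds; pytesseract stops tesseract when it expires.
        return ocr(image, timeout=seconds)
    except RuntimeError:
        # Assumption: a timeout surfaces as RuntimeError. No partial result
        # is available at this point, so the caller only gets None.
        return None
```

Something like `ocr_with_timeout("scan.png", 5)` would then give up after roughly five seconds. Other pytesseract failures may also derive from `RuntimeError`, so a real service would probably want to log the exception rather than drop it silently.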

Samll-Kosmos commented 4 years ago

Thanks for the quick reply.

Using the timeout argument partly solves my problem, but it is not optimal. I was looking for a way to kill the process at an arbitrary point in time (in my case, as soon as the web server reaches its configured timeout). But I think this is not possible unless the extraction function (e.g. image_to_data) exposes some information about the process being executed.

A possible solution is to pass a mutable object to image_to_data that receives information about the process. Something like this:

# Proposed API only: process_info is not an existing pytesseract parameter.
process_info = {}
image_to_data(image, lang, config, nice, output_type, timeout, pandas_config, process_info)

Then process_info could contain a field called PID, holding the PID of the tesseract process as soon as that process starts. But this doesn't sound like a good solution.

bozhodimitrov commented 4 years ago

Python is a very dynamic language, so if you want, you can redefine pytesseract.pytesseract.timeout_manager and replace it with a custom version that reports the PIDs to you. You can even reuse the timeout argument for bidirectional communication.
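A rough sketch of that idea, using only the standard library so it can be tried without Tesseract. The names `RUNNING`, `tracking_timeout_manager` and `kill_all` are hypothetical. The sketch also assumes that the installed pytesseract version invokes `timeout_manager(proc, seconds)` as a context manager around a `subprocess.Popen` object. That is an internal detail and may change between releases.

```python
import subprocess
import threading
from contextlib import contextmanager

# Hypothetical registry: PID -> Popen for every child currently being waited on.
RUNNING = {}
_LOCK = threading.Lock()


@contextmanager
def tracking_timeout_manager(proc, seconds=None):
    """Wait for `proc` like a timeout manager would, but record it first."""
    with _LOCK:
        RUNNING[proc.pid] = proc
    try:
        try:
            # A falsy `seconds` means "no timeout".
            _, error_string = proc.communicate(timeout=seconds or None)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            raise RuntimeError("Tesseract process timeout")
        yield error_string
    finally:
        with _LOCK:
            RUNNING.pop(proc.pid, None)
        for stream in (proc.stdin, proc.stdout, proc.stderr):
            if stream is not None:
                stream.close()


def kill_all():
    """Kill every tracked process; returns how many were signalled."""
    with _LOCK:
        procs = list(RUNNING.values())
    for proc in procs:
        proc.kill()
    return len(procs)


# Assumed wiring (depends on pytesseract internals, not a documented API):
#   import pytesseract
#   pytesseract.pytesseract.timeout_manager = tracking_timeout_manager
```

With the replacement in place, another thread or a shutdown hook could call `kill_all()` at an arbitrary moment, for example when the web server is about to give up on the request. The killed call then fails or returns empty output instead of a usable result, so the caller has to treat it as aborted.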

Samll-Kosmos commented 4 years ago

Thank you for the idea. I'm going to implement it like that. I may post a snippet here to show others my solution.