opensemanticsearch / tika-server.deb

Apache Tika Server as Debian GNU/Linux and Ubuntu Linux package
https://opensemanticsearch.org

Increase OCR timeout #10

Closed opensemanticsearch closed 4 years ago

opensemanticsearch commented 5 years ago

Tika's default OCR timeout of 120 seconds is not enough when multiple documents or images are being OCRed in parallel. This leads to Tika OCR timeouts and so to a Tika exception for the full document(s).

Mandalka commented 5 years ago

The build script of the Debian package now extracts the OCR config org/apache/tika/parser/ocr/TesseractOCRConfig.properties from the Tika Server JAR, changes the timeout setting, and writes the changed config back into the package's Tika Server JAR (adding or overwriting the file).
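
For anyone who wants to do the same to their own JAR, here is a rough sketch of that repackaging step in Python. It is not the actual build script of this package. It assumes the properties key is named `timeout` (in seconds), and the function names `set_property` and `patch_jar` are made up for illustration. Since Python's zipfile cannot replace an entry in place, the sketch writes a new JAR to a different path.

```python
import zipfile

# Path of the OCR config inside the Tika Server JAR (from the comment above).
CONFIG_ENTRY = "org/apache/tika/parser/ocr/TesseractOCRConfig.properties"


def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with key=value set, appending it if absent."""
    out = []
    found = False
    for line in text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith("#")
        if not is_comment and stripped.split("=", 1)[0].strip() == key:
            out.append(f"{key}={value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


def patch_jar(src_jar: str, dst_jar: str, timeout_seconds: int) -> None:
    """Copy src_jar to dst_jar, changing the OCR timeout on the way.

    Assumption: the key is called "timeout". src_jar and dst_jar must be
    different paths.
    """
    patched = False
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CONFIG_ENTRY:
                # Java properties files are traditionally ISO-8859-1.
                text = data.decode("latin-1")
                text = set_property(text, "timeout", str(timeout_seconds))
                data = text.encode("latin-1")
                patched = True
            dst.writestr(item, data)
        if not patched:
            # Config was not in the JAR: add it instead of overwriting.
            dst.writestr(CONFIG_ENTRY, f"timeout={timeout_seconds}\n")
```

All other entries of the JAR are copied unchanged, so only the one properties file differs between the two archives.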

rmazzine commented 5 years ago

If anybody needs to modify the timeout via REST, just add the header "X-Tika-OCRTimeout: 200" for a timeout of 200 seconds.

Example:

curl -T file_to_ocr.jpg localhost:9998/tika --header "X-Tika-OCRTimeout: 200"
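
The same request can be sketched in Python with only the standard library. This is a minimal example, assuming a Tika server on localhost:9998 as in the curl call; the function names `build_ocr_request` and `extract_text` and the extra 30 seconds of client-side slack are arbitrary choices, not part of Tika.

```python
import urllib.request


def build_ocr_request(data: bytes, ocr_timeout_seconds: int,
                      url: str = "http://localhost:9998/tika"
                      ) -> urllib.request.Request:
    """Build a PUT request (like curl -T) carrying the OCR timeout header."""
    if ocr_timeout_seconds <= 0:
        raise ValueError("OCR timeout must be a positive number of seconds")
    headers = {"X-Tika-OCRTimeout": str(ocr_timeout_seconds)}
    return urllib.request.Request(url, data=data, headers=headers,
                                  method="PUT")


def extract_text(path: str, ocr_timeout_seconds: int = 200) -> str:
    """Send the file to Tika and return the response body as text."""
    with open(path, "rb") as f:
        request = build_ocr_request(f.read(), ocr_timeout_seconds)
    # Let the HTTP client wait a bit longer than the OCR itself may take,
    # so the client does not give up before the server does.
    with urllib.request.urlopen(request,
                                timeout=ocr_timeout_seconds + 30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("file_to_ocr.jpg")` then corresponds to the curl command above.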

opensemanticsearch commented 5 years ago

Thanks for the tip. I will add that to the ETL plugin for cases where someone uses a Tika on another server or installation that is not our preconfigured Tika deb package.

Mandalka commented 4 years ago

The timeout is now set by Open Semantic ETL using the header X-Tika-OCRTimeout for Tika-Server.

MparkG commented 2 years ago

I am seeing this error now; it's for the fake Tika server:

java[1950828]: ERROR [Thread-22] 22:22:34,199 org.apache.tika.server.core.ServerStatusWatcher Timeout task PARSE, millis elapsed 300091, timeoutMillis 300000, file id b'World History.pdf'consider increasing the allowable time with the <taskTimeoutMillis/> parameter or the X-Tika-Timeout-Millis header
Jan 27 22:22:34 mgp java[1950828]: WARN  [Thread-22] 22:22:34,199 org.apache.tika.server.core.ServerStatusWatcher forked process observed TIMEOUT and is shutting down.
Jan 27 22:22:34 mgp java[1950828]: INFO  [Thread-22] 22:22:34,214 org.apache.tika.server.core.ServerStatusWatcher Shutting down forked process with status: TIMEOUT
Jan 27 22:22:34 mgp etl_tasks[2349205]: [2022-01-27 22:22:34,677: WARNING/ForkPoolWorker-3] Connection to Tika server (will retry in 120 seconds) failed. Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Jan 27 22:22:34 mgp etl_tasks[2349205]: [2022-01-27 22:22:34,677: WARNING/ForkPoolWorker-3] Retrying to connect to Tika server in 120 second(s).
Jan 27 22:22:34 mgp java[1929662]: INFO  [pool-2-thread-1] 22:22:34,678 org.apache.tika.server.core.TikaServerWatchDog forked process exited with exit value 3
Jan 27 22:22:36 mgp java[1961770]: INFO  [main] 22:22:36,867 org.apache.tika.server.core.TikaServerProcess Starting Apache Tika 2.2.1 server
Jan 27 22:22:37 mgp java[1961770]: INFO  [main] 22:22:37,014 org.apache.tika.server.core.TikaServerProcess Using custom config: /etc/tika/tika-config-fakecache.xml
Jan 27 22:22:37 mgp java[1961770]: INFO  [main] 22:22:37,897 org.apache.cxf.endpoint.ServerImpl Setting the server's publish address to be http://localhost:9999/
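
The first log line names two remedies: the <taskTimeoutMillis/> parameter or the X-Tika-Timeout-Millis header. As a small sketch of the header route, the following reads the configured limit out of such a log line and builds a header with a larger value. The header name is taken from the log message only; the helper names and the factor of 2 are assumptions for illustration, and whether a given Tika version accepts the header was not verified here.

```python
import re

# Matches the numbers in the ServerStatusWatcher message shown above.
TIMEOUT_PATTERN = re.compile(r"millis elapsed (\d+), timeoutMillis (\d+)")


def configured_timeout_millis(log_line: str):
    """Return the timeoutMillis value from a timeout log line, or None."""
    match = TIMEOUT_PATTERN.search(log_line)
    return int(match.group(2)) if match else None


def raised_timeout_header(log_line: str, factor: int = 2) -> dict:
    """Build a request header asking for factor times the logged limit."""
    current = configured_timeout_millis(log_line)
    if current is None:
        raise ValueError("no timeoutMillis found in log line")
    return {"X-Tika-Timeout-Millis": str(current * factor)}
```

The resulting dict can be merged into the headers of whatever HTTP client sends the document to Tika.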