Closed opensemanticsearch closed 4 years ago
The build script of the Debian package now extracts the OCR config org/apache/tika/parser/ocr/TesseractOCRConfig.properties from the Tika Server JAR, changes the timeout setting, and writes the modified config back into the Tika Server JAR shipped in the package.
If anybody needs to modify the timeout via REST, just add the header "X-Tika-OCRTimeout: 200" for a timeout of 200 seconds.
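For anyone who wants to patch a Tika Server JAR outside the deb package, here is a rough Python sketch of the same idea. It is not the actual build script. The property key `timeout` and the function name `set_ocr_timeout` are assumptions for illustration; check the properties file in your Tika version before relying on it.

```python
import re
import shutil
import tempfile
import zipfile

# Path of the OCR config inside the Tika Server JAR (taken from the comment above).
CONFIG_PATH = "org/apache/tika/parser/ocr/TesseractOCRConfig.properties"


def set_ocr_timeout(jar_path, seconds, key="timeout"):
    """Rewrite the JAR so that `key` in the OCR properties file is `seconds`.

    Assumption: the timeout is stored as a plain `key=value` line.
    A JAR is a ZIP archive and entries cannot be edited in place,
    so every entry is copied into a new archive that replaces the old one.
    """
    with tempfile.NamedTemporaryFile(suffix=".jar", delete=False) as tmp:
        tmp_path = tmp.name

    pattern = rf"(?m)^{re.escape(key)}\s*=.*$"
    new_line = f"{key}={seconds}"

    with zipfile.ZipFile(jar_path) as src, \
            zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CONFIG_PATH:
                # Java .properties files are ISO-8859-1 by convention.
                text = data.decode("latin-1")
                if re.search(pattern, text):
                    text = re.sub(pattern, lambda m: new_line, text)
                else:
                    # Key not present: append it as a new line.
                    text = text.rstrip("\n") + "\n" + new_line + "\n"
                data = text.encode("latin-1")
            dst.writestr(item, data)

    # Replace the original JAR with the patched copy.
    shutil.move(tmp_path, jar_path)
```

Work on a copy of the JAR first. The sketch does not preserve JAR signatures, if there are any.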
Example:
curl -T file_to_ocr.jpg localhost:9998/tika --header "X-Tika-OCRTimeout: 200"
Thanks for the tip. I will add that to the ETL plugin for the case where someone uses a Tika on another server or installation that is not our preconfigured Tika deb package.
Open Semantic ETL now sets the timeout by sending the X-Tika-OCRTimeout header to Tika Server.
I am seeing this now; it's for the fake Tika server:
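For other clients, the curl call above translates to something like the following Python sketch. This is not the Open Semantic ETL code; the function names and the extra 30 seconds of client-side slack are assumptions for illustration. The URL and header come from the curl example.

```python
import urllib.request


def build_tika_request(data, ocr_timeout_s, url="http://localhost:9998/tika"):
    """Build a PUT request (what `curl -T` sends) carrying the OCR timeout header."""
    headers = {"X-Tika-OCRTimeout": str(ocr_timeout_s)}
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


def extract_text(path, ocr_timeout_s=200):
    """Send a file to Tika Server and return the response body as text."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read(), ocr_timeout_s)
    # Assumption: give the HTTP client a bit more time than the OCR timeout,
    # so the client does not give up before Tika does.
    with urllib.request.urlopen(req, timeout=ocr_timeout_s + 30) as resp:
        return resp.read().decode("utf-8")
```

Calling `extract_text("file_to_ocr.jpg", 200)` needs a running Tika Server on localhost:9998; `build_tika_request` can be inspected without one.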
java[1950828]: ERROR [Thread-22] 22:22:34,199 org.apache.tika.server.core.ServerStatusWatcher Timeout task PARSE, millis elapsed 300091, timeoutMillis 300000, file id b'World History.pdf'consider increasing the allowable time with the <taskTimeoutMillis/> parameter or the X-Tika-Timeout-Millis header
Jan 27 22:22:34 mgp java[1950828]: WARN [Thread-22] 22:22:34,199 org.apache.tika.server.core.ServerStatusWatcher forked process observed TIMEOUT and is shutting down.
Jan 27 22:22:34 mgp java[1950828]: INFO [Thread-22] 22:22:34,214 org.apache.tika.server.core.ServerStatusWatcher Shutting down forked process with status: TIMEOUT
Jan 27 22:22:34 mgp etl_tasks[2349205]: [2022-01-27 22:22:34,677: WARNING/ForkPoolWorker-3] Connection to Tika server (will retry in 120 seconds) failed. Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Jan 27 22:22:34 mgp etl_tasks[2349205]: [2022-01-27 22:22:34,677: WARNING/ForkPoolWorker-3] Retrying to connect to Tika server in 120 second(s).
Jan 27 22:22:34 mgp java[1929662]: INFO [pool-2-thread-1] 22:22:34,678 org.apache.tika.server.core.TikaServerWatchDog forked process exited with exit value 3
Jan 27 22:22:36 mgp java[1961770]: INFO [main] 22:22:36,867 org.apache.tika.server.core.TikaServerProcess Starting Apache Tika 2.2.1 server
Jan 27 22:22:37 mgp java[1961770]: INFO [main] 22:22:37,014 org.apache.tika.server.core.TikaServerProcess Using custom config: /etc/tika/tika-config-fakecache.xml
Jan 27 22:22:37 mgp java[1961770]: INFO [main] 22:22:37,897 org.apache.cxf.endpoint.ServerImpl Setting the server's publish address to be http://localhost:9999/
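The error in the log above is the server-side task timeout (300000 ms), not the OCR timeout, and the message itself points to the X-Tika-Timeout-Millis header. A small Python sketch of sending both headers together follows. The helper name `timeout_headers` and the 60 second margin are assumptions, and whether the server honors the per-request header may depend on the Tika version and config.

```python
def timeout_headers(ocr_timeout_s, margin_s=60):
    """Return request headers for the OCR timeout and the overall task timeout.

    Assumption: the task timeout should be somewhat larger than the OCR
    timeout, so the watchdog does not kill the forked process while OCR
    is still within its own limit.
    """
    task_timeout_ms = (ocr_timeout_s + margin_s) * 1000
    return {
        "X-Tika-OCRTimeout": str(ocr_timeout_s),          # seconds
        "X-Tika-Timeout-Millis": str(task_timeout_ms),    # milliseconds
    }
```

For example, `timeout_headers(600)` gives an OCR timeout of 600 seconds and a task timeout of 660000 ms; pass the dict as the request headers of whatever HTTP client talks to the Tika Server (here the instance on port 9999).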
Tika's default OCR timeout of 120 seconds is not enough when multiple documents or images are OCRed in parallel, which leads to Tika OCR timeouts and therefore to a Tika exception for the full document(s).