paperless-ngx / paperless-ngx

A community-supported supercharged version of paperless: scan, index and archive all your physical documents
https://docs.paperless-ngx.com
GNU General Public License v3.0

[BUG] 'Unable to run java; is it installed?' stops Tika server from starting correctly #3491

Closed bbbart closed 1 year ago

bbbart commented 1 year ago

Description

I am running paperless-ngx 1.14.5 using the docker images pulled from Docker Hub. Everything worked smoothly until I uploaded a few .docx files.

paperless-webserver-1  | [2023-05-31 21:36:36,075] [INFO] [paperless.consumer] Consuming 310-108755-000 CONTRACT.docx
paperless-webserver-1  | [2023-05-31 21:36:36,121] [INFO] [paperless.parsing.tika] Sending /tmp/paperless/paperless-ngxbjv8e561/310-108755-000 CONTRACT.docx to Tika server
paperless-webserver-1  | [2023-05-31 21:36:36,227] [WARNING] [celery.redirected] 2023-05-31 21:36:36,227 [MainThread  ] [ERROR]  Unable to run java; is it installed?
paperless-webserver-1  | [2023-05-31 21:36:36,227] [ERROR] [tika.tika] Unable to run java; is it installed?
paperless-webserver-1  | [2023-05-31 21:36:36,228] [WARNING] [celery.redirected] 2023-05-31 21:36:36,228 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
paperless-webserver-1  | [2023-05-31 21:36:36,228] [ERROR] [tika.tika] Failed to receive startup confirmation from startServer.
paperless-webserver-1  | [2023-05-31 21:36:36,232] [ERROR] [paperless.consumer] Error while consuming document 310-108755-000 CONTRACT.docx: Could not parse /tmp/paperless/paperless-ngxbjv8e561/310-108755-000 CONTRACT.docx with tika server at http://localhost:9998: Unable to start Tika server.

Strangely, however, docker-compose ps clearly lists /bin/sh -c 'exec java -cp \"/tika-server-standard-${TIKA_VERSION}.jar:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $0 $@' as the command for the paperless-tika-1 container (image ghcr.io/paperless-ngx/tika:latest), so java seems to be installed after all?

Steps to reproduce

  1. pull the latest paperless-ngx docker image
  2. run it
  3. let it consume a .docx file

Webserver logs

Not relevant in this case, I think?

I have access to all logs and am happy to share if necessary.

Browser logs

No response

Paperless-ngx version

1.14.5

Host OS

Arch Linux (x86_64)

Installation method

Docker - official image

Browser

Librewolf (irrelevant here)

Configuration changes

Changed the Gotenberg endpoint from http://gotenberg:3000 to http://localhost:3000 and the Tika endpoint from http://tika:9998 to http://localhost:9998.

Other

No response

shamoon commented 1 year ago

If you’re using our compose file, http://localhost:9998/ isn’t correct; that would hit port 9998 on the webserver container itself, not the Tika container.

In general, I’m not aware of any issues with these.

stumpylog commented 1 year ago

Yes, I agree. You haven't configured Tika correctly, and the client we currently use is attempting to download and run a jar file instead.

You can see an example compose file with Tika here. In particular, these lines show how to correctly configure the remote Tika image.
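
For readers without the linked file at hand, a minimal sketch of the Tika-related parts of such a compose setup might look like the following. It assumes the services are named tika and gotenberg, as the default endpoints in this thread suggest; the webserver image name and the Gotenberg image tag are illustrative assumptions, not a copy of the project's file.

```yaml
# Sketch only: service names and image tags are assumptions modelled on
# the endpoints mentioned in this thread, not the project's exact file.
services:
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    depends_on:
      - gotenberg
      - tika
    environment:
      PAPERLESS_TIKA_ENABLED: 1
      # Use the compose service names, not localhost: inside the webserver
      # container, localhost refers to the webserver container itself.
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000

  gotenberg:
    image: docker.io/gotenberg/gotenberg:7.8  # tag is an assumption

  tika:
    image: ghcr.io/paperless-ngx/tika:latest
```

With the endpoints pointing at the service names, the webserver sends documents to the separate Tika and Gotenberg containers over the compose network instead of trying to start a local Tika server.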

bbbart commented 1 year ago

Ah, I see. I indeed changed the hostnames from tika and gotenberg to localhost. Guess I was trying to be smart about it.

Anyway, after reverting these two changes, I'm now back to the error I had before I started meddling. I'll paste the error below, but feel free to let me know if this requires a new issue.

paperless-webserver-1  | [2023-06-01 09:13:32,348] [INFO] [paperless.consumer] Consuming 310-108755-000 CONTRACT.docx
paperless-webserver-1  | [2023-06-01 09:13:32,395] [INFO] [paperless.parsing.tika] Sending /tmp/paperless/paperless-ngxk_c3rzt5/310-108755-000 CONTRACT.docx to Tika server
paperless-tika-1       | INFO  [qtp495051009-21] 07:13:32,475 org.apache.tika.server.core.resource.RecursiveMetadataResource /rmeta (autodetecting type)
paperless-webserver-1  | [2023-06-01 09:13:33,299] [WARNING] [paperless.parsing.tika] Unable to extract date for document /tmp/paperless/paperless-ngxk_c3rzt5/310-108755-000 CONTRACT.docx: 'Creation-Date'
paperless-webserver-1  | [2023-06-01 09:13:33,299] [INFO] [paperless.parsing.tika] Converting /tmp/paperless/paperless-ngxk_c3rzt5/310-108755-000 CONTRACT.docx to PDF as /tmp/paperless/paperless-lsltxvuj/convert.pdf
paperless-gotenberg-1  | {"level":"error","ts":1685603614.1182299,"logger":"api","msg":"convert PDF: convert PDF to 'PDF/A-2b' with multi PDF engines: convert PDF to 'PDF/A-2b' with PDFcpu: method not available; convert PDF to 'PDF/A-2b' with PDFtk: method not available; convert PDF to 'PDF/A-2b' with QPDF: method not available; convert PDF to 'PDF/A-2b' with unoconv: PDF format not available","trace":"73e032b0-dc16-400c-8a3d-067934874140","remote_ip":"172.21.0.6","host":"gotenberg:3000","uri":"/forms/libreoffice/convert","method":"POST","path":"/forms/libreoffice/convert","referer":"","user_agent":"python-requests/2.30.0","status":400,"latency":812251145,"latency_human":"812.251145ms","bytes_in":122442,"bytes_out":131}
paperless-webserver-1  | [2023-06-01 09:13:34,122] [ERROR] [paperless.consumer] Error while consuming document 310-108755-000 CONTRACT.docx: Error while converting document to PDF: 400 Client Error: Bad Request for url: http://gotenberg:3000/forms/libreoffice/convert

Basically, http://gotenberg:3000/forms/libreoffice/convert replies with a 400 (bad request).

shamoon commented 1 year ago

Re-create the containers. This has come up a few times and seems to be setup-related, not a real bug:

https://github.com/paperless-ngx/paperless-ngx/discussions/2868
https://github.com/search?q=repo%3Apaperless-ngx%2Fpaperless-ngx+gotenberg+bad+request&type=issues
https://github.com/search?q=repo%3Apaperless-ngx%2Fpaperless-ngx+gotenberg+bad+request&type=discussions

bbbart commented 1 year ago

Ah, that works. I moved to gotenberg 7.8 as well.

Thanks for the pointer!
