jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] Installed from script and Gotenberg and Tika not working? #1594

Open 2600box opened 2 years ago

2600box commented 2 years ago

Hello, thanks for this great work!

I am new to paperless-ng and do not normally use Docker, so I may be doing something wrong.

My paperless works well, but when I try to import a .docx file, for example, it fails with: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office

I installed using the script, and specified to enable Tika.

Gotenberg and Tika are running according to docker ps:

paperless@docker ~/paperless-ng$ docker ps
CONTAINER ID   IMAGE                              COMMAND                  CREATED          STATUS                           PORTS                                       NAMES
8a20a33aefa6   jonaswinkler/paperless-ng:latest   "/sbin/docker-entryp_"   2 minutes ago    Up 2 minutes (healthy)           0.0.0.0:8000->8000/tcp, :::8000->8000/tcp   paperless-webserver-1
b4d6babc41a2   postgres:13                        "docker-entrypoint.s_"   24 minutes ago   Up 23 minutes                    5432/tcp                                    paperless-db-1
ed4b52bfb5a4   redis:6.0                          "docker-entrypoint.s_"   24 minutes ago   Up 23 minutes                    6379/tcp                                    paperless-broker-1
d8bf67ec76c5   thecodingmachine/gotenberg         "/usr/bin/tini -- go_"   24 minutes ago   Up 23 minutes                    3000/tcp                                    paperless-gotenberg-1
85843f762418   apache/tika                        "/bin/sh -c 'exec ja_"   24 minutes ago   Up 23 minutes                    9998/tcp                                    paperless-tika-1
paperless@docker ~/paperless-ng$ docker-compose up
[+] Running 5/5
 _ Container paperless-tika-1       Running                                                                                                                                                                                                                                                                         0.0s
 _ Container paperless-gotenberg-1  Running                                                                                                                                                                                                                                                                         0.0s
 _ Container paperless-db-1         Running                                                                                                                                                                                                                                                                         0.0s
 _ Container paperless-broker-1     Running                                                                                                                                                                                                                                                                         0.0s
 _ Container paperless-webserver-1  Created                                                                                                                                                                                                                                                                         9.2s
Attaching to paperless-broker-1, paperless-db-1, paperless-gotenberg-1, paperless-tika-1, paperless-webserver-1
paperless-webserver-1  | Paperless-ng docker container starting...
paperless-webserver-1  | Creating directory /tmp/paperless
paperless-webserver-1  | Adjusting permissions of paperless files. This may take a while.
paperless-webserver-1  | Waiting for PostgreSQL to start...
paperless-webserver-1  | Apply database migrations...
paperless-webserver-1  | Operations to perform:
paperless-webserver-1  |   Apply all migrations: admin, auth, authtoken, contenttypes, django_q, documents, paperless_mail, sessions
paperless-webserver-1  | Running migrations:
paperless-webserver-1  |   No migrations to apply.
paperless-webserver-1  | Executing /usr/local/bin/supervisord -c /etc/supervisord.conf
paperless-webserver-1  | 2022-02-01 11:22:15,874 INFO Set uid to user 0 succeeded
paperless-webserver-1  | 2022-02-01 11:22:15,875 INFO supervisord started with pid 1
paperless-webserver-1  | 2022-02-01 11:22:16,877 INFO spawned: 'consumer' with pid 36
paperless-webserver-1  | 2022-02-01 11:22:16,879 INFO spawned: 'gunicorn' with pid 37
paperless-webserver-1  | 2022-02-01 11:22:16,881 INFO spawned: 'scheduler' with pid 38
paperless-webserver-1  | [2022-02-01 12:22:17 +0100] [37] [INFO] Starting gunicorn 20.1.0
paperless-webserver-1  | [2022-02-01 12:22:17 +0100] [37] [INFO] Listening at: http://0.0.0.0:8000 (37)
paperless-webserver-1  | [2022-02-01 12:22:17 +0100] [37] [INFO] Using worker: paperless.workers.ConfigurableWorker
paperless-webserver-1  | [2022-02-01 12:22:17 +0100] [37] [INFO] Server is ready. Spawning workers
paperless-webserver-1  | 12:22:17 [Q] INFO Q Cluster romeo-idaho-nine-diet starting.
paperless-webserver-1  | [2022-02-01 12:22:17,742] [INFO] [paperless.management.consumer] Using inotify to watch directory for changes: /usr/src/paperless/src/../consume
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1:1 ready for work at 61
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1:2 ready for work at 62
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1:3 monitoring at 63
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1 guarding cluster romeo-idaho-nine-diet
paperless-webserver-1  | 12:22:17 [Q] INFO Process-1:4 pushing tasks at 64
paperless-webserver-1  | 12:22:17 [Q] INFO Q Cluster romeo-idaho-nine-diet running.
paperless-webserver-1  | 2022-02-01 11:22:18,836 INFO success: consumer entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1  | 2022-02-01 11:22:18,836 INFO success: gunicorn entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1  | 2022-02-01 11:22:18,836 INFO success: scheduler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1  | 12:22:47 [Q] INFO Enqueued 1
paperless-webserver-1  | 12:22:47 [Q] INFO Process-1 created a task from schedule [Check all e-mail accounts]
paperless-webserver-1  | 12:22:47 [Q] INFO Process-1:1 processing [lithium-edward-diet-utah]
paperless-webserver-1  | /usr/local/lib/python3.9/site-packages/imap_tools/mailbox.py:214: UserWarning: seen method are deprecated and will be removed soon, use flag method instead
paperless-webserver-1  |   warnings.warn('seen method are deprecated and will be removed soon, use flag method instead')
paperless-webserver-1  | 12:22:50 [Q] INFO Process-1:1 stopped doing work
paperless-webserver-1  | 12:22:50 [Q] INFO Processed [lithium-edward-diet-utah]
paperless-webserver-1  | 12:22:50 [Q] INFO recycled worker Process-1:1
paperless-webserver-1  | 12:22:50 [Q] INFO Process-1:5 ready for work at 77
paperless-broker-1     | 1:M 01 Feb 2022 11:23:06.030 * 100 changes in 300 seconds. Saving...
paperless-broker-1     | 1:M 01 Feb 2022 11:23:06.031 * Background saving started by pid 20
paperless-broker-1     | 20:C 01 Feb 2022 11:23:06.044 * DB saved on disk
paperless-broker-1     | 20:C 01 Feb 2022 11:23:06.044 * RDB: 0 MB of memory used by copy-on-write
paperless-broker-1     | 1:M 01 Feb 2022 11:23:06.132 * Background saving terminated with success
paperless-webserver-1  | [2022-02-01 12:24:01,094] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1  | [2022-02-01 12:24:01,184] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1  | [2022-02-01 12:24:04,271] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1  | 12:24:14 [Q] INFO Enqueued 1
paperless-webserver-1  | 12:24:14 [Q] INFO Process-1:2 processing [Dear Facilitators.docx]
paperless-webserver-1  | [2022-02-01 12:24:15,000] [INFO] [paperless.consumer] Consuming Dear Facilitators.docx
paperless-webserver-1  | [2022-02-01 12:24:15,008] [INFO] [paperless.parsing.tika] Sending /tmp/paperless/paperless-upload-zf1ilcyo to Tika server
paperless-tika-1       | INFO  [qtp2128195220-23] 11:24:15,195 org.apache.tika.server.resource.RecursiveMetadataResource rmeta/text (autodetecting type)
paperless-webserver-1  | [2022-02-01 12:24:15,631] [INFO] [paperless.parsing.tika] Converting /tmp/paperless/paperless-upload-zf1ilcyo to PDF as /tmp/paperless/paperless-agiq8vzt/convert.pdf
paperless-gotenberg-1  | {"level":"error","ts":1643714655.6423903,"logger":"api","msg":"code=404, message=Not Found","trace":"8662f7e2-1acd-4f7b-bfe0-fd235b6c1f59","remote_ip":"172.23.0.6","host":"gotenberg:3000","uri":"/convert/office","method":"POST","path":"/convert/office","referer":"","user_agent":"python-requests/2.26.0","status":404,"latency":2408520,"latency_human":"2.40852ms","bytes_in":31351,"bytes_out":9}
paperless-webserver-1  | [2022-02-01 12:24:15,647] [ERROR] [paperless.consumer] Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 79, in convert_to_pdf
paperless-webserver-1  |     response.raise_for_status()  # ensure we notice bad responses
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
paperless-webserver-1  |     raise HTTPError(http_error_msg, response=self)
paperless-webserver-1  | requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  |
paperless-webserver-1  | During handling of the above exception, another exception occurred:
paperless-webserver-1  |
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
paperless-webserver-1  |     document_parser.parse(self.path, mime_type, self.filename)
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 65, in parse
paperless-webserver-1  |     self.archive_path = self.convert_to_pdf(document_path, file_name)
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 81, in convert_to_pdf
paperless-webserver-1  |     raise ParseError(
paperless-webserver-1  | documents.parsers.ParseError: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  | 12:24:15 [Q] INFO Process-1:2 stopped doing work
paperless-webserver-1  | 12:24:15 [Q] INFO recycled worker Process-1:2
paperless-webserver-1  | 12:24:15 [Q] INFO Process-1:6 ready for work at 123
paperless-webserver-1  | 12:24:15 [Q] ERROR Failed [Dear Facilitators.docx] - Dear Facilitators.docx: Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office : Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 79, in convert_to_pdf
paperless-webserver-1  |     response.raise_for_status()  # ensure we notice bad responses
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
paperless-webserver-1  |     raise HTTPError(http_error_msg, response=self)
paperless-webserver-1  | requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  |
paperless-webserver-1  | During handling of the above exception, another exception occurred:
paperless-webserver-1  |
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/asgiref/sync.py", line 288, in main_wrap
paperless-webserver-1  |     raise exc_info[1]
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
paperless-webserver-1  |     document_parser.parse(self.path, mime_type, self.filename)
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 65, in parse
paperless-webserver-1  |     self.archive_path = self.convert_to_pdf(document_path, file_name)
paperless-webserver-1  |   File "/usr/src/paperless/src/paperless_tika/parsers.py", line 81, in convert_to_pdf
paperless-webserver-1  |     raise ParseError(
paperless-webserver-1  | documents.parsers.ParseError: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  |
paperless-webserver-1  | During handling of the above exception, another exception occurred:
paperless-webserver-1  |
paperless-webserver-1  | Traceback (most recent call last):
paperless-webserver-1  |   File "/usr/local/lib/python3.9/site-packages/django_q/cluster.py", line 432, in worker
paperless-webserver-1  |     res = f(*task["args"], **task["kwargs"])
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/tasks.py", line 74, in consume_file
paperless-webserver-1  |     document = Consumer().try_consume_file(
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/consumer.py", line 266, in try_consume_file
paperless-webserver-1  |     self._fail(
paperless-webserver-1  |   File "/usr/src/paperless/src/documents/consumer.py", line 70, in _fail
paperless-webserver-1  |     raise ConsumerError(f"{self.filename}: {log_message or message}")
paperless-webserver-1  | documents.consumer.ConsumerError: Dear Facilitators.docx: Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1  |
paperless-webserver-1  | [2022-02-01 12:24:17 +0100] [37] [CRITICAL] WORKER TIMEOUT (pid:40)
paperless-webserver-1  | [2022-02-01 12:24:17 +0100] [37] [WARNING] Worker with pid 40 was terminated due to signal 6
paperless@docker ~/paperless-ng$ cat docker-compose.yml

# docker-compose file for running paperless from the Docker Hub.
# This file contains everything paperless needs to run.
# Paperless supports amd64, arm and arm64 hardware.
#
# All compose files of paperless configure paperless in the following way:
#
# - Paperless is (re)started on system boot, if it was running before shutdown.
# - Docker volumes for storing data are managed by Docker.
# - Folders for importing and exporting files are created in the same directory
#   as this file and mounted to the correct folders inside the container.
# - Paperless listens on port 8000.
#
# In addition to that, this docker-compose file adds the following optional
# configurations:
#
# - Instead of SQLite (default), PostgreSQL is used as the database server.
# - Apache Tika and Gotenberg servers are started with paperless and paperless
#   is configured to use these services. These provide support for consuming
#   Office documents (Word, Excel, Power Point and their LibreOffice counter-
#   parts.
#
# To install and update paperless with this file, do the following:
#
# - Copy this file as 'docker-compose.yml' and the files 'docker-compose.env'
#   and '.env' into a folder.
# - Run 'docker-compose pull'.
# - Run 'docker-compose run --rm webserver createsuperuser' to create a user.
# - Run 'docker-compose up -d'.
#
# For more extensive installation and update instructions, refer to the
# documentation.

version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped

  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless

  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
      - gotenberg
      - tika
    ports:
      - 8000:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - /home/paperless/paperless-ng/consume:/usr/src/paperless/consume
    env_file: docker-compose.env
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998

  gotenberg:
    image: thecodingmachine/gotenberg
    restart: unless-stopped
    environment:
      DISABLE_GOOGLE_CHROME: 1

  tika:
    image: apache/tika
    restart: unless-stopped

volumes:
  data:
  media:
  pgdata:

greenship24 commented 2 years ago

I have/had the same issue. It looks as if Gotenberg has updated their API. An image that does work is thecodingmachine/gotenberg:6.0.0. I'm unsure when the API was updated (it appears they're not respecting semver?), as 6.4.4 did not work with paperless-ng either.

So the workaround would be to use the 6.0.0 tag.

Paperless-ng will have to be updated to use the newer API, which seems to all be under localhost:3000/forms

https://gotenberg.dev/docs/modules/libreoffice
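
To make the difference concrete, here is a minimal Python sketch of the two routes involved: the one paperless-ng requests in the 404 above, and the LibreOffice route under /forms from the linked docs. The helper `office_convert_url` and the labels "legacy" and "forms" are made up for illustration; they are not part of paperless-ng or Gotenberg.

```python
# Routes involved in this issue; the dict keys are hypothetical labels.
ROUTES = {
    "legacy": "/convert/office",            # requested by paperless-ng in the log above
    "forms": "/forms/libreoffice/convert",  # LibreOffice route in the newer Gotenberg API
}


def office_convert_url(endpoint: str, api: str) -> str:
    """Join a Gotenberg base endpoint with the office conversion route."""
    return endpoint.rstrip("/") + ROUTES[api]


print(office_convert_url("http://gotenberg:3000", "legacy"))
print(office_convert_url("http://gotenberg:3000", "forms"))
```

A client built for the first route gets a 404 from a server that only serves the second one, which matches the error in the original report.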

greenship24 commented 2 years ago

https://github.com/jonaswinkler/paperless-ng/commit/2dcacaee147abfdccdca4e20262bae749c60be97

This commit actually fixes it. Just needs to be merged from dev to master and then a new docker image built and pushed.

I'd use the workaround until the maintainers push it to master.

tompsg-git commented 2 years ago

As a workaround you can use this:

PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://paperless-gotenberg:3000/forms/libreoffice/convert#

SB97 commented 2 years ago

As a workaround you can use this:

PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://paperless-gotenberg:3000/forms/libreoffice/convert#

If your setup is vanilla, the workaround in docker-compose.yml looks like this:

#     PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#
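
Why the trailing # matters: judging by the 404 in the first post, paperless-ng appends its own route (/convert/office) to whatever the endpoint variable contains. With a # at the end, the appended part becomes a URL fragment, and HTTP clients do not send fragments to the server. A small Python sketch using only the standard library, assuming plain string concatenation on the paperless side (an assumption, not checked against the paperless-ng source):

```python
from urllib.parse import urlsplit

endpoint = "http://gotenberg:3000/forms/libreoffice/convert#"
appended = "/convert/office"  # route seen in the 404 error; assumed to be appended as-is

parts = urlsplit(endpoint + appended)
print(parts.path)      # path that is actually requested from Gotenberg
print(parts.fragment)  # everything after '#', which stays on the client side
```

This only helps while the image still appends the old route. If the image already appends /forms/libreoffice/convert itself, the endpoint should not carry the extra path.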

sense4t commented 2 years ago

Thanks, that was really helpful!

MegamikeMUC commented 2 years ago

Unfortunately this workaround didn't work for me. PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert# is set. When I try to import a Word document, I get this error:

Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert#/forms/libreoffice/convert

iplaughlin commented 2 years ago

I finally got gotenberg to work. The issue is that, for whatever reason, the container isn't publishing a network port.

Going into Portainer and manually publishing host port 3000 to container port 3000 resolved the issue of Gotenberg not being available. Alternatively, adding the lines

ports:
  - 3000:3000

to a docker compose file also works.

The endpoint setting should then be:

PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
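
A quick way to see whether the Gotenberg port is reachable at all is a plain TCP connect. This is a minimal sketch; `port_is_open` is a hypothetical helper, and the host name and port are the ones from the compose file above. It only shows that something is listening, not that the conversion route exists.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


# Example calls (not executed here):
#   port_is_open("gotenberg", 3000)   # from inside the webserver container
#   port_is_open("localhost", 3000)   # on the host, once the port is published
```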

CodeBrauer commented 2 years ago

Possible solutions I already tried:

I changed the endpoint to PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#, which resulted in this error message:

Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert#/forms/libreoffice/convert

So I changed the endpoint back to the default and added the ports as @iplaughlin wrote. Error message:

Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert

Gotenberg log:

{
  "level": "error",
  "ts": 1650366380.1664767,
  "logger": "api",
  "msg": "convert to PDF: lock long-running LibreOffice listener: acquire LibreOffice listener lock: context deadline exceeded",
  "trace": "52da9339-8761-4dca-bb2e-8ca269ce27ea",
  "remote_ip": "172.18.0.6",
  "host": "gotenberg:3000",
  "uri": "/forms/libreoffice/convert",
  "method": "POST",
  "path": "/forms/libreoffice/convert",
  "referer": "",
  "user_agent": "python-requests/2.27.1",
  "status": 503,
  "latency": 30002593316,
  "latency_human": "30.002593316s",
  "bytes_in": 17375,
  "bytes_out": 19
}
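
Reading this log entry: the latency field appears to be in nanoseconds (compare latency_human), so the request ran for about 30 seconds before Gotenberg answered 503, and the message says it was still waiting for the LibreOffice listener lock. That suggests the route itself is reached this time and the failure is a timeout inside Gotenberg rather than a wrong URL. A short Python sketch that pulls out the relevant fields from a trimmed copy of the entry:

```python
import json

# Trimmed copy of the Gotenberg log entry above.
log_line = (
    '{"level": "error", "status": 503, "latency": 30002593316, '
    '"latency_human": "30.002593316s", "uri": "/forms/libreoffice/convert"}'
)

entry = json.loads(log_line)
latency_s = entry["latency"] / 1e9  # nanoseconds, as implied by latency_human
print(entry["status"], entry["uri"], f"{latency_s:.1f}s")
```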

iplaughlin commented 2 years ago

@CodeBrauer - I ended up spinning up gotenberg in its own container, outside of paperless.

MegamikeMUC commented 2 years ago

For my setup only this worked: image: gotenberg/gotenberg:7.4 (it seems it has to be a Gotenberg version higher than 7). Neither the definition of ports nor the change of endpoint was successful.