jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

OCR in wrong languages #1480

Open trailingslash opened 2 years ago

trailingslash commented 2 years ago

Hi,

I'm running paperless-ng in Docker on an amd64 Ubuntu server.

When I add a document through the WebUI, it processes for some time without any errors in the logs and reports that the document is ready, but the text has been OCR'd in the wrong language. The first document I tried was in Norwegian; the second was in Chinese and English.

Paperless-ng OCR'd both documents as English only: the Norwegian letters and Chinese characters came out as English-alphabet text in the OCR output.

Logs at the bottom.

# This is my docker-compose.env

###############################################################################
# Paperless-specific settings                                                 #
###############################################################################

PAPERLESS_OCR_LANGUAGES=chi-sim chi-sim-vert chi-tra chi-tra-vert eng nor

# All settings defined in the paperless.conf.example can be used here. The
# Docker setup does not use the configuration file.
# A few commonly adjusted settings are provided below.
PAPERLESS_CONSUMPTION_DIR=../consume
PAPERLESS_DATA_DIR=../data
PAPERLESS_MEDIA_ROOT=../media
PAPERLESS_STATICDIR=../static

# Adjust this key if you plan to make paperless available publicly. It should
# be a very long sequence of random characters. You don't need to remember it.
PAPERLESS_SECRET_KEY=redacted
PAPERLESS_ALLOWED_HOSTS=paper.redacted.dev,192.168.1.2
PAPERLESS_CORS_ALLOWED_HOSTS=https://paper.redacted.dev,http://localhost:8001

# Use this variable to set a timezone for the Paperless Docker containers. If not specified, defaults to UTC.
PAPERLESS_TIME_ZONE=Europe/Oslo
USERMAP_UID=1000
USERMAP_GID=100
PAPERLESS_OPTIMIZE_THUMBNAILS=true
PAPERLESS_TASK_WORKERS=2
PAPERLESS_THREADS_PER_WORKER=2
PAPERLESS_CONSUMER_POLLING=30

# The default language to use for OCR. Set this to the language most of your
# documents are written in.
PAPERLESS_OCR_LANGUAGE=nor+eng
PAPERLESS_OCR_MODE=skip
#PAPERLESS_OCR_OUTPUT_TYPE=pdfa
#PAPERLESS_OCR_PAGES=1
#PAPERLESS_OCR_IMAGE_DPI=300
PAPERLESS_OCR_CLEAN=clean-final
PAPERLESS_OCR_DESKEW=true
#PAPERLESS_OCR_ROTATE_PAGES=true
#PAPERLESS_OCR_ROTATE_PAGES_THRESHOLD=12.0
#PAPERLESS_OCR_USER_ARGS={}
#PAPERLESS_CONVERT_MEMORY_LIMIT=0
#PAPERLESS_CONVERT_TMPDIR=/var/tmp/paperless

# Required services

PAPERLESS_REDIS=redis://localhost:6379
PAPERLESS_DBHOST=localhost
PAPERLESS_DBPORT=5432
PAPERLESS_DBNAME=redacted
PAPERLESS_DBUSER=redacted
PAPERLESS_DBPASS=redacted
PAPERLESS_DBSSLMODE=prefer

# And this is my docker-compose.yml

version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped

  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - /srv/dev-disk-by-uuid-75373a91-92f1-40da-9d8e-d23e2992f002/appdata/paperless/pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: redacted
      POSTGRES_USER: redacted
      POSTGRES_PASSWORD: redacted

  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - 8001:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - /srv/dev-disk-by-uuid-75373a91-92f1-40da-9d8e-d23e2992f002/appdata/paperless/data:/usr/src/paperless/data
      - /srv/dev-disk-by-uuid-75373a91-92f1-40da-9d8e-d23e2992f002/library/paperless/media:/usr/src/paperless/media
      - /srv/dev-disk-by-uuid-75373a91-92f1-40da-9d8e-d23e2992f002/library/paperless/export:/usr/src/paperless/export
      - /srv/dev-disk-by-uuid-75373a91-92f1-40da-9d8e-d23e2992f002/library/paperless/consume:/usr/src/paperless/consume
    env_file: docker-compose.env
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db

volumes:
  data:
  media:
  pgdata:

# Logs

[2021-12-11 15:31:26,263] [INFO] [paperless.management.consumer] Polling directory for changes: ../consume
[2021-12-11 15:31:56,659] [INFO] [paperless.sanity_checker] Sanity checker detected no issues.
[2021-12-11 15:36:17,464] [INFO] [paperless.management.consumer] Polling directory for changes: ../consume
[2021-12-11 15:40:06,081] [INFO] [paperless.management.consumer] Polling directory for changes: ../consume
[2021-12-11 15:43:11,461] [INFO] [paperless.management.consumer] Polling directory for changes: ../consume
[2021-12-11 15:44:51,868] [INFO] [paperless.consumer] Consuming komp_norsk.pdf
[2021-12-11 15:44:51,870] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-12-11 15:44:51,878] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-12-11 15:44:51,881] [DEBUG] [paperless.consumer] Parsing komp_norsk.pdf...
[2021-12-11 15:44:52,347] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-k3wdzhn8
[2021-12-11 15:44:52,474] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-k3wdzhn8', 'output_file': '/tmp/paperless/paperless-a5058alm/archive.pdf', 'use_threads': True, 'jobs': '2', 'language': 'nor+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean_final': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-a5058alm/sidecar.txt'}
[2021-12-11 15:58:40,832] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-12-11 15:58:40,838] [DEBUG] [paperless.consumer] Generating thumbnail for komp_norsk.pdf...
[2021-12-11 15:58:40,841] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-a5058alm/archive.pdf[0] /tmp/paperless/paperless-a5058alm/convert.png
[2021-12-11 15:58:42,124] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-a5058alm/convert.png -out /tmp/paperless/paperless-a5058alm/thumb_optipng.png
[2021-12-11 15:58:45,441] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-12-11 15:58:45,444] [DEBUG] [paperless.consumer] Saving record to database
[2021-12-11 15:58:45,664] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-k3wdzhn8
[2021-12-11 15:58:45,677] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-a5058alm
[2021-12-11 15:58:45,678] [INFO] [paperless.consumer] Document 2019-12-11 komp_norsk consumption finished
[2021-12-11 17:17:02,226] [INFO] [paperless.management.consumer] Polling directory for changes: ../consume
[2021-12-11 17:22:45,174] [INFO] [paperless.management.consumer] Polling directory for changes: ../consume
[2021-12-11 17:25:38,928] [INFO] [paperless.consumer] Consuming chinese_reader_one.pdf
[2021-12-11 17:25:38,930] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-12-11 17:25:38,938] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-12-11 17:25:38,942] [DEBUG] [paperless.consumer] Parsing chinese_reader_one.pdf...
[2021-12-11 17:25:40,381] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-vq3oewom
[2021-12-11 17:25:40,504] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-vq3oewom', 'output_file': '/tmp/paperless/paperless-rw0z2q35/archive.pdf', 'use_threads': True, 'jobs': '2', 'language': 'nor+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean_final': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-rw0z2q35/sidecar.txt'}
[2021-12-11 17:26:19,093] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-12-11 17:32:00,040] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-12-11 17:32:00,046] [DEBUG] [paperless.consumer] Generating thumbnail for chinese_reader_one.pdf...
[2021-12-11 17:32:00,050] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-rw0z2q35/archive.pdf[0] /tmp/paperless/paperless-rw0z2q35/convert.png
[2021-12-11 17:32:01,620] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-rw0z2q35/convert.png -out /tmp/paperless/paperless-rw0z2q35/thumb_optipng.png
[2021-12-11 17:32:10,392] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-12-11 17:32:10,396] [DEBUG] [paperless.consumer] Saving record to database
[2021-12-11 17:32:10,750] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-vq3oewom
[2021-12-11 17:32:10,762] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-rw0z2q35
[2021-12-11 17:32:10,765] [INFO] [paperless.consumer] Document 2021-12-11 chinese_reader_one consumption finished
[2021-12-11 17:33:34,282] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
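
Both "Calling OCRmyPDF with args" lines in the log above show 'language': 'nor+eng', so no Chinese model was requested for chinese_reader_one.pdf. A minimal sketch for pulling that value out of a log line, assuming the args are always printed as a Python dict literal as they are here; the function name and the shortened sample line are illustrative, not part of paperless-ng:

```python
import ast
import re

# Matches the dict literal that follows "Calling OCRmyPDF with args:" in a log line.
ARGS_RE = re.compile(r"Calling OCRmyPDF with args: (\{.*\})\s*$")


def ocr_language_from_log(line):
    """Return the 'language' value passed to OCRmyPDF, or None if no args dict is found."""
    match = ARGS_RE.search(line)
    if match is None:
        return None
    # Assumption: the dict is printed as a plain Python literal, so literal_eval can read it.
    args = ast.literal_eval(match.group(1))
    return args.get("language")


# Shortened copy of the 17:25:40 log line (most args omitted for brevity).
sample = (
    "[2021-12-11 17:25:40,504] [DEBUG] [paperless.parsing.tesseract] "
    "Calling OCRmyPDF with args: {'language': 'nor+eng', 'skip_text': True, "
    "'rotate_pages_threshold': 12.0}"
)
print(ocr_language_from_log(sample))  # nor+eng
```

Running this over the whole log is a quick way to confirm which languages each document was actually processed with.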

trailingslash commented 2 years ago

Output of tesseract --list-langs

List of available languages (11):
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
deu
eng
fra
ita
nor
osd
spa
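
Note that the installed codes listed above use underscores (chi_sim), while the PAPERLESS_OCR_LANGUAGES entries in docker-compose.env use hyphens (chi-sim). A small sketch for checking a candidate PAPERLESS_OCR_LANGUAGE value against this list, assuming the value is a "+"-separated string of Tesseract codes as in nor+eng; the helper name is hypothetical:

```python
# Language codes copied from the `tesseract --list-langs` output above.
INSTALLED = [
    "chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert",
    "deu", "eng", "fra", "ita", "nor", "osd", "spa",
]


def missing_languages(ocr_language, installed):
    """Return the codes in a '+'-separated language string that are not installed."""
    requested = [code for code in ocr_language.split("+") if code]
    return [code for code in requested if code not in installed]


print(missing_languages("nor+eng+chi_sim", INSTALLED))  # []
print(missing_languages("nor+eng+chi-sim", INSTALLED))  # ['chi-sim']
```

The second call shows how the hyphenated spelling would be flagged, since only the underscore form appears in the installed list.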

trailingslash commented 2 years ago

Update: I've added more languages to the PAPERLESS_OCR_LANGUAGE value (previously nor+eng) in docker-compose.env.
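
The exact new value is not given in this comment. As a rough sketch only, assuming the intent is to reuse the hyphenated names from PAPERLESS_OCR_LANGUAGES and that Tesseract expects the underscore forms shown by --list-langs, a value could be built like this (the function name is made up for illustration):

```python
def to_ocr_language(package_names):
    """Turn space-separated, hyphenated names into a '+'-joined Tesseract language string."""
    # Assumption: the only difference between the two spellings is '-' versus '_'.
    codes = [name.replace("-", "_") for name in package_names.split()]
    return "+".join(codes)


print(to_ocr_language("nor eng chi-sim"))  # nor+eng+chi_sim
```

The result would go into docker-compose.env as PAPERLESS_OCR_LANGUAGE=nor+eng+chi_sim; which languages to include depends on the documents being scanned.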

It seems to recognize both Norwegian and Chinese now. However, the OCR quality for the Chinese books is extremely poor: it inserts capitalized Latin letters where there should be Chinese characters.

My question now is: is there any way to improve the OCR quality? It doesn't really matter to me if it takes a day to scan a single book, as long as the OCR is reasonably accurate.

wswv commented 2 years ago

I'm running into the same issue: the Chinese text is almost entirely unrecognized.