paperless-ngx / paperless-ngx

A community-supported supercharged version of paperless: scan, index and archive all your physical documents
https://docs.paperless-ngx.com
GNU General Public License v3.0

[BUG] Paperless fails import at date parsing stage #1201

Closed ghost closed 2 years ago

ghost commented 2 years ago

Description

When I try to upload documents, paperless fails the import at the date parsing stage. I did not have this issue before, so I suspect that an upgrade of some dependency broke something. See the attached logs for more details. As far as I can tell, this happens with all documents.

Steps to reproduce

1) Click on upload
2) Select a PDF
3) Wait for consumption to fail

Webserver logs

Failed task logs:

bad escape \d at position 7 : Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/django_q/cluster.py", line 432, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/share/paperless/src/documents/tasks.py", line 298, in consume_file
document = Consumer().try_consume_file(
File "/usr/share/paperless/src/documents/consumer.py", line 275, in try_consume_file
date = parse_date(self.filename, text)
File "/usr/share/paperless/src/documents/parsers.py", line 265, in parse_date
date = __parser(date_string, settings.DATE_ORDER)
File "/usr/share/paperless/src/documents/parsers.py", line 223, in __parser
return dateparser.parse(
File "/usr/lib/python3.10/site-packages/dateparser/conf.py", line 92, in wrapper
return f(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/dateparser/__init__.py", line 61, in parse
data = parser.get_date_data(date_string, date_formats)
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 428, in get_date_data
parsed_date = _DateLocaleParser.parse(
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 178, in parse
return instance._parse()
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 182, in _parse
date_data = self._parsers[parser_name]()
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 196, in _try_freshness_parser
return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 234, in _get_translated_date
self._translated_date = self.locale.translate(
File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 131, in translate
relative_translations = self._get_relative_translations(settings=settings)
File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 158, in _get_relative_translations
self._generate_relative_translations(normalize=True))
File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 172, in _generate_relative_translations
pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
File "/usr/lib/python3.10/site-packages/regex/regex.py", line 702, in _compile_replacement_helper
is_group, items = _compile_replacement(source, pattern, is_unicode)
File "/usr/lib/python3.10/site-packages/regex/_regex_core.py", line 1737, in _compile_replacement
raise error("bad escape \\%s" % ch, source.string, source.pos)
regex._regex_core.error: bad escape \d at position 7

Webserver logs:

[2022-07-06 08:04:55,324] [INFO] [paperless.consumer] Consuming numérisation0007.pdf
[2022-07-06 08:04:55,325] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-07-06 08:04:55,329] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-07-06 08:04:55,338] [DEBUG] [paperless.consumer] Parsing numérisation0007.pdf...
[2022-07-06 08:04:55,378] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /var/lib/paperless/uploads/paperless-upload-yry9h6qi
[2022-07-06 08:04:55,424] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/var/lib/paperless/uploads/paperless-upload-yry9h6qi', 'output_file': '/var/lib/paperless/uploads/paperless-y9ajutwd/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/var/lib/paperless/uploads/paperless-y9ajutwd/sidecar.txt'}
[2022-07-06 08:05:24,512] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2022-07-06 08:05:24,513] [DEBUG] [paperless.consumer] Generating thumbnail for numérisation0007.pdf...
[2022-07-06 08:05:24,520] [DEBUG] [paperless.parsing] Execute: /usr/bin/convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /var/lib/paperless/uploads/paperless-y9ajutwd/archive.pdf[0] /var/lib/paperless/uploads/paperless-y9ajutwd/convert.png
[2022-07-06 08:05:27,913] [DEBUG] [paperless.parsing.tesseract] Execute: /usr/bin/optipng -silent -o5 /var/lib/paperless/uploads/paperless-y9ajutwd/convert.png -out /var/lib/paperless/uploads/paperless-y9ajutwd/thumb_optipng.png
[2022-07-06 08:08:17,969] [DEBUG] [paperless.classifier] Gathering data from database...
[2022-07-06 08:08:18,316] [DEBUG] [paperless.tasks] Training data unchanged.

Paperless-ngx version

1.7.1

Host OS

Archlinux

Installation method

Bare metal

Configuration changes

None

Other

Package versions for dependencies of paperless:

python-aiohttp 3.8.1-4
python-aioredis1 1.3.1-1
python-arrow 1.2.1-3
python-asgiref 3.5.2-1
python-async-timeout 4.0.2-1
python-attrs 21.4.0-1
python-autobahn 22.5.1-1
python-automat 20.2.0-9
python-blessed 1.19.1-2
python-certifi 2022.06.15-1
python-django-channels 3.0.4-1
python-django-channels-redis 3.3.1-1
python-chardet 4.0.0-5
python-click 8.1.3-1
python-concurrent-log-handler 0.9.19-1
python-constantly 15.1.0-11
python-cryptography 37.0.2-1
python-daphne 3.0.2-1
python-dateparser 1.1.1-1
python-django 4.0.5-1
python-django-cors-headers 3.11.0-0
python-django-extensions 3.1.5-3
python-django-filter 22.1-1
python-django-picklefield 3.1.0-1
python-django-q 1.3.9-4
python-django-rest-framework 3.13.1-1
python-filelock 3.5.1-1
python-fuzzywuzzy 0.18.0-5
python-h11 0.12.0-3
python-hiredis 2.0.0-3
python-httptools 0.3.0-3
python-humanfriendly 10.0-3
python-hyperlink 21.0.0-4
python-idna 3.3-4
python-imap-tools 0.41.0-1
python-incremental 21.3.0-5
python-inotify-simple 1.3.5-1
python-inotifyrecursive 0.3.5-1
python-joblib 1.1.0-3
python-langdetect 1.0.9-3
python-lxml 4.9.0-1
python-msgpack 1.0.3-1
python-numpy 1.22.4-1
python-pathvalidate 2.5.0-1
python-pdf2image 1.16.0-1
python-portalocker 2.4.0-1
python-psycopg2 2.9.3-1
python-pyasn1 0.4.8-7
python-pyasn1-modules 0.2.8-6
python-pycparser 2.21-3
python-pyopenssl 22.0.0-1
python-dateutil 2.8.2-4
python-dotenv 0.20.0-1
python-gnupg 0.4.9-1
python-levenshtein 0.12.2-3
python-magic-ahupp 0.4.25-1
python-pytz 2022.1-1
python-yaml 6.0-1
python-redis 4.0.2-1
python-regex 2022.6.2-1
python-requests 2.27.1-1
python-scikit-learn 1.1.1-1
python-scipy 1.8.1-1
python-service-identity 21.1.0-4
python-six 1.16.0-5
python-sortedcontainers 2.4.0-3
python-sqlparse 0.4.2-3
python-threadpoolctl 3.1.0-1
python-tika 1.16-1
python-twisted 21.7.0-4
python-txaio 22.2.1-1
python-tzlocal 1:2.1-1
python-urllib3 1.26.9-1
python-uvloop 0.16.0-3
python-watchdog 0.10.7-3
python-watchgod 0.13-1
python-wcwidth 0.2.5-6
python-websockets 10.3-1
python-whitenoise 6.2.0-1
python-whoosh 2.7.4-9
python-zope-interface 5.4.0-4
pyzbar 0.1.8-3

ghost commented 2 years ago

After a bit more digging, I found that https://github.com/scrapinghub/dateparser/issues/1052 is responsible, so this has nothing to do with paperless. Closing :)
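
For anyone landing here later, here is a minimal sketch of the failure mode visible in the last frames of the traceback: a replacement string passed to `.sub()` contains `\d`, which is treated as an unknown escape in the replacement template and rejected. The sketch uses Python's standard-library `re` module instead of the third-party `regex` module from the traceback, because `re` rejects the same construct (the exact error message and position may differ). `DIGIT_GROUP` and `TEMPLATE` are hypothetical stand-ins, not dateparser's actual definitions, and the two workarounds are general ways to avoid the error, not necessarily what dateparser did upstream.

```python
import re

# Hypothetical stand-in for the compiled pattern seen in the traceback.
# It matches the literal text "\d+" inside another pattern string.
DIGIT_GROUP = re.compile(re.escape(r"\d+"))

# Hypothetical pattern string whose digit group should become a named group.
TEMPLATE = r"in (\d+) days"


def broken(pattern: str) -> str:
    # "\d" in a replacement template is an unknown escape, so this raises re.error.
    return DIGIT_GROUP.sub(r"?P<n>\d+", pattern)


def escaped(pattern: str) -> str:
    # Doubling the backslash makes the template emit a literal backslash.
    return DIGIT_GROUP.sub(r"?P<n>\\d+", pattern)


def with_callable(pattern: str) -> str:
    # A callable replacement is used verbatim; no template escapes are processed.
    return DIGIT_GROUP.sub(lambda match: r"?P<n>\d+", pattern)


if __name__ == "__main__":
    try:
        broken(TEMPLATE)
    except re.error as exc:
        print("broken replacement rejected:", exc)
    print(escaped(TEMPLATE))
    print(with_callable(TEMPLATE))
```

Both workarounds produce the same pattern string with a named group `n`. The callable form avoids template escaping entirely, which keeps it robust if the replacement text ever gains more backslashes.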

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.