jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] "Date Created" auto parsing consistently off by 1 day on ingestion #1301

Open pranavmishra90 opened 3 years ago

pranavmishra90 commented 3 years ago

Describe the bug

When paperless-ng ingests a new PDF document, the OCR date parsing for "Date Created" is consistently off by one day.

To Reproduce

Steps to reproduce the behavior:

  1. Add PDF into docker folder /consume
  2. Wait for paperless-ng to import the file
  3. Click the edit "pencil" icon for the newly detected file
  4. Look at the "Date Created" field and cross-check it against the scanned document's preview on the right. The parsed date will be one day earlier than the correct date.

This behavior is consistent for dates in multiple formats (e.g. mm/dd/yyyy, dd/mm/yyyy, Month Day, YYYY).

Expected behavior

The "Date Created" field should match the date printed on the document.

Screenshots

[Image: paperless-ng date issue]

Webserver logs

(Logs do not appear to show this date parsing)

Relevant information

Appears to be similar to issues #1059 and #331.

btorresgil commented 3 years ago

This issue seems to be caused by the usage of dateparser.parse() here: https://github.com/jonaswinkler/paperless-ng/blob/7bc8325df910ab57ed07849a3ce49a3011ba55b6/src/documents/parsers.py#L221

On most scanned documents, there will be a date without a time or timezone. dateparser is configured to return a timezone aware datetime, even if no time or timezone is found in the document. However, it does not acknowledge the timezone set by PAPERLESS_TIME_ZONE. So when it finds a date with no time or timezone (the most common use case) it creates a datetime with the time set to 00:00:00 and the timezone as the server system timezone (UTC by default for docker containers).

Let's take an example where a document has the date Aug 11 2021 at the top of the document when it is scanned.

This is stored in the database as: 2021-08-11 00:00:00+00.

The problem appears when this datetime is read back: before it is displayed to the user, it is converted to the PAPERLESS_TIME_ZONE timezone. For someone in America/Los_Angeles, it is sent to the frontend as 2021-08-10T17:00:00-07:00, which the frontend displays as 08/10/2021, one day early.
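The conversion in the example above can be reproduced with the standard library alone. This is a minimal sketch, not paperless-ng code: it assumes a fixed UTC-7 offset as a stand-in for America/Los_Angeles in August, and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset stands in for America/Los_Angeles in August (PDT).
PDT = timezone(timedelta(hours=-7), "PDT")

# What is stored: the parsed date at midnight in the system timezone (UTC).
stored = datetime(2021, 8, 11, 0, 0, 0, tzinfo=timezone.utc)

# What is sent to the frontend: the same instant expressed in PAPERLESS_TIME_ZONE.
sent = stored.astimezone(PDT)

print(sent.isoformat())           # 2021-08-10T17:00:00-07:00
print(sent.strftime("%m/%d/%Y"))  # 08/10/2021
```

The instant is unchanged; only the calendar date shifts, because midnight UTC falls on the previous evening in any timezone west of UTC.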

Workaround

Since dateparser uses the system timezone when no timezone is found in the document text, a workaround is to set the TZ environment variable to change the system timezone in the docker container. So for example, if you're using the docker-compose.env file, you'd change it to include a TZ environment variable:

# Use this variable to set a timezone for the Paperless Docker containers. If not specified, defaults to UTC.
PAPERLESS_TIME_ZONE=America/Los_Angeles
# Add this line so the system timezone matches PAPERLESS_TIME_ZONE:
TZ=America/Los_Angeles

Unfortunately, this doesn't fix any documents already stored, because it only affects the timezone used in the consumption step. All future documents will get a created field that is adjusted for the timezone, so in the example above the database will store 2021-08-11 07:00:00+00 (instead of 2021-08-11 00:00:00+00), which will be correctly interpreted by the frontend as 08/11/2021.
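The effect of the workaround on the stored value can be checked the same way. Again, this is only a sketch under the assumption that a fixed UTC-7 offset approximates America/Los_Angeles in August; it does not call dateparser or paperless-ng.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset stands in for America/Los_Angeles in August (PDT).
PDT = timezone(timedelta(hours=-7), "PDT")

# With TZ set, the date is interpreted as midnight local time instead of midnight UTC.
parsed = datetime(2021, 8, 11, 0, 0, 0, tzinfo=PDT)

# The database holds the same instant in UTC.
stored_utc = parsed.astimezone(timezone.utc)

# Converting back to PAPERLESS_TIME_ZONE for display recovers the original date.
displayed = stored_utc.astimezone(PDT)

print(stored_utc.isoformat())          # 2021-08-11T07:00:00+00:00
print(displayed.strftime("%m/%d/%Y"))  # 08/11/2021
```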

(Workaround is also related to https://github.com/jonaswinkler/paperless-ng/issues/872)

Possible fixes

One possible fix is to stop storing time/timezone at all, or stop sending time/timezone to the frontend, since the frontend doesn't show the time anyway, only the date. Of course, this would be a breaking change as the API would respond without time or timezone.

Another possible fix is to pass a TIMEZONE setting to dateparser.parse() and set it to the value of PAPERLESS_TIME_ZONE. I haven't tested this, but the docs seem to indicate it would work:

From https://pypi.org/project/dateparser/:

Example where TIMEZONE is used but no timezone in date string, string is interpreted as being in specified timezone:

>>> parse('January 12, 2012 10:00 PM', settings={'TIMEZONE': 'US/Eastern', 'RETURN_AS_TIMEZONE_AWARE': True})
datetime.datetime(2012, 1, 12, 22, 0, tzinfo=<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>)

In case, when timezone is present both in string and also specified using settings, string is parsed into tzaware representation and then converted to timezone specified in settings.
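The second possible fix could be sketched roughly as follows. This is an untested illustration, not the paperless-ng implementation: build_settings and parse_created are hypothetical helper names, and only the TIMEZONE and RETURN_AS_TIMEZONE_AWARE settings quoted above are used. The dateparser package is third-party and is imported lazily, so the settings helper works without it.

```python
from datetime import datetime
from typing import Optional


def build_settings(paperless_time_zone: str) -> dict:
    """Hypothetical helper: dateparser settings that pin the timezone.

    Dates without a timezone in the text would then be interpreted in
    paperless_time_zone rather than in the system timezone.
    """
    return {
        "TIMEZONE": paperless_time_zone,
        "RETURN_AS_TIMEZONE_AWARE": True,
    }


def parse_created(text: str, paperless_time_zone: str) -> Optional[datetime]:
    """Hypothetical wrapper around dateparser.parse(); returns None if no date is found."""
    import dateparser  # third-party: pip install dateparser

    return dateparser.parse(text, settings=build_settings(paperless_time_zone))
```

If the documentation quoted above holds, parse_created("Aug 11 2021", "America/Los_Angeles") would yield a datetime on 2021-08-11 in that timezone, which would still fall on 08/11/2021 after the round trip through UTC. That remains to be verified against a real paperless-ng instance.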