Open pranavmishra90 opened 3 years ago
This issue seems to be caused by the call to dateparser.parse() here: https://github.com/jonaswinkler/paperless-ng/blob/7bc8325df910ab57ed07849a3ce49a3011ba55b6/src/documents/parsers.py#L221
On most scanned documents, there will be a date without a time or timezone. dateparser is configured to return a timezone-aware datetime, even if no time or timezone is found in the document. However, it does not honor the timezone set by PAPERLESS_TIME_ZONE. So when it finds a date with no time or timezone (the most common case), it creates a datetime with the time set to 00:00:00 and the timezone set to the server's system timezone (UTC by default for Docker containers).
Let's take an example where a document has the date Aug 11 2021 at the top of the document when it is scanned.
This is stored in the database as: 2021-08-11 00:00:00+00.
The problem arises when this datetime is read later: before it is displayed to the user, it is converted to the PAPERLESS_TIME_ZONE timezone. So for someone in America/Los_Angeles, this is sent to the frontend as 2021-08-10T17:00:00-07:00, which the frontend incorrectly displays as 08/10/2021.
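The shift described above can be reproduced with the standard library alone. This is a minimal sketch, not code from paperless-ng: it uses a fixed UTC-7 offset as a stand-in for America/Los_Angeles in August, and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset stands in for America/Los_Angeles (PDT in August).
PDT = timezone(timedelta(hours=-7), "PDT")

# What ends up in the database: midnight on the scanned date, tagged as UTC.
stored = datetime(2021, 8, 11, 0, 0, 0, tzinfo=timezone.utc)

# What is sent to the frontend: the same instant expressed in the user's timezone.
sent = stored.astimezone(PDT)

print(sent.isoformat())  # 2021-08-10T17:00:00-07:00
print(sent.date())       # 2021-08-10, one day earlier than the document's date
```

The instant in time is unchanged by the conversion; only the calendar date that the frontend derives from it moves back by one day.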
Since dateparser uses the system timezone when no timezone is found in the document text, a workaround is to set the TZ environment variable to change the system timezone in the Docker container. For example, if you're using the docker-compose.env file, you'd change it to include a TZ environment variable:
# Use this variable to set a timezone for the Paperless Docker containers. If not specified, defaults to UTC.
PAPERLESS_TIME_ZONE=America/Los_Angeles
TZ=America/Los_Angeles <---------- Add this
Unfortunately, this doesn't fix any documents already stored, because it only affects the timezone used in the consumption step. All future documents will get a created field that is adjusted for the timezone, so in the example above the database will store 2021-08-11 07:00:00+00 (instead of 2021-08-11 00:00:00+00), which will be correctly interpreted by the frontend as 08/11/2021.
(Workaround is also related to https://github.com/jonaswinkler/paperless-ng/issues/872)
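To see why the workaround helps for newly consumed documents, here is a small standard-library sketch of the round trip. As an assumption for illustration, a fixed UTC-7 offset stands in for America/Los_Angeles in August, and the variable names are hypothetical rather than taken from paperless-ng.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset stands in for America/Los_Angeles (PDT in August).
PDT = timezone(timedelta(hours=-7), "PDT")

# With TZ set, midnight is interpreted in the local zone instead of UTC.
local_midnight = datetime(2021, 8, 11, 0, 0, 0, tzinfo=PDT)

# The database keeps the same instant expressed in UTC.
stored = local_midnight.astimezone(timezone.utc)
print(stored)  # 2021-08-11 07:00:00+00:00

# Converting back to the user's zone for display lands on the original date.
shown = stored.astimezone(PDT).date()
print(shown)  # 2021-08-11
```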
One possible fix is to stop storing time/timezone at all, or stop sending time/timezone to the frontend, since the frontend doesn't show the time anyway, only the date. Of course, this would be a breaking change as the API would respond without time or timezone.
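As a rough illustration of the date-only idea, a plain date carries no time or timezone, so there is nothing for a timezone conversion to shift. The payload shape below is hypothetical and is not the actual paperless-ng API response.

```python
from datetime import date

# A date-only value: no time component and no timezone attached.
created = date(2021, 8, 11)

# Hypothetical API payload that sends only the ISO date string.
payload = {"created": created.isoformat()}

print(payload)  # {'created': '2021-08-11'}
```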
Another possible fix would be to pass a setting to dateparser.parse() that sets TIMEZONE to the value of PAPERLESS_TIME_ZONE. I haven't tested this, but the docs seem to indicate it would work:
from: https://pypi.org/project/dateparser/
Example where TIMEZONE is used but no timezone in date string, string is interpreted as being in specified timezone:
>>> parse('January 12, 2012 10:00 PM', settings={'TIMEZONE': 'US/Eastern', 'RETURN_AS_TIMEZONE_AWARE': True})
datetime.datetime(2012, 1, 12, 22, 0, tzinfo=<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>)
In case, when timezone is present both in string and also specified using settings, string is parsed into tzaware representation and then converted to timezone specified in settings.
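Applied to this issue, the idea might look roughly like the sketch below. It is untested against paperless-ng: the function names are hypothetical, the timezone is passed in as a plain string rather than read from the real configuration, and only the two dateparser settings quoted above are used. dateparser is a third-party package, so it is imported inside the function that needs it.

```python
def build_dateparser_settings(time_zone: str) -> dict:
    """Build dateparser settings that interpret timezone-less dates in `time_zone`."""
    return {
        "TIMEZONE": time_zone,             # e.g. the value of PAPERLESS_TIME_ZONE
        "RETURN_AS_TIMEZONE_AWARE": True,  # keep returning timezone-aware datetimes
    }


def parse_document_date(text: str, time_zone: str):
    """Hypothetical helper: parse a date string using the configured timezone."""
    import dateparser  # third-party dependency, imported lazily

    return dateparser.parse(text, settings=build_dateparser_settings(time_zone))


settings = build_dateparser_settings("America/Los_Angeles")
print(settings)
```

A call such as parse_document_date("Aug 11 2021", "America/Los_Angeles") would then be expected to treat the date as local to that zone, though, as noted above, this has not been verified.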
Describe the bug
When paperless-ng ingests a new PDF document, the OCR date parsing for "Date Created" is consistently off by one day.
To Reproduce
Steps to reproduce the behavior:
This behavior is consistent for dates in multiple formats (e.g. mm/dd/yyyy, dd/mm/yyyy, Month Day, YYYY)
Expected behavior
Expected the correct date
Screenshots
Webserver logs
(Logs do not appear to show this date parsing)
Relevant information
Appears to be similar to issues #1059 #331.