scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

TIMEZONE setting unable to handle two-digit UTC offset #1192

Open cwfoo opened 1 year ago

cwfoo commented 1 year ago

These UTC offsets behave as expected:

import dateparser
dateparser.parse('tomorrow', settings={'TIMEZONE': 'UTC+8'})
dateparser.parse('tomorrow', settings={'TIMEZONE': 'UTC+08'})
dateparser.parse('tomorrow', settings={'TIMEZONE': 'UTC+0800'})
dateparser.parse('tomorrow', settings={'TIMEZONE': 'UTC+8:00'})
dateparser.parse('tomorrow', settings={'TIMEZONE': 'UTC+08:00'})
dateparser.parse('tomorrow', settings={'TIMEZONE': '+0800'})
dateparser.parse('tomorrow', settings={'TIMEZONE': '+08:00'})

However, if the offset omits both the "UTC" prefix and the minutes, dateparser raises pytz.exceptions.UnknownTimeZoneError. Example:

import dateparser
dateparser.parse('tomorrow', settings={'TIMEZONE': '+08'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/dateparser/conf.py", line 92, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/__init__.py", line 61, in parse
    data = parser.get_date_data(date_string, date_formats)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/date.py", line 451, in get_date_data
    parsed_date = _DateLocaleParser.parse(
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/date.py", line 200, in parse
    return instance._parse()
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/date.py", line 204, in _parse
    date_data = self._parsers[parser_name]()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/date.py", line 224, in _try_freshness_parser
    return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/freshness_date_parser.py", line 156, in get_date_data
    date, period = self.parse(date_string, settings)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/freshness_date_parser.py", line 91, in parse
    now = apply_timezone(utc_dt, settings.TIMEZONE)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/utils/__init__.py", line 119, in apply_timezone
    new_datetime = apply_tzdatabase_timezone(date_time, tz_string)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/dateparser/utils/__init__.py", line 94, in apply_tzdatabase_timezone
    usr_timezone = timezone(pytz_string)
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pytz/__init__.py", line 201, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: '+08'

Dateparser should support two-digit UTC offsets because the Python standard library sometimes returns such offsets as timezone names. For example:

$ TZ=:Asia/Singapore python3
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime
>>> datetime.datetime.now().astimezone().tzname()
'+08'

Please fix dateparser so that the TIMEZONE setting is able to handle two-digit UTC offsets such as '+08'.
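
Until that is supported, one possible caller-side workaround is to expand a bare two-digit offset into the '+08:00' form, which the examples above show dateparser accepting. The following is only a minimal sketch: normalize_utc_offset is a hypothetical helper name, not part of dateparser, and negative offsets are assumed to behave like the positive ones tested above.

```python
import re


def normalize_utc_offset(tz_name: str) -> str:
    """Expand a bare two-digit offset such as '+08' to '+08:00'.

    Hypothetical helper, not part of dateparser. Any other value
    (e.g. 'UTC+8', '+0800', 'Asia/Singapore') is returned unchanged.
    """
    if re.fullmatch(r"[+-]\d{2}", tz_name):
        return tz_name + ":00"
    return tz_name


# Intended use (requires dateparser to be installed):
#   import datetime, dateparser
#   tz_name = datetime.datetime.now().astimezone().tzname()
#   dateparser.parse('tomorrow',
#                    settings={'TIMEZONE': normalize_utc_offset(tz_name)})
```

Only strings that consist of a sign followed by exactly two digits are rewritten, so named zones and the offset forms that already work pass through untouched.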

anarcat commented 1 year ago

I suggested @cwfoo open this bug report, but after investigating this issue a little further, it does seem like an odd timezone parameter...

I have a patch for undertime here that tries to work around that issue:

https://gitlab.com/anarcat/undertime/-/merge_requests/22

I'm not sure what the right way to go is here. The best would be for dateparser to accept actual tzinfo objects instead of having to pass them as a string in the environment.
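
As long as the setting only takes strings, a caller holding a tzinfo object could format its UTC offset directly rather than relying on tzname(). This is a rough sketch and not necessarily what the undertime patch does: tzinfo_to_offset_string is a hypothetical helper, and it assumes the '+HH:MM' form reported as working at the top of this issue.

```python
import datetime


def tzinfo_to_offset_string(tz, when=None):
    """Format the UTC offset of `tz` at instant `when` as '+HH:MM'.

    Hypothetical helper, not part of dateparser. `when` defaults to
    the current time, since a zone's offset can change with DST.
    """
    if when is None:
        when = datetime.datetime.now(datetime.timezone.utc)
    offset = when.astimezone(tz).utcoffset()
    total_minutes = int(offset.total_seconds()) // 60
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{sign}{hours:02d}:{minutes:02d}"


# Example with a fixed-offset zone from the standard library:
singapore_like = datetime.timezone(datetime.timedelta(hours=8))
tz_setting = tzinfo_to_offset_string(singapore_like)
```

The resulting string could then be passed as settings={'TIMEZONE': tz_setting}. Note that this freezes the offset at one instant, so it loses DST rules that a real tzinfo object would carry.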

Gallaecio commented 1 year ago

> but after investigating this issue a little further, it does seem like an odd timezone parameter...

Indeed.

> The best would be for dateparser to accept actual tzinfo objects instead of having to pass them as a string in the environment.

Sounds like a valid enhancement.

Maybe you could edit the title and description of the issue to be about this enhancement. Or close this issue and open a new one about the enhancement.