tarantool / tarantool-python

Python client library for Tarantool
https://www.tarantool.io
BSD 2-Clause "Simplified" License

Support datetime extended type #228

Closed DifferentialOrange closed 2 years ago

DifferentialOrange commented 2 years ago

msgpack: support datetime extended type

Tarantool has supported the datetime type since version 2.10.0 [1]. This patch introduces support for the Tarantool datetime type in the msgpack decoders and encoders.

Tarantool datetime objects are decoded to the tarantool.Datetime type, and tarantool.Datetime objects may be encoded back to Tarantool datetime objects.
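As a rough illustration of what the decoder and encoder deal with, here is a minimal sketch of the extension payload. It assumes the layout described in the Tarantool documentation rather than anything stated in this thread: a little-endian int64 of seconds, optionally followed by an int32 nsec, an int16 tzoffset and an int16 tzindex, with the tail omitted when all three are zero. The function names are hypothetical and are not the connector's API.

```python
import struct

# Assumed layout (not taken from this thread): little-endian int64 seconds,
# optionally followed by int32 nsec, int16 tzoffset (minutes), int16 tzindex.
_SECONDS = struct.Struct("<q")
_FULL = struct.Struct("<qihh")


def encode_datetime_payload(seconds, nsec=0, tzoffset=0, tzindex=0):
    """Pack datetime fields; use the short form when the tail is all zeros."""
    if nsec == 0 and tzoffset == 0 and tzindex == 0:
        return _SECONDS.pack(seconds)
    return _FULL.pack(seconds, nsec, tzoffset, tzindex)


def decode_datetime_payload(data):
    """Unpack an 8- or 16-byte payload into (seconds, nsec, tzoffset, tzindex)."""
    if len(data) == _SECONDS.size:
        return _SECONDS.unpack(data) + (0, 0, 0)
    if len(data) == _FULL.size:
        return _FULL.unpack(data)
    raise ValueError(f"unexpected datetime payload length: {len(data)}")


payload = encode_datetime_payload(1661969274, nsec=308543321, tzoffset=180)
fields = decode_datetime_payload(payload)
```

The round trip should return the same four fields that were packed, which is the symmetry the patch aims for between encoding and decoding.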

tarantool.Datetime stores data in a pandas.Timestamp object [2]. You can create tarantool.Datetime objects either from msgpack data or by using the same API as in Tarantool:

dt1 = tarantool.Datetime(year=2022, month=8, day=31,
                         hour=18, minute=7, sec=54,
                         nsec=308543321)

dt2 = tarantool.Datetime(timestamp=1661969274)

dt3 = tarantool.Datetime(timestamp=1661969274, nsec=308543321)

tarantool.Datetime exposes the year, month, day, hour, minute, sec, nsec and timestamp properties, which you can use to convert tarantool.Datetime to any other kind of datetime object:

import pandas

# dt is any tarantool.Datetime object, e.g. dt1 created above
pdt = pandas.Timestamp(year=dt.year, month=dt.month, day=dt.day,
                       hour=dt.hour, minute=dt.minute, second=dt.sec,
                       microsecond=(dt.nsec // 1000),
                       nanosecond=(dt.nsec % 1000))

pandas.Timestamp was chosen to store data because it can hold both nanoseconds and timezone information. The built-in Python datetime.datetime supports microseconds at most, and numpy.datetime64 does not support timezones.
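The precision limit can be illustrated with the standard library alone. This is a minimal sketch, and split_nsec is a hypothetical helper rather than part of the patch: it splits a nanosecond count into the microsecond part that datetime.datetime can hold and the nanosecond remainder that it cannot.

```python
import datetime


def split_nsec(nsec):
    """Split a nanosecond count (0..999999999) into (microsecond, nanosecond)."""
    if not 0 <= nsec < 1_000_000_000:
        raise ValueError("nsec must be in [0, 999999999]")
    return divmod(nsec, 1000)


microsecond, nanosecond = split_nsec(308543321)

# datetime.datetime keeps only the microsecond part; the remaining
# nanoseconds have to be stored separately or dropped.
truncated = datetime.datetime(2022, 8, 31, 18, 7, 54, microsecond)
```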

The Tarantool datetime interval type is planned to be stored in a custom tarantool.Interval type, and we'll need a way to support arithmetic between datetime and interval. This is the main reason we use a custom class instead of a plain pandas.Timestamp. It is also hard to implement Tarantool-compatible timezones with full conversion support without custom classes.

This patch does not yet introduce support for timezones in datetime.

  1. https://github.com/tarantool/tarantool/issues/5941
  2. https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html

Part of #204

msgpack: support tzoffset in datetime

Support non-zero tzoffset in datetime extended type.

Use the tzoffset parameter to set a fixed timezone offset in minutes:

dt = tarantool.Datetime(year=2022, month=8, day=31,
                        hour=18, minute=7, sec=54,
                        nsec=308543321, tzoffset=180)

You may use the tzoffset property to get the timezone offset of a datetime object.

The offset timezone is built with pytz.FixedOffset(). The pytz module is already a dependency of pandas, but this patch adds it as an explicit requirement in case that changes in the future.

This patch doesn't yet introduce support for named timezones (tzindex).
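For readers without pytz at hand, the same idea can be sketched with the standard library. Note the swap: this uses datetime.timezone in place of pytz.FixedOffset(), and fixed_offset is a hypothetical helper name. It assumes tzoffset is expressed in minutes, as in the example above.

```python
import datetime


def fixed_offset(tzoffset_minutes):
    """Build a fixed offset tzinfo from an offset given in minutes."""
    return datetime.timezone(datetime.timedelta(minutes=tzoffset_minutes))


local = datetime.datetime(2022, 8, 31, 18, 7, 54, tzinfo=fixed_offset(180))

# The same instant expressed in UTC
as_utc = local.astimezone(datetime.timezone.utc)

# Reading the offset back in minutes
offset_back = local.utcoffset() // datetime.timedelta(minutes=1)
```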

Part of #204

msgpack: support tzindex in datetime

Support non-zero tzindex in datetime extended type. If both tzoffset and tzindex are specified, tzindex takes precedence (same as in Tarantool [1]).

Use the tz parameter to set the timezone name:

dt = tarantool.Datetime(year=2022, month=8, day=31,
                        hour=18, minute=7, sec=54,
                        nsec=308543321, tz='Europe/Moscow')

You may use the tz property to get the timezone name of a datetime object.

pytz is used to build timezone info. The Tarantool index to Olson name map and its inverse are built with the gen_timezones.sh script, which is based on the tarantool/go-tarantool script [2]. All Tarantool unique and alias timezones are present in the pytz.all_timezones list. Only the following abbreviated timezones from Tarantool are present in pytz.all_timezones (version 2022.2.1):

pytz does not natively support abbreviated timezones because they may be ambiguous [3-5]. Tarantool itself does not support ambiguous abbreviated timezones:

Tarantool 2.10.1-0-g482d91c66

tarantool> datetime.new({tz = 'BST'})
---
- error: 'builtin/datetime.lua:477: could not parse ''BST'' - ambiguous timezone'
...

If an ambiguous timezone is specified, an exception is raised.

The Tarantool header timezones.h [6] provides a map of all abbreviated timezones with category info (all ambiguous timezones are marked with the TZ_AMBIGUOUS flag) and offset info. We parse this info to build a pytz.FixedOffset() timezone for each Tarantool abbreviated timezone not supported natively by pytz.
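The lookup described above could look roughly like the following sketch. The table is a hypothetical two-entry excerpt, not the generated map: its structure and the BST offset value are illustrative assumptions. It uses datetime.timezone in place of pytz.FixedOffset() to stay dependency-free.

```python
import datetime

# Hypothetical excerpt; the real table is generated from timezones.h.
# The BST offset is an illustrative placeholder only.
ABBREV_INFO = {
    "MSK": {"offset": 180, "ambiguous": False},
    "BST": {"offset": 60, "ambiguous": True},
}


def abbrev_timezone(name):
    """Return a fixed offset tzinfo for an unambiguous abbreviated timezone."""
    info = ABBREV_INFO.get(name)
    if info is None:
        raise KeyError(f"unknown abbreviated timezone: {name!r}")
    if info["ambiguous"]:
        # Mirrors the Tarantool behaviour shown above for 'BST'
        raise ValueError(f"could not parse {name!r} - ambiguous timezone")
    return datetime.timezone(datetime.timedelta(minutes=info["offset"]), name)


msk = abbrev_timezone("MSK")
```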

  1. https://www.tarantool.io/en/doc/latest/reference/reference_lua/datetime/new/
  2. https://github.com/tarantool/go-tarantool/blob/5801dc6f5ce69db7c8bc0c0d0fe4fb6042d5ecbc/datetime/gen-timezones.sh
  3. https://stackoverflow.com/questions/37109945/how-to-use-abbreviated-timezone-namepst-ist-in-pytz
  4. https://stackoverflow.com/questions/27531718/datetime-timezone-conversion-using-pytz
  5. https://stackoverflow.com/questions/30315485/pytz-return-olson-timezone-name-from-only-a-gmt-offset
  6. https://github.com/tarantool/tarantool/9ee45289e01232b8df1413efea11db170ae3b3b4/src/lib/tzcode/timezones.h

Closes #204

DifferentialOrange commented 2 years ago

Class inheritance should be improved before review.

>>> dt = tarantool.Datetime(year = 1970, month = 1, day = 2)
>>> dt
Timestamp('1970-01-02 00:00:00')
>>> type(dt)
<class 'tarantool.msgpack_ext.types.datetime.Datetime'>
>>> dt.floor('H')
Timestamp('1970-01-02 00:00:00')
>>> type(dt.floor('H'))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

DifferentialOrange commented 2 years ago

Class inheritance should be improved before review.

>>> dt = tarantool.Datetime(year = 1970, month = 1, day = 2)
>>> dt
Timestamp('1970-01-02 00:00:00')
>>> type(dt)
<class 'tarantool.msgpack_ext.types.datetime.Datetime'>
>>> dt.floor('H')
Timestamp('1970-01-02 00:00:00')
>>> type(dt.floor('H'))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Reworked

DifferentialOrange commented 2 years ago

@oleg-jukovec , thank you for pointing out the issue with custom timezones. Now we encode abbreviated timezones to datetime.timezone with the expected offset and Tarantool name. Decoding them is also supported.

If a user wants to create a tarantool.Datetime with a Tarantool abbreviated timezone, they may build a custom timezone based on our autogenerated timezones info:

import datetime

import tarantool
import tarantool.msgpack_ext.types.timezones as tt_timezones

# Fixed offset timezone carrying the Tarantool abbreviation as its name
tzinfo = datetime.timezone(
    datetime.timedelta(minutes=tt_timezones.timezoneAbbrevInfo['MSK']['offset']),
    name='MSK'
)

dt = tarantool.Datetime(
    year=2022, month=8, day=31, hour=18, minute=7, second=54,
    microsecond=308543, nanosecond=321, tzinfo=tzinfo
)

I'm not sure we should expose tt_timezones.timezoneAbbrevInfo in the main tarantool module, but this is debatable.

DifferentialOrange commented 2 years ago

Timezones in tarantool-python RFC

Tarantool datetime may provide timezone info in two forms: tzoffset and tzindex.

These correspond to the Lua datetime.new{} tzoffset and tz arguments. The tz string is mapped to a tzindex based on the timezones.h header.

Based on the Lua API, a user can set only tz (same as tzindex) or tzoffset. If tz is set, tzoffset is computed from the tz info. When tzindex is set, both tzindex and tzoffset are written to the msgpack data, but tzoffset is expected to be consistent with the tzindex info: see tarantool/tarantool#7680.

What should we use to store timezone info? The solution should satisfy two criteria: it should encode unambiguously to the same msgpack data, and it should be useful.

Let's discuss both of them.

We should be able to encode it unambiguously to the same msgpack, i.e. the same tzindex and tzoffset info. Actually, it's more tricky: not all timezones have a fixed offset. For example, tzoffset for the 'Europe/Moscow' timezone is 180 for 1.1.2008 and 240 for 1.7.2008. So what we actually want is to be able to preserve tzindex and to have tzoffset the same as it would be in Tarantool for the same timestamp + tz (and this is really important, see tarantool/tarantool#7680 again).
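The 'Europe/Moscow' example can be checked with a short sketch. It uses the standard zoneinfo module (Python 3.9+, with system timezone data available) rather than pytz, and tzoffset_minutes is a hypothetical helper name.

```python
import datetime
from zoneinfo import ZoneInfo  # needs tz data installed on the system


def tzoffset_minutes(tz_name, year, month, day):
    """Return the UTC offset, in minutes, of a named zone at local noon."""
    local_noon = datetime.datetime(year, month, day, 12, tzinfo=ZoneInfo(tz_name))
    return local_noon.utcoffset() // datetime.timedelta(minutes=1)


winter = tzoffset_minutes("Europe/Moscow", 2008, 1, 1)
summer = tzoffset_minutes("Europe/Moscow", 2008, 7, 1)
```

The two values should match the 180 and 240 quoted above, which is why a stored tzoffset alone cannot stand in for the named zone.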

What does it mean that it should be useful? It should be not just reference info, but a real timezone. If a user wants to do something with the datetime info, it should behave appropriately. Again, for example, if a user gets 1.1.2008 with 'Europe/Moscow' and then adds half a year with some default method, the result should be 1.7.2008 with 'Europe/Moscow', taking the change from winter time into account. This is also important because we encode timestamp values as time since 1.1.1970 UTC, and that depends on things like winter time.

There is a pytz library (already a dependency of pandas) that implements the Olson database, which Tarantool uses to compute tzoffset for named timezones. We may use the pytz library to work with timezone info. All of Tarantool's ZONE_UNIQUE and ZONE_ALIAS timezones are supported by pytz. Tarantool also has ZONE_ABBREV timezones: timezones with a name and a fixed offset. pytz doesn't know about most of them, but it is easy to implement them manually with pytz.FixedOffset or datetime.timezone(datetime.timedelta) based on the offset info in the timezones.h header.

It is rather inconvenient to store tzindex (or the tz name) with the existing pytz or datetime tools. For example, we need to distinguish fixed offset timezones that have a name from those that do not. You can set a name for datetime.timezone(datetime.timedelta), but it can be retrieved only with a tz.tzname(dt) call. datetime.timezone generates the name on the tzname call, and there is no non-intrusive way to distinguish an autogenerated name from an explicitly set one. pytz.FixedOffset cannot have any name at all (except for pytz.FixedOffset(0), which is actually UTC). So it looks like the only way is to decode tarantool timezones to custom tarantool.Timezone type.
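The naming problem with datetime.timezone can be reproduced with a few lines of standard library code. This is only a sketch of the behaviour described above; the variable names are illustrative.

```python
import datetime

offset = datetime.timedelta(minutes=180)

unnamed = datetime.timezone(offset)        # name is generated on tzname()
named = datetime.timezone(offset, "MSK")   # name is set explicitly

# A zone explicitly named with the generated string cannot be told apart
# from an unnamed one through the public interface.
lookalike = datetime.timezone(offset, unnamed.tzname(None))

generated_name = unnamed.tzname(None)
explicit_name = named.tzname(None)

# Equality ignores names entirely and compares offsets only.
same_zone = unnamed == named
```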

Since we already use pytz.timezone, let's use pytz.FixedOffset as a base class for fixed timezone data.

This type should be useful, so it should implement the standard datetime.tzinfo interface (utcoffset, tzname and dst). It would simply expose the utcoffset, tzname and dst methods of the underlying timezone class. With some additional handles, it would expose tzindex (or the tarantool tz name) and a copy of the underlying class (just in case).

tarantool.Datetime supports building from pandas.Timestamp or with a tzinfo argument. The tzinfo argument or pandas.Timestamp.tzinfo may not be a tarantool.Timezone. Using only tarantool.Timezone in tarantool.Datetime is a way to ensure that everything is symmetrical on encode/decode. So there are two possible ways:

If we don't accept any other timezones, it would be another burden on the user's shoulders. To improve their experience, we may provide some migration advice.

On the other hand, the conversion may not behave as the user expects. Since

Let's describe the conversion rules. If tzinfo is a pytz base class (pytz.tzinfo.BaseTzInfo), we check its .zone attribute, and if it is not None, we use it as the timezone name. As a result, tarantool.Datetime(name=zone) and pytz.timezone(zone) would have the same zone underneath.

If tzinfo is not a pytz base class, we call tz.tzname(dt), defined by the interface, to get a timezone name. For example, pytz._FixedOffset (it is not an instance of pytz.tzinfo.BaseTzInfo) has a None name. We do not use tz.tzname(dt) for a pytz base class because its output is frustrating. For example, tz.tzname(dt) for pytz.timezone('Europe/Moscow') is either MSK or MSD, and tarantool.Datetime(name='Europe/Moscow') and tarantool.Datetime(name='MSK') are different timezones.

If a timezone has a name that is unknown to Tarantool, we raise an error. If a timezone has a None name, we treat it as a fixed offset zone.

We will not implement a tzindex/tzoffset consistency check. We will wait for tarantool/tarantool#7680 updates.
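The rules above can be sketched as follows. This is a simplification under stated assumptions: it duck-types on a zone attribute instead of checking for pytz.tzinfo.BaseTzInfo, KNOWN_TZ_NAMES is a hypothetical two-entry subset of the names Tarantool knows, and NamelessOffset and classify_tzinfo are illustrative names, not part of the connector.

```python
import datetime

# Hypothetical subset of the timezone names known to Tarantool
KNOWN_TZ_NAMES = {"Europe/Moscow", "MSK"}


class NamelessOffset(datetime.tzinfo):
    """Stand-in for a fixed offset zone whose tzname() is None."""

    def __init__(self, minutes):
        self._offset = datetime.timedelta(minutes=minutes)

    def utcoffset(self, dt):
        return self._offset

    def tzname(self, dt):
        return None

    def dst(self, dt):
        return None


def classify_tzinfo(tzinfo, dt):
    """Return ('tz', name) for a named zone or ('tzoffset', minutes) otherwise."""
    # pytz zones carry their Olson name in .zone; prefer it over tzname().
    zone = getattr(tzinfo, "zone", None)
    name = zone if zone is not None else tzinfo.tzname(dt)
    if name is None:
        return ("tzoffset", tzinfo.utcoffset(dt) // datetime.timedelta(minutes=1))
    if name not in KNOWN_TZ_NAMES:
        raise ValueError(f"timezone name {name!r} is unknown to Tarantool")
    return ("tz", name)


moment = datetime.datetime(2022, 8, 31, 18, 7, 54)
by_offset = classify_tzinfo(NamelessOffset(180), moment)
by_name = classify_tzinfo(
    datetime.timezone(datetime.timedelta(minutes=180), "MSK"), moment
)
```

Note that under these rules a plain datetime.timezone without an explicit name is rejected, because its autogenerated name is not a Tarantool timezone name.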

DifferentialOrange commented 2 years ago

to decode tarantool timezones to custom tarantool.Timezone type.

Well, actually, you simply can't do it. pandas cannot work with an arbitrary datetime.tzinfo instance; it works only with pytz or dateutil timezones.

https://github.com/pandas-dev/pandas/issues/15986#issuecomment-315054517
https://github.com/sdispater/pendulum/issues/131

The pandas source code is full of workarounds that detect whether a timezone is a pytz timezone and mess with its internals.

DifferentialOrange commented 2 years ago

After discussion with @oleg-jukovec , we decided to make the tarantool.Datetime API the same as in the Tarantool Lua datetime module. You can build a datetime from a msgpack payload or with the same API as in Tarantool. The object exposes all the properties required to convert it to any other datetime (year, month, day, hour, minute, sec, nsec, timestamp, tzoffset, tz; the names are the same except for minute instead of min, since min is a built-in function name in Python). To simplify the behavior, it does not support built-in conversions to pandas and does not expose the internal pandas.Timestamp or pytz timezone.

DifferentialOrange commented 2 years ago

@oleg-jukovec , @LeonidVas , a new revision has been uploaded, humbly requesting one more review iteration.

DifferentialOrange commented 2 years ago

I think this is a clearer solution than the previous one.

Yeah, it definitely is. Thank you for your advice; I was dissatisfied with the previous version of my implementation too.