golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: time: POSIX style TZ strings on Unix and timezone handling optimisations #64659

Open unixdj opened 9 months ago

unixdj commented 9 months ago

Proposal Details

Dear Gophers,

This proposal is about local timezone initialisation on Unix and other improvements to timezone handling. I have already implemented most of the proposed features, but wanted to discuss them before submitting patches.

Related proposals:

CC: @rsc

References:

tzcode is the code part of Zoneinfo, dealing with Zoneinfo files and timezone conversions. It's used in glibc and other Unix libc implementations.

A compiled Zoneinfo file contains zero or more static transitions and a TZ string that applies after the last static transition. The TZ string describes either a static zone or a pair of rules describing yearly transition times and target zones.

Introduction: TZ environment variable on Unix (libc/tzcode)

On Unix, the time package reads the local timezone information from a Zoneinfo file according to the value of the TZ environment variable: if it's unset, from /etc/localtime; if it's <file> or :<file>, from <file>. In case of any failure, UTC is used.
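To make the cases above concrete, here is a minimal sketch of the decision. `tzSource` is a hypothetical helper, not a function in the time package, and it only names the file that would be consulted; it does not load anything. The empty-value case follows the time package documentation (TZ="" means UTC).

```go
package main

import (
	"fmt"
	"strings"
)

// tzSource is a hypothetical helper (not part of the time package). It
// reports which Zoneinfo file would be consulted for a given TZ setting.
// set distinguishes an unset variable from an empty one.
func tzSource(value string, set bool) string {
	switch {
	case !set:
		// TZ is unset: use the system default.
		return "/etc/localtime"
	case value == "":
		// TZ is set but empty: the time package uses UTC, no file is read.
		return "UTC"
	}
	// "<file>" and ":<file>" are treated the same way.
	return strings.TrimPrefix(value, ":")
}

func main() {
	fmt.Println(tzSource("", false))              // /etc/localtime
	fmt.Println(tzSource("", true))               // UTC
	fmt.Println(tzSource(":Europe/Berlin", true)) // Europe/Berlin
	fmt.Println(tzSource("Asia/Tokyo", true))     // Asia/Tokyo
}
```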

libc behaves similarly, but if the named file cannot be read and the value does not start with ":", the value is parsed as a POSIX style TZ string. E.g., TZ=JST-9 date will display the date in a timezone named "JST" at UTC+9, and TZ=CET-1CEST,M3.5.0,M10.5.0/3 date will display it in CET (UTC+1) or CEST (UTC+2, DST), the latter applying between the last Sunday of March at 02:00 CET and the last Sunday of October at 03:00 CEST.

POSIX style TZ strings

It would be nice to add support for such TZ settings to Go, to bring it in line with the rest of the system. The time package already has a parser for such strings, as they are used in compiled Zoneinfo files for timestamps after the last static transition.

The implementation requires a new error type for unknown timezones to be returned from loadLocation, so that initLocal can check the error and call tzset only when the zone is not found, and not on other errors.
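A rough sketch of that control flow, using a sentinel error and errors.Is. All names here (`errUnknownZone`, `errMalformed`, `loadZoneFile`, `chooseSource`) are hypothetical stand-ins; the proposal does not name the error type, and the real loadLocation and initLocal are not reproduced.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors: the first marks "zone not found", the second stands
// for any other failure (for example a malformed file).
var (
	errUnknownZone = errors.New("unknown time zone")
	errMalformed   = errors.New("malformed zoneinfo file")
)

// loadZoneFile is a stand-in for loadLocation with hard-coded outcomes.
func loadZoneFile(name string) error {
	switch name {
	case "Europe/Berlin":
		return nil
	case "Broken/Zone":
		return errMalformed
	}
	return fmt.Errorf("%w %s", errUnknownZone, name)
}

// chooseSource mirrors the decision described above: fall back to parsing
// the value as a POSIX TZ string only when the zone was not found.
func chooseSource(tz string) string {
	err := loadZoneFile(tz)
	switch {
	case err == nil:
		return "zoneinfo file"
	case errors.Is(err, errUnknownZone):
		return "POSIX TZ string"
	default:
		return "UTC"
	}
}

func main() {
	fmt.Println(chooseSource("Europe/Berlin")) // zoneinfo file
	fmt.Println(chooseSource("JST-9"))         // POSIX TZ string
	fmt.Println(chooseSource("Broken/Zone"))   // UTC
}
```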

Questions:

FWIW, there's a comment near LoadLocation:

// NOTE(rsc): Eventually we will need to accept the POSIX TZ environment
// syntax too, but I don't feel like implementing it today.

TZ string: limits

tzcode allows absolute UTC offsets less than 25 hours (up to 24:59:59), and time in rules less than 168 hours (7 days). The former is a POSIX requirement, the latter a Zoneinfo extension. Go currently allows less than 168 hours for both. I propose limiting allowed UTC offsets to match those of tzcode.
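The proposed limits amount to two range checks. In this sketch the constant and function names are hypothetical, and treating the rule time limit as a bound on the absolute value is an assumption.

```go
package main

import "fmt"

const (
	maxOffset   = 25 * 3600  // exclusive bound: 24:59:59 is the largest magnitude
	maxRuleTime = 168 * 3600 // exclusive bound: 7 days
)

func abs(n int) int {
	if n < 0 {
		return -n
	}
	return n
}

// validOffset reports whether a UTC offset in seconds is within the
// tzcode limit described above.
func validOffset(sec int) bool { return abs(sec) < maxOffset }

// validRuleTime reports whether a transition time in seconds is within
// the 168 hour limit (assumed here to apply to the absolute value).
func validRuleTime(sec int) bool { return abs(sec) < maxRuleTime }

func main() {
	fmt.Println(validOffset(24*3600 + 59*60 + 59)) // true
	fmt.Println(validOffset(25 * 3600))            // false
	fmt.Println(validRuleTime(26 * 3600))          // true
	fmt.Println(validRuleTime(168 * 3600))         // false
}
```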

Optimisation: rules

Rationale: The current caching approach is based on the assumption that most timezone lookups will be for timestamps around the present. In all but two zoneinfo timezones the TZ string applies in the present (late 2023). Most suggestions here are either pure optimisation or moving calculations from lookup time to be done once at load time.

TZ string parsing

After loading a zoneinfo file, the TZ string is kept in the Location struct and is parsed on every non-cached lookup after the last static transition, whether it describes rules or a static zone. Currently, TZ strings in over 2/3 of all unique Zoneinfo locations, including the two most populated ones ("Asia/Shanghai" and "Asia/Kolkata"), specify static zones.

My proposal is:

Day of week calculation

The only rule kind used in practice is the "M" rule, containing month, week and day of week of the transition. These are used to compute the day of year.
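For readers unfamiliar with "M" rules, the sketch below resolves Mm.w.d to a calendar date for a given year. It leans on the time package for clarity and is not the day-of-year arithmetic used in the runtime or in the proposed patches; `mRuleDate` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// mRuleDate returns the date in year matched by a POSIX "Mm.w.d" rule:
// month m (1-12), week w (1-5, where 5 means "last"), weekday d (0 = Sunday).
func mRuleDate(year, m, w, d int) time.Time {
	first := time.Date(year, time.Month(m), 1, 0, 0, 0, 0, time.UTC)
	// Days from the 1st of the month to the first occurrence of weekday d.
	delta := (d - int(first.Weekday()) + 7) % 7
	day := 1 + delta + (w-1)*7
	// Week 5 may overshoot the month; step back to the last occurrence.
	lastDay := first.AddDate(0, 1, -1).Day()
	for day > lastDay {
		day -= 7
	}
	return time.Date(year, time.Month(m), day, 0, 0, 0, 0, time.UTC)
}

func main() {
	fmt.Println(mRuleDate(2023, 3, 5, 0).Format("2006-01-02"))  // 2023-03-26
	fmt.Println(mRuleDate(2023, 10, 5, 0).Format("2006-01-02")) // 2023-10-29
}
```

The two dates printed are the transitions described by M3.5.0 and M10.5.0 for 2023.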

Simplifying the rule structure

Rule normalisation

With normalised rules the transition happens between year days day and day + 7, inclusive (adding 0-1 days for leap years and 0-6 days for day of week). Without normalisation, it happens between day - 14 and day + 21 (also adding -14 to 14 days for UTC offset and transition time).

Code

After implementing all of the above, and changing tzruleTime to accept the return values of dayOfEpoch(year) and isLeap(year) instead of year and return Unix time, it looks like this (with comments stripped):

func tzruleTime(yearStartDay uint64, r rule, leapYear bool) int64 {
    d := int(yearStartDay) + r.day
    if leapYear && r.addLeapDay {
        d++
    }
    if r.dow >= 0 {
        delta := (d - r.dow) % 7
        d += 6 - delta
    }
    return int64(d*secondsPerDay+r.time) + absoluteToInternal + internalToUnix
}

Zone boundaries

lookup returns the timespan when the zone applies (start and end), used:

Currently, if the zone spans a new year, tzset returns the new year instead of one of the values, to limit the number of transition time calculations to two. This only affects efficiency in the first two cases, but in the last case it affects correctness.

If the optimisations above are applied, the following algorithm results in two transition time computations, except when the second transition in the previous year occurs past the end of the year and past the target time, in which case (which never happens in Zoneinfo) it takes three computations:

Optimisation: lookup

Avoid code duplication

Limitations

The proposed implementations of tzruleTime and lookup may return incorrect results in the following cases:

Resulting speed-up

I wrote benchmarks that load testdata/2020b_Europe_Berlin, create a Time value and run Hour in a loop. The Time is one of:

With optimisations above applied to master (commit 505dff4), the results are:

The benchmarks were run in an uncontrolled environment, so I can't give you more precise results.

Timezone abbreviations allocation

Change LoadLocationFromTZData and abbrevChars to allocate one string for all the chars except trailing NUL and cut abbrevs from it, instead of many strings of 3 to 6 bytes. Especially useful with locations having several zones with the same name (e.g., Europe/Dublin has three zones named "IST") and America/Adak that has "HST" encoded as a substring of "AHST".
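A sketch of the idea, relying on the fact that slicing a Go string shares the backing storage instead of copying. `abbrevAt` is a hypothetical helper, and the table below is made up for illustration; it is not the actual content of any Zoneinfo file.

```go
package main

import (
	"fmt"
	"strings"
)

// abbrevAt returns the NUL-terminated abbreviation that starts at index idx
// of chars. The result is a substring of chars, so no new string is allocated.
func abbrevAt(chars string, idx int) string {
	if idx < 0 || idx >= len(chars) {
		return ""
	}
	rest := chars[idx:]
	if end := strings.IndexByte(rest, 0); end >= 0 {
		return rest[:end]
	}
	return rest
}

func main() {
	// Illustrative abbreviation table: "HST" is stored only as the tail of "AHST".
	chars := "LMT\x00AHST\x00HDT\x00"
	fmt.Println(abbrevAt(chars, 0)) // LMT
	fmt.Println(abbrevAt(chars, 4)) // AHST
	fmt.Println(abbrevAt(chars, 5)) // HST
	fmt.Println(abbrevAt(chars, 9)) // HDT
}
```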

ZONEINFO environment variable

If ZONEINFO is set, LoadLocation tries to load the named zoneinfo file from the path specified by it. This should probably be added to initLocal in src/time/zoneinfo_unix.go for consistency. tzcode does not use this variable.

(Tentative) Optimisation: caching

When a location is loaded, the zone valid now is cached in the Location structure to be used in subsequent lookups. This is good for most uses, but in long running processes (such as servers) lookups will slow down after the next transition.

Tentative proposal:

Cache the last 1 or 2 lookup results as well. Alternatively, only cache the last lookup results. Caching the last 2 lookups is useful for conversions to UTC (e.g., in Date) around a transition; caching more will add too much overhead.

Downside: this will require a sync.RWMutex and taking a read lock on every lookup that misses the "now" cache. A compromise would be calling TryRLock and only writing back the result if locking succeeded.
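One possible shape of the compromise, as a sketch only. The types and method names are hypothetical and not taken from the linked branch. It caches a single result, reads under TryRLock, and writes back under TryLock; using TryLock for the write-back is my reading of the idea, since storing a result needs the write lock. Contention on either side simply skips the cache.

```go
package main

import (
	"fmt"
	"sync"
)

// zoneSpan is a hypothetical cached lookup result: a zone and the half-open
// interval [start, end) of Unix seconds in which it applies.
type zoneSpan struct {
	name       string
	offset     int
	start, end int64
}

// lookupCache holds the most recent lookup result.
type lookupCache struct {
	mu    sync.RWMutex
	last  zoneSpan
	valid bool
}

// get returns the cached span if it covers sec. It never blocks: if the
// lock is contended, it reports a miss.
func (c *lookupCache) get(sec int64) (zoneSpan, bool) {
	if !c.mu.TryRLock() {
		return zoneSpan{}, false
	}
	defer c.mu.RUnlock()
	if c.valid && c.last.start <= sec && sec < c.last.end {
		return c.last, true
	}
	return zoneSpan{}, false
}

// put stores z only if the write lock is immediately available.
func (c *lookupCache) put(z zoneSpan) {
	if !c.mu.TryLock() {
		return
	}
	c.last, c.valid = z, true
	c.mu.Unlock()
}

// lookup consults the cache and falls back to slow on a miss.
func (c *lookupCache) lookup(sec int64, slow func(int64) zoneSpan) zoneSpan {
	if z, ok := c.get(sec); ok {
		return z
	}
	z := slow(sec)
	c.put(z)
	return z
}

func main() {
	var c lookupCache
	calls := 0
	slow := func(sec int64) zoneSpan {
		calls++
		return zoneSpan{name: "CET", offset: 3600, start: 0, end: 100}
	}
	c.lookup(10, slow)
	c.lookup(20, slow) // served from the cache
	fmt.Println(calls) // 1
}
```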

This is useful in particular scenarios, such as a mail server serialising "now" for "Received:" headers, but detrimental in others.

unixdj commented 9 months ago

Code: https://github.com/unixdj/go/tree/tz