prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.89k stars 5.32k forks source link

Presto allows parsing of invalid timezone names #21461

Open pedroerp opened 9 months ago

pedroerp commented 9 months ago

Presto's timezone parsing is too permissive and allows for timezone names that are not valid according to IANA's timezone database, in addition to other bugs:

https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

The following are examples of non-compliant timezone names that are allowed:

ETC/+06:00
ETC/+06
ETC/+6
ETC/UTC*
ETC/UT*

-> the only official formats are "ETC/GMT+1" and similar

EST

-> not supported, but it should be

timezones in the format:

ETC/GMT+10:00

besides not being official, return the wrong result (don't flip the sign as they should)

+1
+01

officially also don't exist (only "+01:00")

Expected Behavior

These should fail to parse, though it might break backwards compatibility.

Current Behavior

They are parsed successfully (incorrectly).

Steps to Reproduce

presto> select from_unixtime(1698528090, 'ETC/+06:00');
             _col0              
--------------------------------
 2023-10-29 03:21:30.000 +06:00 

and other variations described above.

Cc: @mbasmanova @zacw7

mbasmanova commented 9 months ago

CC: @kaikalur @tdcmeehan @aditi-pandit @majetideepak

tdcmeehan commented 9 months ago

Though this breaks backwards compatibility, these timezones seem obscure and it may be acceptable to just warn folks that such timezones will no longer be supported via a release note.

The class that has this bug is TimeZoneKey. The normalizeZoneId method in this class appears to have some bugs, and it appears it is overly permissive in its parsing of etc timezones.

We could also fix up our documentation to say explicitly what values are allowed for the timezone. AFAIA, these are tz identifier values, and GMT is such a value whereas EST is not.

pedroerp commented 9 months ago

AFAIA, these are tz identifier values, and GMT is such a value whereas EST is not

I thought it was weird as well, but it's listed here as a valid timezone name:

https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

tdcmeehan commented 9 months ago

I think the Wikipedia article is mixing up abbreviations with canonical zone IDs. To be fair, they don't appear cleanly delineated in any documentation I can find from IANA, and zdump happily interchanges these abbreviations with canonical ids.

Still, I think it's worthwhile to update the documentation to note abbreviations besides UTC and GMT are not allowed. Firstly, they're sometimes ambiguous (CST may mean a timezone in China or in North America, even EST may sometimes be referred to as an Australian timezone in addition to a North American timezone). Secondly, we use JDK APIs, and three-letter zone ids are explicitly deprecated. The work required to support short IDs just doesn't seem worthwhile given the effort needed and relatively low ROI.

Technically, we do presently allow some obscure short names (such as ROK, PRC, WET). We can leave these undocumented and in the future disallow them. BTW, it's easy to identify an abbreviation: if it's 3 or 4 characters, it's an abbreviation. GMT and UTC are notable exceptions. Because it's easy to identify them, it's easy to document that they're not supported.