python / cpython

The Python programming language
https://www.python.org
Other
63.88k stars 30.57k forks source link

ZoneInfo inconsistent case sensitivity on OS-freebsd vs OS-mac #115022

Open ilankessler opened 10 months ago

ilankessler commented 10 months ago

Bug report

Bug description:

ZoneInfo case sensitive on OS-freebsd and case insensitive on OS-mac.

OS-mac

Python 3.12.1 (v3.12.1:2305ca5144, Dec  7 2023, 17:23:38) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from zoneinfo import ZoneInfo
>>> ZoneInfo("AUSTRALIA/SYDNEY")
zoneinfo.ZoneInfo(key='AUSTRALIA/SYDNEY')

OS-freebsd

Python 3.12.1 (main, Feb  1 2024, 03:58:05) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from zoneinfo import ZoneInfo
>>> ZoneInfo("AUSTRALIA/SYDNEY")
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/zoneinfo/_common.py", line 12, in load_tzdata
    return resources.files(package_name).joinpath(resource_name).open("rb")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/importlib/resources/_common.py", line 46, in wrapper
    return func(anchor)
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/importlib/resources/_common.py", line 56, in files
    return from_package(resolve(anchor))
                        ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/importlib/resources/_common.py", line 82, in _
    return importlib.import_module(cand)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1324, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'tzdata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/zoneinfo/_common.py", line 24, in load_tzdata
    raise ZoneInfoNotFoundError(f"No time zone found with key {key}")
zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key AUSTRALIA/SYDNEY'

Happy to raise a PR to fix

CPython versions tested on:

3.12

Operating systems tested on:

Linux, macOS

hugovk commented 10 months ago

Are you saying it works in both cases for ZoneInfo("Australia/Sydney")?

Is this not a bug, but really the result of different operating systems have different filesystem sensitivity rules?

For one thing, the examples at https://docs.python.org/3/library/zoneinfo.html are written in sentence case, like ZoneInfo("Australia/Sydney").

And:

When ZoneInfo(key) is called, the constructor first searches the directories specified in [TZPATH](https://docs.python.org/3/library/zoneinfo.html#zoneinfo.TZPATH) for a file matching key, and on failure looks for a match in the tzdata package.

ilankessler commented 10 months ago

@hugovk,

Thanks for clarifying - correct, it is a result of different operating systems having different filesystem sensitivity rules... so it may not be technically a bug per the specs you shared.

However this inconsistency doesn't exist in pytz and was unexpected when we migrated.

ronaldoussoren commented 10 months ago

Current behaviour matches system behaviour on both platforms, e.g. on macOS:

% TZ=Europe/Helsinki date
Mon Feb  5 11:24:48 EET 2024

% TZ=EUROPE/HELSINKI date
Mon Feb  5 11:24:58 EET 2024

On BSD and Linux the latter doesn't work. The difference is that macOS has a case insensitive filesystem and BSD and Linux don't. It is possible to emulate a case sensitive file system on macOS (similarly to what we do in the import system), but that feels like overkill here.

BTW. What is the behaviour you expect to see and why?

cc @pganssle

ilankessler commented 10 months ago

BTW. What is the behaviour you expect to see and why?

The expected behaviour was consistency, i.e. ZoneInfo("EUROPE/HELSINKI") to either fail or succeed on both platforms

The main concern stems from the practical side of maintaining cross-platform compatibility, especially noticeable during our migration from pytz to ZoneInfo where we had different behaviour in our dev vs prd environments.

WarpedPixel commented 9 months ago

Is there any value in the time zone database being case sensitive (in some cases)?

I'd like to be able to programmatically normalize a key, if these platform differences will stay the way they are. For example, if code runs correctly with key EUROPE/london on MacOS, it's no guarantee that it will work on Windows (because there is no database in windows, fixable by installing tzdata) or on most Unix systems (due to case sensitive FS). I can't really test it and the code can't fix this data on MacOS. Note that the test if a timezone is valid will pass on MacOS in the example, but fail on Unix even if the same tzdata is installed.

The only solution seems to be to search through the list of all time zones for the name with the proper case, which opens a ton of files.

pganssle commented 9 months ago

I can't think of any problems that are realistically likely to arise from making this case-insensitive, but I'm also very hesitant to do it because, as far as I know, there's no spec that says that TZif file names should be treated as case-insensitive, and I'm always hesitant to lock in guarantees that are not in turn guaranteed by the APIs we're calling or the specs we're using. There was briefly a fork of tzdata over just such an assumption, where I believe Java was assuming that DST was defined as "whichever offset is closer to positive infinity" or something of that nature, and then certain zones started being modeled with "winter time" as the DST zone (for sound reasons, mind you). Similarly, we're already in a bit of a bind because the datetime.dst() function promises to give more data than is strictly available in the TZif files — something we need to work around using heuristics; this probably delayed the creation of the zoneinfo module by at least 5 years!

I also think it would be a bit tricky to pull it off, and would involve some trade-offs that I would rather leave to the user. I think the way to implement this would be to walk the search tree like available_timezones() does, then build a dictionary mapping the casefolded versions of each zone to either the canonical name or the canonical location. available_timezones will open all the files it finds, to make sure that they are actually TZif files, but I believe we would be able to avoid doing so because we can defer the check until someone actually requests the file in this case (since this lookup table isn't being used to determine if a given zone exists, just to effectively create a fake case-insensitive filesystem). The main trade-off here is whether we want to cache the normalization tree or not. If we have to build the whole tree every time a new zone is created, that could get expensive, but if we cache it, we might miss changes if zones are added or removed during the run of the program (though with the existing caching behavior maybe that's fine?)

The other thing weighing against this is that these zone names are not supposed to be user-facing anyway, so you should expect that they're generated by looking them up in a pre-normalized table or something of that nature. In reality, I am aware that everyone creates bad UIs around this kind of thing (particularly because it's much easier than creating a good UI around it), but I am not sure we want to be making that even easier.

The strongest case for action here is basically that we do have a "privileged" position here with respect to the search path, because we haven't yet exposed an API for end-users to get that search path. Without that, in order to build a case-insensitive mapping, they have to re-implement all the logic to figure out the order that Python is going to use to find zones, or they have to call available_zones() and incur the cost of opening every time zone file on the system.

So I would recommend that we either:

  1. Get around to exposing the search path and then recommend that if you want a case-insensitive time zone lookup, you can implement it like that.
  2. Expose a function that maps strings to zone names in a case-insensitive manner for people who want to do that.

I've wanted to do the first one for ages, so obviously I'm +1 on that. I am probably more +0 on the second one.

WarpedPixel commented 9 months ago

Alternatively, we could make it case sensitive in all platforms. That would also work. If it breaks on Mac it breaks everywhere, so it would be testable. Or if the .key returned a canonical key case for a found timezone (return America/New_York on Mac regardless of how it was arrived at).

pganssle commented 9 months ago

Thinking about the caching and lookup a bit more, I realized that it's probably not as bad as I thought at first, if we optimize for the "happy path".

If we did build this in to the ZoneInfo constructor, we could construct a zone as normally and wait until that fails before trying to do any normalization. In the vast majority of cases, we would incur no cost, and if we find that this is a case-changes-only variant of a known string, we can add an entry into the weak reference cache for the case-folded version of the string.

Additionally, the problem with the zone files changing during the run of a program would also just lead to cache invalidation in the unhappy path, because we can invalidate the cache if the zone name doesn't exist in the cache or if the file is no longer present. The downside to that approach would be that in the (likely to be somewhat common) case where someone passes an invalid zone name to the function, we'd need to rebuild at least some sections of the cache every time. Of course that is basically an error case, so it's probably not happening in a tight loop unless someone is doing something very specific like trying to validate a large number of time zone strings in a case-insensitive manner — though of course if they want to do that, it's probably fair to ask them to call available_timezones() and use whatever caching and lookup behavior they want.

Alternatively, we could make it case sensitive in all platforms. That would also work. If it breaks on Mac it breaks everywhere, so it would be testable. Or if the .key returned a canonical key case for a found timezone (return America/New_York on Mac regardless of how it was arrived at).

These both seem like better solutions to the platform-dependent problem, though again it still runs up against the biggest objection I have, which is that the zone files themselves are not guaranteed to be case-insensitive. I don't know what would happen on case-insensitive file systems if Europe/LONDON represented a different zone than Europe/London, and I also don't know if there is some weird system out there relying on being able to distinguish between things like this (e.g. "Oh no you broke our system where we check for changes to time zones by always deploying version N-1 to an all-caps version of the zoneinfo hierarchy and comparing results!"). I certainly hope that's not a real thing that's happening out there, maybe we can ask the IANA tz mailing list for comment.

WarpedPixel commented 9 months ago

I certainly hope that's not a real thing that's happening out there, maybe we can ask the IANA tz mailing list for comment.

That would be wise.

IANA's guidance on case dependency says:

Do not use names that differ only in case. Although the reference implementation is case-sensitive, some other implementations are not, and they would mishandle names differing only in case.

My solution for now is to NOT validate case and hope for the best. If the wrong thing is ever written to the database I can run a slow process to fix it.

WarpedPixel commented 9 months ago

I just wanted to provide some color on why this matters. The scenario is a Multiplatform app with a backend DB. Obviously we would all be better off if case was enforced everywhere, or nowhere.

  1. Running on Mac, I call an API, for example a geolocation API that returns a timezone ID. Hopefully a good one with precisely the right capitalization. The ID is tested on my machine and it works. I persist it to the DB.
  2. Running on a Unix flavor, I load that up and it turns out that capitalization was wrong. The value in the DB is wrong and I have no simple way to normalize the case and fix it (which I would have done on the Mac to begin with before committing a bad value to the DB). Use on this machine of the timezone ID fails.
ilankessler commented 9 months ago

@WarpedPixel and @pganssle thanks for your response

If we did build this in to the ZoneInfo constructor, we could construct a zone as normally and wait until that fails before trying to do any normalization

This was what I was thinking in terms of implementation - let me know if you want me to raise a PR 👍