php / php-src

The PHP Interpreter
https://www.php.net

DateTime: Z not recognised as UTC (+00:00) #14593

Open marc-mabe opened 3 weeks ago

marc-mabe commented 3 weeks ago

Description

The character Z (and to my knowledge z as well) is known in ISO 8601 to represent the UTC timezone aka +00:00, but DateTime[Immutable] does not detect this as UTC or +00:00. Instead, the letter Z is used as the timezone name. Interestingly, DatePeriod::createFromISO8601String does recognize it as +00:00 correctly.

Additionally there is a note in wikipedia:

The Z suffix in the ISO 8601 time representation is sometimes referred to as "Zulu time" or "Zulu meridian" because the same letter is used to designate the Zulu time zone.[30] However the ACP 121 standard that defines the list of military time zones makes no mention of UTC and derives the "Zulu time" from the Greenwich Mean Time[31] which was formerly used as the international civil time standard. ...

... which makes me wonder if the recognized timezone Z is actually Zulu time or something unknown that falls back to UTC?

The following code: https://3v4l.org/u2fmJ#v8.3.8

<?php
date_default_timezone_set('Europe/Berlin');
$dt = new DateTimeImmutable('2008-03-01T13:00:00Z');
var_dump($dt, $dt->getOffset(), $dt->modify('+6month'), $dt->modify('+6month')->getOffset());

var_dump(DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sT', '2008-03-01T13:00:00Z'));
var_dump(DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sO', '2008-03-01T13:00:00Z'));
var_dump(DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sP', '2008-03-01T13:00:00Z'));

$dp = DatePeriod::createFromISO8601String('R1/2008-03-01T13:00:00Z/P1Y2M10DT2H30M');
foreach ($dp as $d) {
    var_dump($d);
}

Resulted in this output:

object(DateTimeImmutable)#1 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(2)
  ["timezone"]=>
  string(1) "Z"
}
int(0)
object(DateTimeImmutable)#2 (3) {
  ["date"]=>
  string(26) "2008-09-01 13:00:00.000000"
  ["timezone_type"]=>
  int(2)
  ["timezone"]=>
  string(1) "Z"
}
int(0)
object(DateTimeImmutable)#2 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(2)
  ["timezone"]=>
  string(1) "Z"
}
object(DateTimeImmutable)#2 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(2)
  ["timezone"]=>
  string(1) "Z"
}
object(DateTimeImmutable)#2 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(2)
  ["timezone"]=>
  string(1) "Z"
}
object(DateTimeImmutable)#6 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}
object(DateTimeImmutable)#8 (3) {
  ["date"]=>
  string(26) "2009-05-11 15:30:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}

But I expected this output instead:

object(DateTimeImmutable)#1 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}
int(0)
object(DateTimeImmutable)#2 (3) {
  ["date"]=>
  string(26) "2008-09-01 13:00:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}
int(0)
object(DateTimeImmutable)#2 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}
object(DateTimeImmutable)#2 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}
object(DateTimeImmutable)#2 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}
object(DateTimeImmutable)#6 (3) {
  ["date"]=>
  string(26) "2008-03-01 13:00:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}
object(DateTimeImmutable)#8 (3) {
  ["date"]=>
  string(26) "2009-05-11 15:30:00.000000"
  ["timezone_type"]=>
  int(1)
  ["timezone"]=>
  string(6) "+00:00"
}

PHP Version

PHP 8.3

Operating System

No response

heiglandreas commented 3 weeks ago

I would expect indeed in the first 4 cases the timezone part to be exactly what it is. I am passing in "Z" as timezone and therefore the internal timezone identifier is "Z"

The main question to me though is: Why do you care about the internals of a ValueObject?

After all DateTimeImmutable::format() will provide you with the 00:00 when passing in O or P.

The to me unexpected result is what comes from the DatePeriod call as I expected that to have Z internally set as well. But again: I don't really care how that is internally handled.

And to perhaps clear one thing:

The character Z (and to my knowledge z as well) is known in ISO 8601 to represent the UTC timezone aka +00:00

There is a timezone Z, there is a "timezone" UTC and there is an offset 00:00. And all three are something unique.

The timezone Z has an offset against UTC of 00:00 but it is not the same as UTC or 00:00.

Similarly, CET is not the same as +01:00 and MST is not the same as -07:00.
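To make the distinction concrete, here is a minimal sketch (not part of the original comment) that builds the three variants and reads back their names and offsets. The helper name `describeZones` is made up for illustration, and the "Z" name relies on the parsing behaviour reported at the top of this issue.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns name => UTC offset in seconds for the three
 * "zero offset" variants discussed above.
 */
function describeZones(): array
{
    $instant = new DateTimeImmutable('2008-03-01T13:00:00Z');

    $zones = [
        $instant->getTimezone(),    // abbreviation taken from the trailing "Z"
        new DateTimeZone('UTC'),    // tzdb identifier
        new DateTimeZone('+00:00'), // plain offset
    ];

    $result = [];
    foreach ($zones as $zone) {
        // Same offset for all three, but three different names.
        $result[$zone->getName()] = $zone->getOffset($instant);
    }

    return $result;
}

var_dump(describeZones());
```

All three report the same offset, so the difference only becomes visible once the name itself is inspected or stored.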

marc-mabe commented 3 weeks ago

Hi @heiglandreas,

The time zone Z refers to the military Zulu time zone, but in the case of ISO 8601, Z is the zone designator for the zero UTC offset, which is a different thing even if both represent zero offset to UTC.

Z is used very often on the web, as JSON.stringify(new Date()); uses Z, which refers to the ISO 8601 meaning of +00:00 and not to the military Zulu time zone.

The main question to me though is: Why do you care about the internals of a ValueObject?

Because it's not only internals. I do use $dt->getTimeZone()->getName() as we have to store the time zone identifier in a database and in case of time zone identifier we only allow a subset of possible identifiers as all these must be supported by other systems as well (mainly JS, Java, python, Go, R and MySQL, Postgres).

heiglandreas commented 3 weeks ago

According to Wikipedia (Feel free to pay me the CHF 175 for the real McCoy 😉) You can use Z but it is perfectly valid to also use +00:00.

Depending on the concrete format, Z is required though (as for PDF time strings).

In those cases using DateTimeInterface::format('p') is your friend.

For the $dt->getTimezone()->getName() case: Depending on what you are getting you will anyhow need to have a translation matrix from one format into another one. That becomes especially relevant when working with different inputs that are possibly not using the tzdb (Like for example windows!). And even then depending on which files from the tzdb are used you might need a translation matrix to make sure that historical data is handled correctly.

But that is not a problem of the internal state of the value-object but of your code not making sure that whatever comes out of the value-object matches your expectations.

If you are only working with named offsets like Z, UTC, CET or PST (where it is still not clear whether that is Philippine or Pacific Standard Time), the easiest way of moving them between systems is anyhow to use the P or p format option. In that case you are not depending on the internal state of the object and are fully transparent. And you are not losing any information, as you only had the offset previously anyhow.
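A minimal sketch of that suggestion, assuming the receiving system only needs the numeric offset. The helper name `portableOffset` is hypothetical; only `format('P')` comes from the comment above.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: the offset string to hand to another system. */
function portableOffset(DateTimeInterface $dt): string
{
    // "P" renders the offset as +hh:mm, independent of how the timezone
    // is labelled inside the object ("Z", "UTC", "+00:00", ...).
    return $dt->format('P');
}

$dt = new DateTimeImmutable('2008-03-01T13:00:00Z');

echo portableOffset($dt), PHP_EOL;           // offset only
echo $dt->format('Y-m-d\TH:i:sP'), PHP_EOL;  // full timestamp with numeric offset
```

This sidesteps `getTimezone()->getName()` entirely, at the cost of never carrying a real timezone identifier.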

marc-mabe commented 3 weeks ago

I don't have the raw ISO standard either (a shame something like this isn't publicly available).

Formatting such a DateTime object isn't the issue. And yes, I need a translation from given time zone to known/supported timezone or fail.
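For illustration, such a translate-or-fail step could look like the sketch below. The function name, the allow-list and the alias table are all assumptions made up for this example; they are not taken from any real code base.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical translation step: map the name PHP reports to a name from an
 * application-defined allow-list, or fail.
 */
function storableTimezoneName(DateTimeInterface $dt, array $allowed, array $aliases): string
{
    $name = $dt->getTimezone()->getName();
    $name = $aliases[$name] ?? $name; // translate known aliases first

    if (!in_array($name, $allowed, true)) {
        throw new UnexpectedValueException("Unsupported time zone: {$name}");
    }

    return $name;
}

$allowed = ['+00:00', 'UTC', 'Europe/Berlin']; // example values only
$aliases = ['Z' => '+00:00'];                  // ISO 8601 reading of "Z"

echo storableTimezoneName(new DateTimeImmutable('2008-03-01T13:00:00Z'), $allowed, $aliases), PHP_EOL;
```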

What I want to say is that such a translation shouldn't be needed when parsing ISO 8601 date-times, as these don't contain a time zone but only a time zone offset. You cannot use a timezone identifier or abbreviation here, just an offset, and Z is a shortcut for +00:00 in this case, not a time zone.

damianwadley commented 3 weeks ago

ISO 8601 (well, Wikipedia) doesn't say that the "Z" is a timezone. It says that you use the letter "Z" if the time is UTC. It's a subtle difference, but a meaningful one: "Z" doesn't function as a timezone name but as a character in the string that signifies UTC.

These

$dt = new DateTimeImmutable('2008-03-01T13:00:00Z');
var_dump($dt, $dt->getOffset(), $dt->modify('+6month'), $dt->modify('+6month')->getOffset());

are going to parse the "Z" as a timezone identifier, and that timezone will be preserved. Similarly,

var_dump(DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sT', '2008-03-01T13:00:00Z'));
var_dump(DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sO', '2008-03-01T13:00:00Z'));
var_dump(DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sP', '2008-03-01T13:00:00Z'));

will too because you told it explicitly that the token after the time would be a timezone.

On the flip side, the work with DatePeriod

$dp = DatePeriod::createFromISO8601String('R1/2008-03-01T13:00:00Z/P1Y2M10DT2H30M');
foreach ($dp as $d) {
    var_dump($d);
}

also makes sense because in this case you're following the standard where "Z" is a symbol that means UTC, and so DatePeriod is taking that and creating regular UTC/+00:00 times.

In other words, the current behavior seems right to me: it's "Z" when the timezone is an actual identifier, but "+00:00" when there is no timezone identifier but it's known to be UTC.

heiglandreas commented 3 weeks ago

At least with

DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sp', '2008-03-01T13:00:00Z');
// note the lower case `p` at the end

I would expect the Z to be interpreted as +00:00

https://3v4l.org/Gvhpc#v8.3.8

In the other cases I would even be fine with the type 2 timezone, but in that specific case we are explicitly saying that - when used for formatting - we want an offset of 00:00 to be converted into Z, so I'd expect the reverse to happen when using that identifier for creating from format.

To what extent DatePeriod::createFromISO8601String does something completely different and possibly converts the DateTime internally is a separate question.

marc-mabe commented 3 weeks ago

@damianwadley

Your explanation makes sense in a way, but it would also mean the following shouldn't identify Z as a timezone, because I explicitly say I'm parsing an ISO string:

$dt = DateTime::createFromFormat(DateTime::ISO8601, '2000-01-01T00:00:00Z');
var_dump($dt->getTimeZone()->getName()); // string(1) "Z"

$dt = DateTime::createFromFormat(DateTime::ISO8601_EXPANDED, '2000-01-01T00:00:00Z');
var_dump($dt->getTimeZone()->getName()); // string(1) "Z"

damianwadley commented 3 weeks ago

At least with

DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sp', '2008-03-01T13:00:00Z');
// note the lower case `p` at the end

I would expect the Z to be interpreted as +00:00

How so? "p" isn't listed in the createFromFormat docs, but that's probably an oversight and I'd expect it works like the others: the corresponding token is a timezone. And that's all that's supported - there's no differentiation in here between the different types of timezones.

edit: Indeed. https://github.com/php/doc-en/issues/3458

Your explanation makes sense in a way, but it would also mean the following shouldn't identify Z as a timezone, because I explicitly say I'm parsing an ISO string:

Actually no, what you're doing is using the format strings "Y-m-d\\TH:i:sO" and "X-m-d\\TH:i:sP", whose values you happened to get by looking up a couple constants defined on the DateTime class, and those constants happened to go by the names "ISO8601" and "ISO8601_EXPANDED". https://github.com/php/php-src/blob/master/ext/date/php_date.c#L157-L185
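A short sketch that makes this visible: the constants are ordinary strings, so passing the constant or the literal format string is the same call. The expected "Z" name in the comments is the behaviour reported earlier in this thread.

```php
<?php
declare(strict_types=1);

// The constant is just a format string with a descriptive name.
var_dump(DateTimeInterface::ISO8601);
var_dump(DateTimeInterface::ATOM === DateTimeInterface::RFC3339);

// Constant and literal take exactly the same parsing path.
$viaConstant = DateTime::createFromFormat(DateTime::ISO8601, '2000-01-01T00:00:00Z');
$viaLiteral  = DateTime::createFromFormat('Y-m-d\TH:i:sO', '2000-01-01T00:00:00Z');

var_dump($viaConstant->getTimezone()->getName()); // "Z", as reported above
var_dump($viaLiteral->getTimezone()->getName());  // same result
```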

heiglandreas commented 3 weeks ago

the corresponding token is a timezone.

The corresponding token in that case is a special case of an offset and should therefore be treated as an offset.

While looking through the code I noticed that at least the DATE_FORMAT_RFC3339 is broken as it seems to not use the Z at all...

https://github.com/php/php-src/blob/ac947925c0f2e6d8733b530179fb4ed465918f11/ext/date/php_date.c#L136-L155C1

See https://3v4l.org/B5pBY#v8.3.8

But changing that will be a BC-break...

Also:

Your explanation makes sense in a way, but it would also mean the following shouldn't identify Z as a timezone, because I explicitly say I'm parsing an ISO string:

Please check the documentation regarding the ISO strings: https://www.php.net/manual/en/class.datetimeinterface.php#datetimeinterface.constants.iso8601

Note: This format is not compatible with ISO-8601, but is left this way for backward compatibility reasons. Use DateTimeInterface::ISO8601_EXPANDED, DateTimeInterface::ATOM for compatibility with ISO-8601 instead. (ref ISO8601:2004 section 4.3.3 clause d)

damianwadley commented 3 weeks ago

The corresponding token in that case is a special case of an offset and should therefore be treated as an offset.

Sure, as humans we can look at it and know what was meant, but how is PHP (cough, timelib) supposed to know whether Z in a string means UTC+0 or the military timezone known as "Z"? It can't. All it has to work with is a format string that says "token is a timezone" and an input sequence that reads as "Z".

There is one function in here that has any special knowledge about ISO 8601: the DatePeriod stuff. And that handles it correctly.

Kinda feel like what we're aiming for here is a distinction between the different timezone types/names/identifiers/formats/whatever, not the one-size-fits-all arrangement in place now.

heiglandreas commented 3 weeks ago

That is why I was explicitly saying

In the other cases I would even be fine with the type 2 timezone, but in that specific case we are explicitly saying that - when used for formatting - we want an offset of 00:00 to be converted into Z, so I'd expect the reverse to happen when using that identifier for creating from format.

In this specific (lower case p) case we use either the Z or the [+-]xx:yy and interpret it. Having an if (thingInQuestion === 'Z') thingInQuestion = '+00:00'; should be possible, because at that point we already know that thingInQuestion can only be Z or an offset in a given format.

This is only when using a createFromFormat! Not when using the magic in the constructor!
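As a userland approximation of that idea (not the proposed change to php-src itself), a wrapper could post-process the parsed result. The function name `parseOffsetOnly` is hypothetical, and the check relies on offset timezones reporting names of the form `+hh:mm`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper: parse with createFromFormat, read "Z" as +00:00,
 * and reject any other non-offset timezone such as "K" or "CET".
 */
function parseOffsetOnly(string $format, string $input): DateTimeImmutable
{
    $dt = DateTimeImmutable::createFromFormat($format, $input);
    if ($dt === false) {
        throw new InvalidArgumentException("Cannot parse '{$input}'");
    }

    $name = $dt->getTimezone()->getName();

    if ($name === 'Z') {
        // Same instant, relabelled with a plain zero offset.
        return $dt->setTimezone(new DateTimeZone('+00:00'));
    }

    if (preg_match('/^[+-]\d{2}:\d{2}$/', $name) !== 1) {
        throw new InvalidArgumentException("'{$name}' is not a UTC offset");
    }

    return $dt;
}
```

Called as `parseOffsetOnly('Y-m-d\TH:i:sp', '2008-03-01T13:00:00Z')`, this would hand back an object whose timezone name is `+00:00`, while an input ending in `K` would raise an exception.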

damianwadley commented 3 weeks ago

Question, then: what about the other formats that work the same way?

Should D only support three-letter days and l only support the longer names? How about M for short month names and F for long names? Should a and A distinguish between lowercase and uppercase? And how about numbers? In formatting, j / n / g will never have leading zeroes, so does that mean they should never accept leading zeroes when being parsed?

If so, second question: should there be a way to choose to use the current behavior? I'd think so: the flexibility with names and numbers can be very useful.

heiglandreas commented 3 weeks ago

When Importing... Probably.

But in this special case that we are talking about there is no ambiguity. When explicitly using p to create a datetime from a defined format I am not interested in a possible Zulu timezone. Otherwise I would expect

var_dump(DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sp', '2008-03-01T13:00:00K'));

to return a datetime-object with a K timezone....

🙈 https://3v4l.org/8jfMi#v8.3.8

Which it does....

Which is clearly broken as I explicitly said I want to have an offset and in case of the offset being explicitly Z I want that to be treated as 00:00...

Or, should the p not be supported, I would expect the parser to fail as noted in the docs:

Unrecognized characters in the format string will cause the parsing to fail and an error message is appended to the returned structure. You can query error messages with DateTimeImmutable::getLastErrors().

damianwadley commented 3 weeks ago

But in this special case that we are talking about there is no ambiguity.

That's right, it would be a special case: four timezone formatting characters, all of which have different (but possibly overlapping) actual meanings when formatting but the same meaning when parsing, and then one more timezone character that has a different meaning when formatting and also a different meaning when parsing?

How about this: add a strict mode for parsing.

DateTime::createFromFormat(
    string $format,
    string $datetime,
    ?DateTimeZone $timezone = null,
    bool $strict = false
)

When disabled, all specifiers work as they do now and will accept (almost) any value for that type of time unit, regardless of what the character actually means when it's used during formatting. "p" continues to accept anything that looks like a timezone, just like "P" and "O" and "T" and "e" do.

When enabled, the specifiers are more precise and only accept the types of values that could be emitted during formatting. If "p" can only produce an HH:MM or Z while formatting then it also accepts only those when parsing; additionally, +00:00 is not supported and Z is translated to mean UTC+0. For another example, "O" is HHMM and does not accept a colon, while "P" is HH:MM and requires a colon.

In strict mode, "p" will parse Z as UTC+0 while "T" will parse it as the Zulu military timezone.

A strict mode keeps backwards compatibility and allows developers to opt into more precise behavior if/when they have the knowledge that it's safe to do so. It's like the same sort of difference as between strtotime/the parsing magic for unknown strings vs. using createFromFormat for known strings - except without needing a separate set of functions.
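The `$strict` parameter above is only a proposal and does not exist in PHP. Part of it, accepting only what formatting would emit, can be approximated in userland today with a round-trip check. The sketch below uses a made-up function name and does not translate Z to UTC+0; it only accepts or rejects input.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical strict parse: accept the input only if formatting the parsed
 * value with the same format string reproduces the input exactly.
 */
function strictCreateFromFormat(string $format, string $datetime): ?DateTimeImmutable
{
    $dt = DateTimeImmutable::createFromFormat($format, $datetime);
    if ($dt === false) {
        return null;
    }

    return $dt->format($format) === $datetime ? $dt : null;
}

// "P" emits +hh:mm, so only that shape survives the round trip.
var_dump(strictCreateFromFormat('Y-m-d\TH:i:sP', '2008-03-01T13:00:00+01:00') !== null); // accepted
var_dump(strictCreateFromFormat('Y-m-d\TH:i:sP', '2008-03-01T13:00:00Z') !== null);      // rejected
var_dump(strictCreateFromFormat('Y-m-d\TH:i:sO', '2008-03-01T13:00:00+01:00') !== null); // rejected
```

The round trip also rejects inputs that PHP silently normalizes, such as overflowing day numbers, which may or may not be wanted.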

Or, should the p not be supported,

I checked the parser and "p" is supported, it's just not listed in the docs. Made a bug report for it earlier.


Fun fact: I just noticed that the "Z" military timezone is not technically the same as UTC+0. Apparently, the military timezones use a fixed 15° range by longitude, where Z is the range -7.5° to 7.5°. And since standard geopolitical timezones aren't as strictly delineated, there are going to be places in the world where their military timezone is one UTC offset but their (non-DST) "regular" timezone is a different offset.

marc-mabe commented 3 weeks ago

Also noted that DateTimeZone::listAbbreviations() contains nearly every single alpha character.

Changing the example DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sp', '2008-03-01T13:00:00K') to DateTimeImmutable::createFromFormat('Y-m-d\\TH:i:sp', '2008-03-01T13:00:00J') fails as j isn't in the list but k is. No idea where all these time zones are coming from.

The Zulu timezone is defined in this list as Etc/Zulu within the utc timezones. z, on the other hand, is listed there as its own entry with only one item with ["timezone_id"]=> NULL. I wonder if this really refers to the Zulu timezone.
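The observation can be checked with a short sketch. The helper name is hypothetical, and which letters appear depends on the abbreviation table bundled with the PHP build in use.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: single-letter keys in the abbreviation list. */
function singleLetterAbbreviations(): array
{
    $letters = [];
    foreach (array_keys(DateTimeZone::listAbbreviations()) as $abbr) {
        $abbr = (string) $abbr; // keys are abbreviations in lower case
        if (strlen($abbr) === 1) {
            $letters[] = $abbr;
        }
    }
    sort($letters);

    return $letters;
}

echo implode(' ', singleLetterAbbreviations()), PHP_EOL;

// Inspect the standalone "z" entry mentioned above.
var_dump(DateTimeZone::listAbbreviations()['z'] ?? null);
```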

damianwadley commented 3 weeks ago

Those are military timezones. A-Z are all used, with 24 letters used for 24 different UTC±12 zones east/west of 0° longitude, "J" used for the "local" timezone (which is basically useless for anything we'd care about, which is why it's not supported as a timezone), and of course "Z" uses UTC+0.

Etc/Zulu is one of the deprecated timezone names so please ignore it.

I wonder if this really refers to the Zulu timezone.

Depends what you mean by "Zulu". That's a phonetic word for "Z", which could mean either UTC+0 or the Z military timezone, depending on the context. Yes, there really are two different meanings. In fact, this whole issue was originally because of the confusion between those two meanings, but now we're trying to resolve the related problem of how various characters like "p" work when formatting compared to a different behavior when parsing.

heiglandreas commented 3 weeks ago

I'd be very much in favour of such a "strict" parsing mode. And I'd even encourage making it the default mode at one point in time (pun intended).

I might even consider the last two steps as negotiable 😉

marc-mabe commented 3 weeks ago

I do like a strict mode as well to be able to strictly expect one specific input format but I don't think a strict only mode is sufficient as there must be a way to parse am/pm and AM/PM simultaneously.

Furthermore I still think format specifiers as O, P, p, Z should parse time zone offset only and fail if it's not an offset (with the special case of p parsing Z as +00:00).

The documentation needs to be improved as well, as it doesn't reflect existing behavior in all cases (for example, the entry for a and A, "Ante meridiem and Post meridiem", lists "am or pm" and is missing an upper-case example). In case of a strict mode, the documented parsing format identifiers need to be split to reflect the different behavior.

If the strict mode only refers to a migration period for fixing the time zone offset related specifiers, I'm not sure it's worth it. For me it's still a bug that needs to be fixed, maybe not in a patch version, but why add an additional parameter to support something that shouldn't have behaved this way in the first place?

marc-mabe commented 3 weeks ago

This is what I could think of to fix the bug: https://github.com/marc-mabe/php-src/commit/f8bdfe4319de4026d9dbb3dc6a58396cc450c05b