python / cpython

`datetime.strptime(dt.strftime("%c"), "%c")` fails when year is <1000. #124529

Open pganssle opened 3 days ago

pganssle commented 3 days ago

Bug report

Bug description:

>>> from datetime import datetime
>>> datetime.strptime(datetime(1000, 1, 1).strftime("%c"), "%c")
datetime.datetime(1000, 1, 1, 0, 0)
>>> datetime.strptime(datetime(999, 1, 1).strftime("%c"), "%c")
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    datetime.strptime(datetime(999, 1, 1).strftime("%c"), "%c")
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nlx5/Documents/Programming/Python/cpython/Lib/_strptime.py", line 573, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
                                    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/home/nlx5/Documents/Programming/Python/cpython/Lib/_strptime.py", line 352, in _strptime
    raise ValueError("time data %r does not match format %r" %
                     (data_string, format))
ValueError: time data 'Tue Jan  1 00:00:00 999' does not match format '%c'

Discovered this when adding some hypothesis tests for strptime/strftime. I doubt this is a real problem anyone is going to have in the real world, but maybe.

I do not know if this is locale-specific or OS-specific.
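To check whether the failure depends on the locale or OS, a small probe along these lines could be run on each platform. This is only a sketch: `roundtrips` is a hypothetical helper name, not part of the standard library, and the printed output will differ between systems.

```python
import locale
import platform
from datetime import datetime


def roundtrips(year, fmt="%c"):
    """Return True if strptime() can parse strftime()'s output for Jan 1 of `year`."""
    original = datetime(year, 1, 1)
    try:
        return datetime.strptime(original.strftime(fmt), fmt) == original
    except ValueError:
        # strptime() rejected the string produced by strftime()
        return False


# Report the environment, then the formatted string and the round-trip result.
print("platform:", platform.system())
print("LC_TIME locale:", locale.getlocale(locale.LC_TIME))
for year in (999, 1000):
    print(year, repr(datetime(year, 1, 1).strftime("%c")), roundtrips(year))
```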

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

terryjreedy commented 3 days ago

The year for datetime.datetime must be and is allowed to be anything in range MINYEAR <= year <= MAXYEAR, which is 1 <= year <= 9999. I expect that the format functions should handle any legal date.
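The legal range can be confirmed directly from the module constants. In this minimal sketch, `is_legal_year` is a hypothetical helper used only for illustration.

```python
import datetime

# The documented bounds for datetime objects.
print(datetime.MINYEAR, datetime.MAXYEAR)


def is_legal_year(year):
    """Return True if datetime.datetime accepts `year`."""
    try:
        datetime.datetime(year, 1, 1)
    except ValueError:
        return False
    return True


# Years inside the range are accepted; 0 and 10000 are rejected.
for year in (0, 1, 999, 9999, 10000):
    print(year, is_legal_year(year))
```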

zuo commented 2 days ago

Considering these results:

>>> datetime(999, 1, 1).strftime("%c")
'Tue Jan  1 00:00:00 999'

>>> datetime.strptime("Tue Jan  1 00:00:00 999", "%c")  # as from strftime() above => the error described above
[snip]
ValueError: time data 'Tue Jan  1 00:00:00 999' does not match format '%c'

>>> datetime.strptime("Tue Jan  1 00:00:00 0999", "%c")  # adding 0 before 999 to have 4-digit width year => success
datetime.datetime(999, 1, 1, 0, 0)

...and the following fragment of the docs (https://docs.python.org/3/library/datetime.html#technical-detail):

  1. The strptime() method can parse years in the full [1, 9999] range, but years < 1000 must be zero-filled to 4-digit width.

...I am not sure if the proviso that years < 1000 must be zero-filled to 4-digit width intentionally covers also this case.

One could argue that it does, and there is nothing to fix here.

Another person, however, could argue that:

What do you think?

[EDIT] The quoted note refers to the %Y format code, not to the %c one. So I believe that that imaginary Another person would be right. :)
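For comparison, the zero-fill requirement that the quoted note describes for %Y can be illustrated as follows. This is a sketch: `parses` is a hypothetical helper, and the behaviour shown is the one the documentation note describes.

```python
from datetime import datetime


def parses(text, fmt):
    """Return True if datetime.strptime() accepts `text` for `fmt`."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


print(parses("0999", "%Y"))  # year zero-filled to 4 digits
print(parses("999", "%Y"))   # year not zero-filled
```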

zuo commented 2 days ago

PS It seems that time.{strftime,strptime}() behave the same (as, apparently, they use the same implementation from _strptime):

$ ./python
Python 3.14.0a0 (heads/main:a4d1fdfb15, Sep 26 2024, 22:47:21) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> t_tuple = time.strptime("Tue Jan  1 00:00:00 0999", '%c')
>>> t_tuple
time.struct_time(tm_year=999, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=1, tm_isdst=-1)
>>> time.strftime('%c', t_tuple)
'Tue Jan  1 00:00:00 999'
>>> time.strptime(_, '%c')
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    time.strptime(_, '%c')
    ~~~~~~~~~~~~~^^^^^^^^^
  File "/home/zuo/cpython/Lib/_strptime.py", line 567, in _strptime_time
    tt = _strptime(data_string, format)[0]
         ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/home/zuo/cpython/Lib/_strptime.py", line 352, in _strptime
    raise ValueError("time data %r does not match format %r" %
                     (data_string, format))
ValueError: time data 'Tue Jan  1 00:00:00 999' does not match format '%c'

zuo commented 2 days ago

Hypothesis

It seems that the source of the problem is that (at least typically – for the C.UTF-8 locale and at least some others, e.g. pl_PL.UTF-8; yet, it seems that also for any other locales...):

...whereas...

Observation

I checked that:

(1) When formatting that example year 999, the results are:

Function/Method                 For "%c"    For "%Y"
time.strftime()                 "999"       "999"
datetime.datetime.strftime()    "999"       "0999" [sic!]

Conclusion: datetime.datetime.strftime()'s %c formatting behaves like time.strftime(), therefore it is not based on datetime.datetime.strftime()'s formatting of %Y.

(2) When parsing that example year 999 (as well as, e.g., 9) – both as part of a full date (%c) and alone (%Y) – only the 4-digit year format is accepted. Fewer digits always cause the same ValueError from _strptime (whose machinery, as noted above, even for %c uses the %Y-specific stuff...).
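The formatting comparison in (1) can be reproduced with a short script like the one below. It is a sketch only: `year_formatting` is a hypothetical helper, and the strings it returns depend on the platform and locale, so no particular output is claimed here.

```python
import time
from datetime import datetime


def year_formatting(year):
    """Return (function, directive, output) rows for Jan 1 of `year`."""
    dt = datetime(year, 1, 1)
    tt = dt.timetuple()  # struct_time for the same date, usable by time.strftime()
    rows = []
    for directive in ("%c", "%Y"):
        rows.append(("time.strftime()", directive, time.strftime(directive, tt)))
        rows.append(("datetime.datetime.strftime()", directive, dt.strftime(directive)))
    return rows


for function, directive, output in year_formatting(999):
    print(f"{function:30} {directive}  {output!r}")
```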

Possible fix

In the _strptime module's machinery (which is used by datetime.datetime.strptime() and time.strptime()): decouple the %c's parsing regex from the %Y's one, making the former more liberal (accepting also 1, 2 or 3 digits in the year number). [The fix implementation would be made in the _strptime module, probably somewhere in LocaleTime.__calc_date_time()/TimeRE.__init__()... in TimeRE's __init__() and pattern()]

(Another theoretically possible variant: just make the %Y's regex more liberal – however that seems too disruptive...)
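The difference between a strict and a more liberal year pattern can be sketched with plain regular expressions. The pattern below is a hand-written simplification of what a C-locale %c layout looks like, not the regex that _strptime actually builds, and all names are hypothetical.

```python
import re

STRICT_YEAR = r"(?P<Y>\d\d\d\d)"  # exactly four digits
LAX_YEAR = r"(?P<Y>\d{1,4})"      # one to four digits

# Simplified stand-in for a "%a %b %e %H:%M:%S %Y" layout (an assumption).
C_PATTERN_STRICT = (
    r"(?P<a>[A-Za-z]{3})\s+(?P<b>[A-Za-z]{3})\s+(?P<d>\d{1,2})\s+"
    r"(?P<H>\d\d):(?P<M>\d\d):(?P<S>\d\d)\s+" + STRICT_YEAR
)
# Swap only the year group, leaving the rest of the pattern untouched.
C_PATTERN_LAX = C_PATTERN_STRICT.replace(STRICT_YEAR, LAX_YEAR)


def year_from(pattern, text):
    """Return the parsed year, or None if `text` does not match `pattern`."""
    match = re.fullmatch(pattern, text)
    return int(match.group("Y")) if match else None


for text in ("Tue Jan  1 00:00:00 999", "Tue Jan  1 00:00:00 0999"):
    print(repr(text), year_from(C_PATTERN_STRICT, text), year_from(C_PATTERN_LAX, text))
```

Replacing only the year group keeps the change local to %c-style patterns, which matches the intent of not touching how %Y is parsed on its own.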

zuo commented 2 days ago

@pganssle @terryjreedy

I'd be happy to implement the fix – if you decide that this should be fixed.

Mariatta commented 2 days ago

No issue on my Macbook laptop

Python 3.14.0a0 (heads/main:162d152146a, Sep 25 2024, 10:45:28) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datetime import datetime
>>> datetime.strptime(datetime(1000, 1, 1).strftime("%c"), "%c")
datetime.datetime(1000, 1, 1, 0, 0)
>>> datetime.strptime(datetime(999, 1, 1).strftime("%c"), "%c")
datetime.datetime(999, 1, 1, 0, 0)

zuo commented 2 days ago

@Mariatta

Could you please check what string is returned on your system from the following call?

>>> datetime(999, 1, 1).strftime("%c")

Thanx :)

PS My guess is that, for your locale, a %c-formatted date+time includes a 2-digit year variant (instead of the 4-digit one).

Mariatta commented 2 days ago

@zuo I just tried it

Python 3.14.0a0 (heads/main:162d152146a, Sep 25 2024, 10:45:28) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datetime import datetime
>>> datetime(999, 1, 1).strftime("%c")
'Tue Jan  1 00:00:00 0999'

zuo commented 1 day ago

@Mariatta

Thank you!

Yeah, that leading zero your platform/locale provides makes strftime's %c format digestible by strptime on your system. Apparently, that's not the case for the Linux family. :-/

Anyway, now it's quite clear to me what the fix should be.

zuo commented 17 hours ago

Proof of concept:

diff --git a/Lib/_strptime.py b/Lib/_strptime.py
index a3f8bb544d..6a2527b75c 100644
--- a/Lib/_strptime.py
+++ b/Lib/_strptime.py
@@ -213,8 +213,10 @@ def __init__(self, locale_time=None):
                                 'Z'),
             '%': '%'})
         base.__setitem__('W', base.__getitem__('U').replace('U', 'W'))
-        base.__setitem__('c', self.pattern(self.locale_time.LC_date_time))
-        base.__setitem__('x', self.pattern(self.locale_time.LC_date))
+        base.__setitem__(
+            'c', self.__pattern_with_lax_year(self.locale_time.LC_date_time))
+        base.__setitem__(
+            'x', self.__pattern_with_lax_year(self.locale_time.LC_date))
         base.__setitem__('X', self.pattern(self.locale_time.LC_time))

     def __seqToRE(self, to_convert, directive):
@@ -236,6 +238,21 @@ def __seqToRE(self, to_convert, directive):
         regex = '(?P<%s>%s' % (directive, regex)
         return '%s)' % regex

+    def __pattern_with_lax_year(self, format):
+        """Like pattern(), but making %y and %Y accept also fewer digits.
+
+        Necessary to ensure that strptime() is able to parse strftime()'s
+        output when the %c or %x format code is used -- considering that
+        for some locales/platforms (e.g., 'C.UTF-8' on Linux), formatting
+        with either %c or %x may cause year numbers, if a number is small,
+        to have fewer digits than usual (e.g., '999' instead of '0999', or
+        '9' instead of '0009' or '09').
+        """
+        pattern = self.pattern(format)
+        pattern = pattern.replace(self['y'], r"(?P<y>\d{1,2})")
+        pattern = pattern.replace(self['Y'], r"(?P<Y>\d{1,4})")
+        return pattern
+
     def pattern(self, format):
         """Return regex pattern for the format string.

[EDIT] After applying the above patch, the error does not occur anymore:

>>> import time
>>> t_tuple = time.strptime("Tue Jan  1 00:00:00 0999", '%c')
>>> t_tuple
time.struct_time(tm_year=999, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=1, tm_isdst=-1)
>>> time.strftime('%c', t_tuple)
'Tue Jan  1 00:00:00 999'
>>> time.strptime(_, '%c')
time.struct_time(tm_year=999, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=1, tm_isdst=-1)
>>> 
>>> from datetime import datetime
>>> datetime(999, 1, 1).strftime('%c')
'Tue Jan  1 00:00:00 999'
>>> datetime.strptime(_, '%c')
datetime.datetime(999, 1, 1, 0, 0)