python / cpython

The Python programming language
https://www.python.org
Other
63.48k stars 30.4k forks source link

strptime(.., '%c') fails to parse output of strftime('%c', ..) in some locales #53203

Open abalkin opened 14 years ago

abalkin commented 14 years ago
BPO 8957
Nosy @abalkin, @vstinner, @ezio-melotti
Dependencies
  • bpo-8915: Use locale.nl_langinfo in _strptime
  • Files
  • strptime-locale-bug.c: Working C code
  • strptime-locale-bug.py: Failing python code
  • cfmt.py
  • issue8957.py3k.1.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['extension-modules', 'type-bug', '3.7'] title = "strptime(.., '%c') fails to parse output of strftime('%c', ..) in some locales" updated_at = user = 'https://github.com/abalkin' ``` bugs.python.org fields: ```python activity = actor = 'belopolsky' assignee = 'none' closed = False closed_date = None closer = None components = ['Extension Modules'] creation = creator = 'belopolsky' dependencies = ['8915'] files = ['17601', '17602', '20358', '20412'] hgrepos = [] issue_num = 8957 keywords = ['patch'] message_count = 18.0 messages = ['107413', '125966', '125968', '126043', '126045', '126055', '126057', '126059', '126084', '126235', '126313', '126316', '126339', '126340', '126345', '126347', '126358', '221924'] nosy_count = 4.0 nosy_names = ['belopolsky', 'vstinner', 'ezio.melotti', 'rpetrov'] pr_nums = [] priority = 'normal' resolution = None stage = 'needs patch' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue8957' versions = ['Python 3.7'] ```

    Linked PRs

    abalkin commented 14 years ago

    The following code:

    import locale, time
    locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
    t = time.localtime()
    s = time.strftime('%c', t)
    time.strptime('%c', s)

    Raises

    ValueError: time data '%c' does not match format 'Mer 9 jui 16:14:46 2010'

    in any locale where month follows day in '%c' format. Note that attached C code works as expected on my OSX laptop.

    I wonder it it would make sense to call platform strptime where available? I wonder if platform support for strptime has improved since 2002 when _strptime.py was introduced.

    abalkin commented 13 years ago

    Adding bpo-8915 as a dependency because deducing D_T_FMT locale setting from strftime output seems impossible:

    >>> locale.nl_langinfo(locale.D_T_FMT)
    '%a %b %e %H:%M:%S %Y'
    abalkin commented 13 years ago

    Victor,

    You may be interested because your native language is implicated. :-)

    d8d5aad8-e55b-4500-a3a0-9ea982d771ff commented 13 years ago

    time.strptime(s, '%c' ) ?

    abalkin commented 13 years ago

    time.strptime(s, '%c' ) ?

    Oh my. It certainly took a long time to recognize a silly mistake!

    Thanks.

    abalkin commented 13 years ago

    My tests were wrong but the problem does exist. I am attaching a script that tests strptime(.., '%c') for all locales installed on my system (an unmodified US Mac OS X 10.6.6).

    The only failing locale that I recognize is Hebrew (he_IL). Eli, what do you think about this?

    
    $ ./python.exe cfmt.py 
    am_ET [ማክሰ ጃንዩ 11 18:56:18 2011] %A %B %d %H:%M:%S %Y != %a %b %e %H:%M:%S %Y
    et_EE [T, 11. jaan  2011. 18:56:18] %a, %d. %B %Y. %H:%M:%S != %a, %d. %b %Y. %T
    he_IL [EST 18:56:18 2011 ינו 11 ג'] %Z %H:%M:%S %Y %B %d %a != %Z %H:%M:%S %Y %b %d %a
    d8d5aad8-e55b-4500-a3a0-9ea982d771ff commented 13 years ago
    abalkin commented 13 years ago

    On Tue, Jan 11, 2011 at 7:26 PM, Roumen Petrov \report@bugs.python.org\ wrote: ..

    • locales with %A and %B are broken on this platform as %c is "Appropriate date and time representation (%c) with abbreviations"

    According to what standard? POSIX defines it as

    %c Replaced by the locale's appropriate date and time representation.

    http://pubs.opengroup.org/onlinepubs/009695399/functions/strftime.html

    and the manual page on my system agrees:

    %c is replaced by national representation of time and date.

    vstinner commented 13 years ago

    On Linux, cfmt.py fails on fr_FR locale (the only valid locale in the list of tested locales): --- fr_FR [mer. 12 janv. 2011 11:30:35 CET] %a %d %B %Y %H:%M:%S %Z != %a %d %b %Y %T %Z ---

    The problem is the month format: locale.nl_langinfo(locale.D_T_FMT) returns '%a %d %b %Y %T %Z', but _strptime (LocaleTime().LC_date_time) uses '%a %d %B %Y %H:%M:%S %Z' => '%b' vs '%B'.

    _strptime.LocalTime.__calc_date_time() uses strftime('%c') and then parse the output to get the complete format. But it uses strftime('%c') with the march month, and in french, march is formatted 'mars' for both month formats (%b *and* %B).

    _strptime.LocalTime.__calc_date_time() should detect that the month has the same format with %b and %B, and try other timestamps (other months).

    786d3f11-b763-4414-a03f-abc264e0b72d commented 13 years ago

    Alexander, I get the same error for the he_IL locale. Will look into this

    786d3f11-b763-4414-a03f-abc264e0b72d commented 13 years ago

    The problem for Hebrew appears to be the same as the one Victor stated for French. March in Hebrew is also a 3-letter word which means it's equal to its abbreviation.

    786d3f11-b763-4414-a03f-abc264e0b72d commented 13 years ago

    I'm attaching a patch for Lib/_strptime.py that handles the month differently in __calc_date_time. It cycles all months, trying to find one where the full and abbrev names are different and matches it against the timestamp created by strftime.

    This solution is a hack, but so is the whole __calc_date_time function :-) [IMHO]

    All tests pass and I also tried it manually with all the problematic locales reported by Alexander - seems to work correctly.

    If this looks OK to you guys I can commit and backport.

    abalkin commented 13 years ago

    On Sat, Jan 15, 2011 at 2:20 AM, Eli Bendersky \report@bugs.python.org\ wrote: ..

    This solution is a hack, but so is the whole __calc_date_time function :-) [IMHO]

    I am not sure how to proceed. On one hand, I opened this issue to demonstrate that the current implementation is flawed, on the other hand, Eli has succeeded in improving the hack so that we can live with it a bit longer. Note that I did not have any real life application that would misbehave because of this bug and I don't think developers expect %c format to be parseable in the first place.

    I made this issue depend on bpo-8915 because I think strptime should query the locale for format information directly rather than reverse engineer what strftime does.

    I don't think this fix solves all the problems. For example, in most locales (including plain C locale), day of the month in %c format uses %e format, but current implementation guesses it as %d:

    '%a %b %e %H:%M:%S %Y'
    >>> LocaleTime().LC_date_time
    '%a %b %d %H:%M:%S %Y'

    This does not seem to be an issue because strptime with %d seems to be able to parse space-filled as well as zero-filled numbers. However, there may be platforms that are less forgiving.

    On the patch itself:

    1. Unit tests are needed.

    2. Please don't use datetime as a local variable.

    3. I am not sure what the purpose of .lower() is. Are a_month and f_month lowercased?

    4. Please keep lines under 79 characters long.

    5. "for m in range(1, 13)" loop is better written as "for am, fm in zip(self.a_month, self.f_month)"

    Eli, what do you think yourself: should we try to perfect the hack or is it better to reimplement strptime using locale? Note that the latter may be a stepping stone to implementing strftime as well.

    abalkin commented 13 years ago
    1. datetime.find(self.f_month[m]) >= 0 -> self.f_month[m] in datetime

    Python is not C!

    786d3f11-b763-4414-a03f-abc264e0b72d commented 13 years ago

    Alexander,

    1) Patch comments - thanks for those. Will have them fixed. 2) General strategy for implementing strptime. I must confess I don't fully understand the reason for doing what the _strptime module does. Standard C AFAIK has nothing of the sort - it only has strftime and strptime, both using a given format string. Neither tries to guess it from an actual formatted time! Does it exist just to circumvent platforms where strptime isn't implemented in C or is buggy? Can you please shed some light on this (or point me somewhere)?

    With understanding of (2) I will be able to also logically reason about the next steps :-)

    5579dc13-9f48-42d1-bb17-9c003ef6fa70 commented 13 years ago

    You pretty much hit the nail on the head. Some platforms don't have strptime or did not have it at the time this code was written. The locale module is probably more recent than this code as well.

    786d3f11-b763-4414-a03f-abc264e0b72d commented 13 years ago

    Alexander, but still - this isn't just an implementation of strptime. strptime, AFAIU strptime gets the format string as a parameter and uses it to parse a date string into a "tm" struct. So why do we need to parse a date string *without* a format string in Python, resorting to heuristics and pseudo-AI instead?

    abalkin commented 10 years ago

    Eli,

    Given your last comment, are you still proposing your patch for inclusion or should we take the bpo-8915 approach?

    serhiy-storchaka commented 1 month ago

    I agree that using the OS strftime() or at least nl_langinfo() is more preferable, but these issues are already 14 years old. There may be issues:

    So I took the Eli's patch and added numerous tests for strftime() on different locales (they were written on another occasion).

    serhiy-storchaka commented 1 month ago

    I tested with many locales and found several cases in which this patch did not work:

    There are similar bugs related to ambiguity of the day of the week name. For example, in Kashubian, the month name matches the day of the week name for the sample datetime. In Yoruba, an abbreviated day names are used with suffix "ọjọ ́", and this matches the full day name, but not all full day name have such suffix.

    Many locales fail due to space-padded numbers or non-ASCII digits.

    I am not sure what of these bugs I'll fix here and what will leave for other issues.

    serhiy-storchaka commented 1 month ago

    For now, this PR fixes strptime() for %c and %x formats in multiple locales for the following languages: Arabic, Bislama, Chuvash, Estonian, French, Irish, Gurajati, Manx Gaelic, Hebrew, Hindi, Chhattisgarhi, Haitian Kreyol, Japanese, Kannada, Korean, Marathi, Malay, Norwegian, Nynorsk, Punjabi, Rajasthani, Tok Pisin, Yue Chinese, Yau/Nungon and Chinese.

    nineteendo commented 1 month ago

    Is this fixed now? Can it be closed?

    serhiy-storchaka commented 1 month ago

    Not yet. Several other locales can be fixed using other approach. I am working on this.

    serhiy-storchaka commented 1 month ago

    125406 fixes most of other locales that use non-ASCII digits.

    kulikjak commented 1 month ago

    Hi, testing all the merged changes on Solaris, I encountered two issues that I wanted to report:

    1. test_date_time_locale2 started failing now that year 1800 was added:
      File "/..../Lib/_strptime.py", line 436, in _strptime
        raise ValueError("time data %r does not match format %r" %
    ValueError: time data 'January  1, 1800 at 12:00:00 AM LMT' does not match format '%c'

    Based on my digging, I believe that the issue is in the LMT; I dumped the processed_format used for the check, and it doesn't include LMT (it ends with: (?P<Z>cest|gmt|utc|cet))

    I have no idea where that LMT comes from or whether it is Solaris specific though. It's there for old years:

    import locale
    import time
    locale.setlocale(locale.LC_TIME, "de_DE")
    print(time.strftime("%c", (1900, 1, 1, 0, 0, 0, 0, 1, 0)))
    print(time.strftime("%c", (1800, 1, 1, 0, 0, 0, 0, 1, 0)))
     1. Januar 1900 um 00:00:00 Uhr CET
     1. Januar 1800 um 00:00:00 Uhr LMT
    1. %x on Solaris returns only the last two digits of the year for many locales, meaning that test_date_locale2 often fails as it's not possible to distinguish for example 1900 from 2000.

    I don't think that the output of %x is standardized (Windows format is also different, although the year seems always 4 digit)? So it's possible that this might be an issue for other platforms as well.

    serhiy-storchaka commented 3 weeks ago

    Thank you @kulikjak.

    LMT is a standard thing. I think the result of the test depends on the place where they are run. I do not know what to do with LMT, so I am going to just ignore failures if the result contains LMT. This may be a part of a larger issue.

    %x on Solaris returns only the last two digits of the year for many locales

    What locales? We can skip this test on Solaris, but it would be better to keep it running for locales in which it works.

    kulikjak commented 3 weeks ago

    ... so I am going to just ignore failures if the result contains LMT. This may be a part of a larger issue.

    That makes sense. Thank you.

    What locales? We can skip this test on Solaris, but it would be better to keep it running for locales in which it works.

    From those being tested, _enUS, _deDE and _arAE do print only two digits with %x.