microsoft / STL

MSVC's implementation of the C++ Standard Library.
Other
10.1k stars 1.49k forks source link

`<xloctime>`: `get_time("%x")` can't parse what `put_time("%x")` generates with imbued locale #2641

Open StephanTLavavej opened 2 years ago

StephanTLavavej commented 2 years ago
C:\Temp>type meow.cpp
#include <ctime>
#include <iomanip>
#include <iostream>
#include <locale>
#include <sstream>
using namespace std;

string getDate() {
    tm t{};
    t.tm_mon  = 3; // April
    t.tm_mday = 1; // 1st
    t.tm_year = 122; // 2022
    ostringstream oss;
    oss.imbue(locale{"en-US"});
    oss << put_time(&t, "%x");
    return oss.str();
}

void parseDate(const string& date) {
    tm t{};
    istringstream iss(date);
    iss.imbue(locale{"en-US"});
    iss >> get_time(&t, "%x");

    cout << R"(get_time(&t, "%x") returned:)" << endl;
    cout << "months since January, [0, 11]: " << t.tm_mon << endl;
    cout << "    day of the month, [1, 31]: " << t.tm_mday << endl;
    cout << "             years since 1900: " << t.tm_year << endl;
}

int main() {
    const string s = getDate();
    cout << R"(put_time(&t, "%x"): )" << s << endl;
    parseDate(s);
}
C:\Temp>cl /EHsc /nologo /W4 /MTd meow.cpp
meow.cpp

C:\Temp>meow
put_time(&t, "%x"): 4/1/2022
get_time(&t, "%x") returned:
months since January, [0, 11]: 0
    day of the month, [1, 31]: 4
             years since 1900: 120

Root cause:

https://github.com/microsoft/STL/blob/52505b9d3249c50b33eb22152a1f18f208ea2959/stl/inc/xloctime#L475-L477

This is always using "%d / %m / %y" regardless of the locale. In addition to swapping the month and the day for the en-US locale, note that %y is a 2-digit year, so "2022" (t.tm_year = 122) is being parsed as "20" (t.tm_year = 120).

Originally reported as DevCom-1359934 (internal VSO-1289317 / AB#1289317 ) and DevCom-10002862 (internal VSO-1513676 / AB#1513676 ) - as Lewis Pringle noted, "This is a long-standing bug - not a regression."

StephanTLavavej commented 2 years ago

Capital "%X" (the locale-dependent time parsing specifier) is likely affected too.

fsb4000 commented 2 years ago

Maybe %r and %c are also affected.

libc++ finds full strings for the parsing specifiers via analysing some known date

MattStephanson commented 2 years ago

On the Discord, I brought up the possibility of calling GetLocaleInfoEx to get the locale's date format (LOCALE_SSHORTDATE for %x and %c, for example). I put together a proof on concept and then iterated over all the locales on my machine using EnumSystemLocalesEx. This revealed several cases where the date still doesn't round-trip, in three categories:

  1. Several locales (ar, fa, th, for example) use non-Gregorian calendars. I'm not aware of any public Win32 API functions to parse a date, so supporting these would involve reproducing the logic for these calendars in the STL, which doesn't seem viable to me.
  2. There are some locale names that GetLocaleInfoEx accepts, but that the UCRT in wsetlocale translates to other locales with different formats (specifically in _expandlocale and __lc_wcstolc). For examples, "bas" becomes "ba" and "uk" becomes "en-GB". I think we would have to either reproduce the UCRT logic that does this translation, or else expose the layout of the __crt_lc_time_data structure that has the actual UCRT locale and which the STL is currently holding as an opaque void* in _Timevec::_Timeptr.
  3. In ky and ky-KG, the short date format uses the abbreviated month name, which has non-ASCII characters.

Here's the full list of locales that failed on my machine, and a brief summary of the problem:

- ar: CAL_UMALQURA - ar-SA: CAL_UMALQURA - bas: converted to ba - fa: CAL_PERSIAN - fa-IR: CAL_PERSIAN - gsw: converted to gsw-FR - ku-Arab-IR: CAL_PERSIAN - ky: uses non-ASCII abbrv. month name - ky-KG: uses non-ASCII abbrv. month name - lrc: CAL_PERSIAN - lrc-IR: CAL_PERSIAN - mzn: CAL_PERSIAN - mzn-IR: CAL_PERSIAN - pap: converted to pa-Arab, reserved - prs: CAL_PERSIAN - prs-AF: CAL_PERSIAN - ps: CAL_PERSIAN - ps-AF: CAL_PERSIAN - sma: converted to sma-NO - smj: converted to smj-NO - th: CAL_THAI - th-TH: CAL_THAI - uk: converted to en-GB - uz-Arab: CAL_PERSIAN - uz-Arab-AF: CAL_PERSIAN - yi: converted to ii

The first two of these issues could be worked around by not calling strftime at all, but rather generating our own format string based on the GetLocaleInfoEx data. Then it would at least be self-consistent, but maybe there are some issues with legacy locale names or other things that we really do need the UCRT to handle. Presumably there's a reason that the UCRT is going to the trouble of translating the locale name in some cases. The issue with non-ASCII characters seems unavoidable in the char case unless we, in addition to generating our own format strings, also check for this situation and modify the format to use a numeric month.