microsoft / STL

MSVC's implementation of the C++ Standard Library.
Other
9.95k stars 1.46k forks source link

`<iomanip>`: `std::put_time` should copy unknown conversion specifiers instead of crash #4820

Open TheStormN opened 2 weeks ago

TheStormN commented 2 weeks ago

Describe the bug

I know that behind the scenes std::put_time does use strftime which leads to crash if some kind of invalid input is provided. I haven't exactly read the standard, but the cplusplus.com docs on the function says if anything wrong happens, it should set the failbit on the stream and not crash the program. While undefined behavior is allowed for strftime on invalid input, this does not seems to be the case for std::put_time and this is why I'm reporting this as a bug.

Now, some ideas from my research on the topic. According to MS docs the only way to recover from strftime error case is to use _set_invalid_parameter_handler/_set_thread_local_invalid_parameter_handler https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/set-invalid-parameter-handler-set-thread-local-invalid-parameter-handler?view=msvc-170

The first one seems unfeasible as it will modify the global CRT invalid handler for C functions. The second option about thread local storage might be a better option if set and unset before/after calling std::put_time and using it to properly flag the stream error. This would still trigger an assert in debug mode, but at least it will not crash release versions.

But in general, the best solution for me would be for the UCRT team to modify the existing strftime implementations or create a completely new variant which can handle errors in a more graceful manner, but if this is not possible at least the STL team should try to handle this case somehow by perhaps using the workaround I've proposed.

Command-line test case

C:\Temp>type repro.cpp
#include <iostream>
#include <iomanip>
#include <sstream>
#include <ctime>

int main() {
    std::time_t t = std::time(0);
    std::tm* now = std::localtime(&t);
    std::ostringstream output;
    output << std::put_time(now, "%E%J%P"); // some random invalid formatting string

    if(output.good()) {
        std::cout << "All good\n";
    } else {
        std::cout << "Date time formatting failed due to invalid input\n";
    }
}

C:\Temp>cl /EHsc /W4 /WX /std:c++20 .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.36.32546 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

repro.cpp
Microsoft (R) Incremental Linker Version 19.36.32546.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:repro.exe
repro.obj

C:\Temp>.\repro.exe
!!!App Crashed!!!

Expected behavior

That would be for the program to NOT crash and output "Date time formatting failed due to invalid input".

(Edited: See analysis below.)

STL version

VS 2022 17.6.16

frederick-vs-ja commented 2 weeks ago

While undefined behavior is allowed for strftime on invalid input

See WG14-N3220 7.29.3.5/2:

The strftime function places characters into the array pointed to by s as controlled by the string pointed to by format. The format shall be a multibyte character sequence, beginning and ending in its initial shift state. The format string consists of zero or more conversion specifiers and ordinary multibyte characters. A conversion specifier consists of a % character, possibly followed by an E or O modifier character (described later), followed by a character that determines the behavior of the conversion specifier. All ordinary multibyte characters (including the terminating null character) are copied unchanged into the array. If copying takes place between objects that overlap, the behavior is undefined. No more than maxsize characters are placed into the array.

IIUC "%E%J%P" is just a valid input for strftime. The %J part is just not a conversion specifier, and thus should be copied to the output as-is.

but the cplusplus.com docs on the function says if anything wrong happens, it should set the failbit on the stream and not crash the program

The standard wording for put_time is specified in [ext.manip]/10 and [locale.time.put.members]/1. Quoting the latter:

Effects: The first form steps through the sequence from pattern to pat_end, identifying characters that are part of a format sequence. Each character that is not part of a format sequence is written to s immediately, and each format sequence, as it is identified, results in a call to do_put; thus, format elements and other characters are interleaved in the output in the order in which they appear in the pattern. Format sequences are identified by converting each character c to a char value as if by ct.narrow(c, 0), where ct is a reference to ctype<charT> obtained from str.getloc(). The first character of each sequence is equal to '%', followed by an optional modifier character mod and a format specifier character spec as defined for the function strftime. If no modifier character is present, mod is zero. For each valid format sequence identified, calls do_put(s, str, fill, t, spec, mod).

Ditto, there's nothing wrong with put_time. As a result, the program should neither crash nor set failbit. (But you posted the program with a wrong parameter order - the output operand should be std::put_time(now, "%E%J%P").)

I believe it's UCRT's strftime that is buggy, and we can still rely on it if it gets fixed. However, a workaround would be needed if we want put_time to behave correctly as soon as possible.


Personal thought on workaround:

Don't call strftime/_Strftime in UCRT at this moment. Write a "corrected strftime" instead.

TheStormN commented 2 weeks ago

@frederick-vs-ja Thanks a lot for the input. Yes my example got a bit wrong while I was trying to do a minimal version of it. I've edited my post to correct it.

Personal thought on workaround:

Don't call strftime/_Strftime in UCRT at this moment. Write a "corrected strftime" instead.

Unfortunately this would not solve all the cases. For example std::format does also call std::put_time which calls strftime. I guess there might be more similar cases. Going to chase a huge codebase for all variants which lead to strftime is a bit an overkill.

From past experiences I think requesting any kind of behavioral changes to the UCRT usually gets ignored. Perhaps the best approach here would be for the STL team to create their own version and not use the UCRT provided one. This way not only they will avoid crashes but will also better conform to the C++ standard.

frederick-vs-ja commented 2 weeks ago

Going to chase a huge codebase for all variants which lead to strftime is a bit an overkill.

Fortunately, MSVC STL is not so that huge. I think this is the only concentrated relavent entry to UCRT. std::format eventually enters here when formatting chrono types. https://github.com/microsoft/STL/blob/ef1d621d51263285aff8e560a214f5477d63d687/stl/inc/xloctime#L749

AFAIK _Strftime doesn't call strftime, but both of them call _Wcsftime_l which raises non-conforming errors.

Perhaps it's still doable to add a _Strftime_v2 function which behaves as corrected _Strftime/strftime.

TheStormN commented 2 weeks ago

Ah, as a workaround you meant the STL codebase. I thought you implied the application code on which I'm working on.

Yep, I think this is doable and not too complicated. Also it should not affect binary compatibility.

@StephanTLavavej Any thoughts on this one?

StephanTLavavej commented 1 week ago

@frederick-vs-ja I think you missed this paragraph, WG14-N3220 7.29.3.5/6:

If a conversion specifier is not one of the ones previously specified, the behavior is undefined.

I think that the second part of your analysis is correct - while C says that unrecognized conversion specifiers are UB, C++ says that unrecognized conversion specifiers are copied unchanged ([locale.time.put.members]/1 "Each character that is not part of a format sequence is written to s immediately").

It seems that no UCRT changes are necessary here - instead we should alter the STL's wrapper layer to recognize unknown conversion specifiers and ensure that they're copied unchanged (after 5 seconds of thought, escaping them with %% would be minimally intrusive but I don't know how easy/viable that is).

TheStormN commented 1 week ago

@StephanTLavavej Will that also cover the cases where some more broken string is passed? For example, lets say I just forward user input to the std::put_time and in all cases I would not expect a crash(which strftime() will do) but some error which can be handled. In general I think in C++ no matter what is the format string, the app should not crash.

P.S. I think a much better solution would be for the UCRT team to alter the UB of strftime() to be more C++ friendly. :)

StephanTLavavej commented 1 week ago

Can you provide a specific example of a "broken string" that does not involve an unknown conversion specifier?

TheStormN commented 1 week ago

Hey, for example "%Y-%-m-%-d" but I guess that will also be taken care of by the escaping? Anyway, whichever workaround you choose, still the checks which strftime is already doing will have to be replicated. I think if the UCRT team alters the UB to conform to C++ we will have a big win here.

StephanTLavavej commented 1 week ago

Yeah, %- is an unknown conversion specifier, so I think that's the same issue.

Getting UCRT changes is extremely difficult - I think we should pursue STL changes here.

TheStormN commented 1 week ago

In that case I would recommend to not use strftime at all and create a complete STL rewrite.

I guess you should have access to UCRT source code, so coping most stuff from there would make this task much easier.

P.S. This will also solve the issues with std::format crashes when used with Chrono types.

TheStormN commented 1 week ago

@StephanTLavavej I've submitted a PR with simple fix: #4840

However, by looking at the code there I see significant effort to avoid _Strftime crashing. I still do think that if UCRT can't be changed, an new STL version of that function should be created which will not even require hacks to expand the string if more space is needed. Also strftime on Windows is really slow as it is doing char->wchar->char conversion and because of the current way std::put_time is using it by literally invoking it for each format specifier.

So is there any interest in doing STL native version of time_put to avoid calling _Strftime is general? This will also greatly benefit chrono related formatting.

StephanTLavavej commented 5 days ago

Improved performance and resistance to UCRT/STL specification divergence would be worthwhile, but that has to be balanced against the complexity and maintenance burden of such a change. Our immediate reactions were "rarrrrrgh!" (@barcharcraz) and "uhhhhhhhh?" (@StephanTLavavej) :joy_cat:.

We wouldn't want to encourage anyone to work on that, but we won't reject the idea out of hand.

TheStormN commented 5 days ago

Well, the fmt library on which the std::format implementation is based already did it and as you already have an agreement with the author, we can take some code from there. :))