hadrielk / string-interpolation

String interpolation proposal paper

Combination with user defined literals. #11

Open BengtGustafsson opened 8 months ago

BengtGustafsson commented 8 months ago

We need to specify if and how f/x literals can be combined with user defined literals.

The most reasonable interpretation of such a combination would be to stick the user defined suffix to the resulting literal, i.e. before the first comma leading up to the first expression-field. It is allowed today to concatenate literals without suffix with literals with suffix, and it is allowed to concatenate multiple literals with user defined suffix as long as it is the same suffix. The literals are concatenated before the user defined literal function is called.
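To make the existing concatenation rule concrete, here is a minimal sketch. The `_tr` suffix is a hypothetical stand-in for a translation suffix and simply copies its input; the point is only that the operator is called once, with the already concatenated text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical suffix used only for illustration; a real translation suffix
// would look the text up in a message catalogue instead of copying it.
std::string operator""_tr(const char* text, std::size_t length) {
    return std::string(text, length);
}

// Adjacent literals are concatenated first, so the operator sees one literal.
// An unsuffixed literal may be mixed with a suffixed one...
inline std::string mixed() { return "Hello, " "world"_tr; }

// ...and several literals may carry the suffix as long as it is the same one.
inline std::string same_suffix() { return "Hello, "_tr "world"_tr; }
```

Under the first interpretation, an f/x literal with a suffix would behave the same way: the suffix applies to the whole resulting literal, before the first expression-field argument.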

Another more complicated interpretation is that an x literal with user defined suffix tries to call a user defined literal function with more than one parameter, i.e. first the literal and then the expression-fields. Currently all user defined string literal functions have two parameters so that would have to be changed for this to work out. Overload resolution on these additional parameters would work as usual, with one plausible possibility being a parameter pack.

With this in place it would in fact be possible to implement f literals as user defined literals, which of course has some advantages in that it is much less magic (i.e. std::format would not be magically called by the compiler). There are however some drawbacks:

  1. It diverges from, to my knowledge, all other languages supporting string interpolation, where the f is placed first.
  2. It precludes my "full expression" definition as the end of the string literal would have to be determined without knowing of the trailing f suffix. Thus any nested double quotes would have to be escaped.
  3. Parsing precludes a free-standing preprocessor or prevents using macros in expression-fields. This is as the correct interpretation of the literal contents is not known until the actual compiler, with its knowledge of available user defined literal function overloads, gets the string literal, which is after the macro expansion phase. In an integrated preprocessor/compiler environment the compiler could go back and ask the preprocessor to preprocess each expression-field but with a free-standing preprocessor this is not possible.

To me the third problem is a non-starter for this idea, while the others are also really bad.

It seems that the first possibility, where the suffix is added to the resulting string literal is the way to go. We could also forbid this but that seems unnecessary as it is easy to define, implement and understand what happens. I'm guessing that there are code bases where a user defined suffix is used for translation and we don't want to exclude those from using f/x-literals.

BengtGustafsson commented 7 months ago

If these drawbacks can be overcome, this may be the way to go. I still have no solution for 3 that allows for a stand-alone preprocessor.

  1. We can live with a trailing f, or we can propose to allow user defined literals to optionally start with the prefix, if it doesn't conflict with u, u8, U or L.
  2. We can still allow full expressions, except unescaped quotes, but to handle that a raw literal can be used.
  3. This is still a big problem. We could define that when a suffix is seen by the preprocessor it performs the extraction and then calls the operator with the explicit operator""f(args) syntax. But this fails in the case that there is a valid { expr } in the literal, but only a C++20 overload: then the literal operator is called with too many parameters. There are three candidate remedies: a) The preprocessor asks the compiler if there is only a C++20 overload and doesn't do the extraction if so. b) The preprocessor magically sends both the original and extracted literal to the compiler somehow. For a freestanding preprocessor this could amount to a fake call to a wrapper function (which doesn't exist) that takes both literals and discards one of them depending on the overload set of the suffix name. This also requires suppression of any errors encountered during macro expansion in the extracted expressions, unless there are new overloads, at which point the compiler must retroactively present the errors. c) The compiler is officially required to undo the extraction if there was no new overload. This doesn't work as there may be macro expansions that have been done in the expressions. There may also be errors caused in the process of such macro expansion which should not have been reported.

This leads me to believe that the only solution is the breaking change of requiring doubling of { and } characters in user defined string literals. While this on paper is a big breakage, in practice it should be rare, and it is always a loud error.
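The effect of the doubling rule on extraction can be sketched roughly as follows. This is not the extract_fx implementation mentioned elsewhere in the thread; `Extracted` and `extract_fields` are hypothetical names, and the sketch deliberately ignores format specifiers, nested braces and quotes inside expressions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Result of splitting an interpolated literal: a plain format string plus the
// source text of each expression-field, in order of appearance.
struct Extracted {
    std::string format;
    std::vector<std::string> expressions;
};

// Naive extraction: "{{" and "}}" stay as escaped braces, anything else
// between '{' and the next '}' is treated as an expression-field.
// Not handled here: format specifiers, nested braces, quotes in expressions.
Extracted extract_fields(std::string_view literal) {
    Extracted out;
    for (std::size_t i = 0; i < literal.size(); ++i) {
        const char c = literal[i];
        const bool doubled = i + 1 < literal.size() && literal[i + 1] == c;
        if ((c == '{' || c == '}') && doubled) {
            out.format += c;  // keep the escape so the formatter still sees it
            out.format += c;
            ++i;
        } else if (c == '{') {
            const std::size_t close = literal.find('}', i);
            if (close == std::string_view::npos) {  // unterminated field
                out.format.append(literal.substr(i));
                break;
            }
            out.expressions.emplace_back(literal.substr(i + 1, close - i - 1));
            out.format += "{}";
            i = close;
        } else {
            out.format += c;
        }
    }
    return out;
}
```

With this rule a literal such as "Hello {name}, {{literal}} {a + b}" yields the format string "Hello {}, {{literal}} {}" and the two expressions name and a + b, while an existing literal with a single unescaped brace pair changes meaning, which is the breakage discussed above.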

BengtGustafsson commented 7 months ago

But... if we include the prefix user defined literal position in the same proposal we can get it backwards compatible, and restore the f to be first: Always do extraction if the literal is prefix, never if it is suffix. This leaves us with the wording problem of defining the order of the u, U, u8 and L prefixes, the user defined prefixes and the R for raw. As presented in Hadriel's paper and in my extract_fx implementation the f or x prefix is between the character set and raw prefixes. I don't see a technical issue here as long as we exclude the predefined prefixes from the allowed names of user defined prefixes. To not break code that already uses user defined suffixes starting with the predefined prefix letters we should probably invent a new syntax that separates the prefixes and suffixes, and what better than to follow the current idea of mimicking the place of use:

template<typename... Args> std::string operator f""(std::format_string<Args...>, Args... args);

Note that the f is before the "". This should not be hard to parse as it is just an identifier followed by the quotes.

We could also get rid of the size_t parameter which would make overloading for a std::format_string as the first parameter more logical. I'm guessing that the size parameter was there just to distinguish the string literal case from the integer and float literal cases, so this should not be a problem, as the literal would always be zero terminated anyway.

BengtGustafsson commented 7 months ago

There seems actually not to be any reason to use a special syntax for the function declaration, it could be some regular function name made up of a standard name followed by the prefix that the preprocessor found. One possibility is interpolated_literal_xxx where xxx is the identifier before the "" less any encoding and raw prefix.

A drawback is that there is a non-zero chance that someone used whatever standard name we come up with.

Note that even with this system you'd have to bring the function into the current scope to be able to use the literal.

Actually I think that the operator"" based name is a better idea, it conveys the intent better and avoids inventing a magic standard name.

BengtGustafsson commented 6 months ago

There is a forward compatibility issue: As soon as we allow this we can't add more encoding or raw prefixes as that could be a breaking change. It seems less likely that we would want to, though.

Mick235711 commented 4 months ago

One huge issue with the current implementation is that I would expect expanding a macro to force-generate a standard library function to be an instant no-no for WG21. The C++ core language and the standard library are very much separated, with very limited interaction, mostly in "magic library functions" and the Language Support library. The most prominent interactions are:

Apart from the fact that many of those are already terrible mistakes we don't want to repeat (like the first one), all of those interactions are pretty weird in the sense that you cannot even write 1 <=> 2 without including <compare>. Another important aspect is that many of those are isolated standard library facilities; std::initializer_list does not interact with iterator facilities at all, and std::type_info is just a very bare object that does not even provide a proper operator<. There seems to be an effort to minimize the interaction and limit the coupling between the standard library and the language.

Now back to the UDL situation. I think this is the only viable path forward: namely using a UDL to call std::format instead of magically inserting a call by macro:

// Provided by the standard library, perhaps in <format>
std::string operator""$(/* some magical arguments */) { return std::format(/* something */); }

"Hello {var}"$
// All the compiler needs to do, is when it finds a suitable operator"" with magical arguments, translate the above to
// operator""$("Hello {}", var)

This way, the compiler does not need to know about std::format at all, just do an extraction of all expressions in replacement fields and call the corresponding UDL operator.
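As a rough illustration of the call the compiler would generate, the sketch below uses an ordinary variadic function, since today's literal operators cannot take extra arguments. The name `interpolate` is hypothetical, and a tiny "{}" replacer stands in for std::format so that the sketch builds as plain C++17; a real library operator would presumably forward to std::format instead.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <string_view>

// Stand-in for the proposed literal operator. Today's literal operators
// cannot take extra arguments, so an ordinary function (hypothetical name)
// plays the role of operator""$("Hello {}", var).
template <typename... Args>
std::string interpolate(std::string_view format, const Args&... args) {
    std::ostringstream out;
    std::size_t pos = 0;
    // Write the text up to the next "{}" and then the argument itself.
    auto emit = [&](const auto& value) {
        const std::size_t field = format.find("{}", pos);
        if (field == std::string_view::npos) return;  // more args than fields
        out << format.substr(pos, field - pos) << value;
        pos = field + 2;
    };
    (emit(args), ...);          // C++17 fold: one call per extracted expression
    out << format.substr(pos);  // trailing text after the last field
    return out.str();
}
```

In this picture, "Hello {var}"$ would be rewritten to something like interpolate("Hello {}", var): the compiler only extracts the expressions and picks the overload, and the library decides what to do with them.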

Now for the main questions:

It diverges from to my knowledge all other languages supporting string interpolation, where the f is placed first.

Yes, but this is the least of my worries. First of all, this is just a matter of taste, and "..."$ is not that weird; second, you can still write it as F"..."$, which makes the work the compiler needs to do even simpler. It's just that having both a prefix and a postfix is even more novel.

Having a prefix is not even good for C++, since we don't have native string types:

"Hello" // <- const char[6], decays to const char* on virtually any touch
R"(Hello)" // <- still const char[6]
L"Hello" // <- const wchar_t[6]
F"Hello" // <- std::string (!)

All forms of string literal today, even with a prefix, result in some kind of constant character array, but the proposed F"..." would result in a std::string, which may lead to awkward inconsistencies. In contrast, postfix UDLs have been able to return whatever type they like ever since C++11.
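The type difference can be checked directly. The sketch below uses only standard facilities; the s suffix from std::string_literals (C++14) serves as an example of a suffix choosing its own result type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

using namespace std::string_literals;  // enables the standard "..."s suffix

// Prefixed or not, a built-in string literal is an array of constant characters.
static_assert(std::is_same_v<decltype("Hello"), const char (&)[6]>);
static_assert(std::is_same_v<decltype(R"(Hello)"), const char (&)[6]>);
static_assert(std::is_same_v<decltype(L"Hello"), const wchar_t (&)[6]>);

// A suffix is free to choose its result type; the standard s suffix yields std::string.
static_assert(std::is_same_v<decltype("Hello"s), std::string>);

inline std::size_t suffixed_length() { return "Hello"s.size(); }
```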

It precludes my "full expression" definition as the end of the string literal would have to be determined without knowing of the trailing f suffix. Thus any nested double quotes would have to be escaped.

Yes, then just ban the use of double quotes inside replacement fields, as Python pre-3.12 does. Or alternatively, use F"..."$ which can again be parsed differently. You can still use " in FR"..."

Parsing precludes a free-standing preprocessor or prevents using macros in expression-fields.

If you don't rely on the preprocessor to extract the fields, then this is a non-issue.

You cannot use macros inside f-strings

There are a lot of workarounds for this:

// 1. Force the user to use a variable
int line = __LINE__ + 1;
F"This is line {line}"

// 2. Migrate 80% usage of macros to more modern replacements
F"This is line {std::source_location::current().line()}"

// 3. State clearly that std::format should be used if you want a macro inside

Besides, C itself has been moving away from using macros as standard library functions since C99. Both abs() (in stdlib.h) and fmax() (in math.h) are specified as functions rather than macros.

So in conclusion, what are the benefits of a UDL approach?

BengtGustafsson commented 3 months ago

I think we agree that a udl function is the way to go. My idea with the prefix was to use it to signal that the arguments are to be extracted, thereby allowing this to be done early enough for macro replacements to occur in them, and this without the preprocessor having to know anything about which such prefix-udl functions are defined (it does have to know about u,u8,U and L to make sure to not extract argument expressions in this case).

My main reason for allowing macros is that many names that look like functions are actually macros, with WIN32 as a notable example.
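The concern can be illustrated with a small sketch that mimics the Win32 pattern of a documented name being a macro for a narrow or wide implementation. All names below are hypothetical and only show why extraction has to happen before, or together with, macro expansion.

```cpp
#include <cassert>
#include <string>

// Mimics the Win32 pattern where a documented "function" is really a macro
// selecting a narrow or wide implementation. All names here are hypothetical.
inline std::string user_name_a() { return "alice"; }
#define user_name user_name_a

// In ordinary code the preprocessor rewrites the name before compilation.
inline std::string from_code() { return user_name(); }

// Inside a string literal the preprocessor leaves the text alone, so an
// expression-field extracted after preprocessing would still say user_name.
inline std::string from_literal() { return "{user_name()}"; }
```

If the expression-field text is only extracted by the compiler proper, the name user_name in the field refers to nothing, because the macro was never applied to it.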