[SUGGESTION] in cpp2 mode, string literals should be string_literal by default

neumannt commented 2 years ago

Raw pointers are fundamentally problematic if we have to do pointer arithmetic or offset based access. A major source for pointers, that we inherited from C, are strings. There are safer alternatives to C strings (i.e., string_view), but they are not used by default. We still need classic C strings for compatibility reasons, but the default should be a safe construct.

Thus, I would suggest that a string literal in cpp2 mode is considered a string_view by default. Traditional C strings could be constructed by using, e.g., the c suffix:

cpp2: "abc" -> cpp: "abc"sv
cpp2: "abc"c -> cpp: "abc"

That would make string handling much cleaner and safer. Compatibility with existing code is a concern, but as cpp2 code is by definition new anyway, we can always add the 'c' suffix if needed to get a traditional C string.
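In today's C++ terms, the proposed mapping would look roughly like the sketch below. The function names are made up for illustration, and since the `c` suffix is only a proposal, the second function just returns a plain literal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

using namespace std::string_view_literals;

// What the proposal would have cpp2 "abc" lower to: a string_view that
// carries its own length, so nothing has to scan for the end.
constexpr std::string_view proposed_default() { return "abc"sv; }

// What the proposed cpp2 "abc"c would lower to: a classic C string whose
// length has to be rediscovered by searching for the terminator.
inline const char* proposed_c_suffix() { return "abc"; }

inline void demo_literal_mapping() {
    assert(proposed_default().size() == 3);
    assert(std::strlen(proposed_c_suffix()) == 3);
    // Both forms denote the same characters.
    assert(proposed_default() == proposed_c_suffix());
}
```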

marioarbras commented 2 years ago

I think this is a great idea, but I propose a slightly different approach. Instead of having string literals be string_view's in Cpp2 syntax, I propose creating a new string literal type.

The reason is that, in general, string_view does not represent a null-terminated string. It does if you initialize it from a C string literal, but that is not true for all use cases, and relying on it might introduce bugs.

Consider how you would interface with a C library assuming Cpp2 string literals are string_view's:

str:= "a Cpp2 string literal"; // actually a std::string_view

some_c_function(str.data()); // OK, str.data() is null-terminated

You have no option but to call the member function .data() to get a const char* out. A user might be tempted to do the same thing when passing a string_view that was not initialized from a string literal:

str: std::string = "a regular Cpp string";
substr: std::string_view = (str.begin(), std::next(str.begin(), 3));

some_c_function(substr.data()); // ¯\_(ツ)_/¯
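A C++ sketch of what goes wrong in that second call. `some_c_function` here is a hypothetical stand-in that simply measures its argument the way C code would. In this particular case the result is only wrong, because the underlying std::string happens to be null-terminated; with other backing storage the read could run out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical C-style API that trusts its argument to be null-terminated.
inline std::size_t some_c_function(const char* s) { return std::strlen(s); }

inline void demo_substring_data() {
    std::string str = "a regular Cpp string";
    std::string_view substr(str.data(), 3);  // views only the first 3 chars

    assert(substr.size() == 3);
    // data() points into str, so the C function keeps reading past the end
    // of the view until it reaches str's own terminator.
    assert(some_c_function(substr.data()) == str.size());
}
```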

I suggest implementing strongly-typed string literals in Cpp2. This type could have an implicit conversion operator to string_view and should only have a constructor from a string literal. This type can also have a .c_str() member function, just like std::string, that returns a const char* guaranteed to be null-terminated. I've seen some implementations of such types, sometimes called fixed_string or string_literal. Internally, this type is just a wrapper around a reference to the string literal (guaranteed to be valid for the duration of the program) with a compile-time known size and so does not add any overhead.

The code above then becomes

str:= "a Cpp2 string literal"; // actually a cpp2::string_literal
str2: cpp2::string_literal = "another Cpp2 string literal"; // the type can also be given explicitly

some_c_function(str.c_str()); // Guaranteed to be null-terminated

Of course, this doesn't prevent the user from passing a non null-terminated string to a C function, but it does a much better job at educating users to call .c_str() whenever they need to interface with a C function.
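A minimal C++17 sketch of such a type, to make the idea concrete. This is a hypothetical `string_literal`, not an existing cpp2 or standard class. One assumption to flag: in plain C++ the constructor binds to any const char array of known size, not only to literals, so the "literal only" guarantee would have to come from cppfront itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical sketch of the proposed type.
template <std::size_t N>
class string_literal {
public:
    // Binds to a char array of known size, such as a string literal.
    constexpr string_literal(const char (&s)[N]) noexcept : ref_(s) {}

    // Null-terminated as long as the object was built from a literal.
    constexpr const char* c_str() const noexcept { return ref_; }

    // Length without the terminator.
    constexpr std::size_t size() const noexcept { return N - 1; }

    // Implicit conversion to string_view, as suggested above.
    constexpr operator std::string_view() const noexcept {
        return std::string_view(ref_, N - 1);
    }

private:
    const char (&ref_)[N];  // only a reference to the literal is stored
};

inline void demo_string_literal() {
    string_literal str = "abc";  // N is deduced as 4
    std::string_view sv = str;   // implicit conversion
    assert(sv == "abc");
    assert(std::strlen(str.c_str()) == str.size());
}
```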

In fact, this could be pushed even further and have cppfront emit a warning (or error?) whenever it sees std::string_view::data() being used to pass an argument to a function taking const char*, although I don't know if it can be implemented.

hsutter commented 2 years ago

Raw pointers are fundamentally problematic if we have to do pointer arithmetic or offset based access. A major source for pointers, that we inherited from C, are strings.

I think this is well said... is it string literals themselves, or the pointer arithmetic, that is the source of bugs?

Lifetime and null safety: The pointers to them are always non-null and safe to dereference.
Concurrency safety: They have pretty good immutability protection from modification.
Bounds safety: The strings themselves are guaranteed to be null-terminated so correctly written C-style functions will never fall off the end. The problematic part is navigating them using pointers... but Cpp2 already bans safe code from doing that and so people are forced to use them safely such as via string_view or conversion to string.

So given that Cpp2 bans pointer arithmetic in safe code, is there a remaining problem with string literals that forcing them to be strongly typed will solve, and reduce a class of CVEs or reduce a class of things we have to teach? I'm not against it, I'm interested, I just want to be sure it's solving a problem not already solved... What do you think?

in general, string_view does not represent a null-terminated string.

Yes, that's a major reason we can't use it generally for stringlike things. This isn't a criticism of string_view BTW... there are reasons it is the way it is such as that substr couldn't work in-place otherwise, and I do want to move to more string_view, see for example cpp2util.h's contract violation handler message parameter comment.

I suggest implementing strongly-typed string literals in Cpp2. This type could have an implicit conversion operator to string_view and should only have a constructor from a string literal. This type can also have a .c_str() member function, just like std::string, that returns a const char* guaranteed to be null-terminated. I've seen some implementations of such types, sometimes called fixed_string or string_literal. Internally, this type is just a wrapper around a reference to the string literal (guaranteed to be valid for the duration of the program) with a compile-time known size and so does not add any overhead.

Thanks. If we do need a strongly typed string literal, that sounds promising.

neumannt commented 2 years ago

You are right that the bad things will effectively happen in cpp mode, as cpp2 bans pointer arithmetic. But from a safety perspective it is nevertheless bad to teach everybody that strings are char*. Consider this code here:

myprint: ( c: * const char ) = {
   std::cout << "printing " << c << "\n";
}

main: () -> int = {
    myprint("ok");
    c:char = '?';
    myprint(c&); // bad
}

It will compile happily in cpp2 mode, but bad things happen. Of course the real problem is the type of myprint, not the string literal per se. But as long as string literals are simple pointers people will pass strings like that. Note, by the way, that this here compiles, too:

myprint: ( c: * const char ) = {
   std::cout << "printing " << c << "\n";
}

main: () -> int = {
    myprint("abc"+5); // not string concat!
}

But that is probably just a limitation of the current cppfront.
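For comparison, a C++17 sketch of why the expression compiles for a pointer and would not compile for a string_view. The `can_add_int` trait is a helper written for this example. The demo uses an in-bounds offset of 1, since an offset of 5 on "abc" would be undefined behavior.

```cpp
#include <cassert>
#include <cstring>
#include <string_view>
#include <type_traits>
#include <utility>

// Detects whether `T + int` is a valid expression (C++17 detection idiom).
template <typename T, typename = void>
struct can_add_int : std::false_type {};

template <typename T>
struct can_add_int<T, std::void_t<decltype(std::declval<T>() + 1)>>
    : std::true_type {};

// A literal decays to const char*, so `+` is pointer arithmetic.
static_assert(can_add_int<const char*>::value, "pointer + int compiles");
// string_view has no operator+ taking an integer, so the mistake is rejected.
static_assert(!can_add_int<std::string_view>::value, "view + int does not");

inline void demo_literal_plus_offset() {
    const char* lit = "abc";
    const char* tail = lit + 1;  // skips 'a'; this is not concatenation
    assert(std::strcmp(tail, "bc") == 0);
}
```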

Mario's suggestion of introducing a string_literal class is actually a great idea, it offers the safety of a string_view and can still fall back safely to C pointers if needed.

switch-blade-stuff commented 2 years ago

I do support this idea as well.

The proposed string_literal type can work the same way that initializer_list already works, where it represents a compile-time string literal and nothing else, and cannot be constructed by the user (can be done via an internal namespace factory to hide public constructors).

It would also seamlessly integrate with basic_string (and basic_string_view after C++23), as it would be a contiguous range that has an implicit cast operator to basic_string_view.

I would also suggest an explicit cast to const C * (where C is the literal character type), to enable interfacing with the C-style string APIs, and to make template & concept metaprogramming easier.

EDIT: Made a quick mock-up as an experiment on godbolt.

neumannt commented 2 years ago

I just realized that my example will fail (as in: compile and crash) even if the signature of myprint were (c: string_view)... The implicit conversion of const char* to string_view is really unfortunate. But if the parameter type is string_view, the compiler at least has the chance to complain when it is passed a pointer to a char variable. If the type is const char*, the compiler cannot know whether a string or really a pointer to a single character is expected.
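The difference can be checked at compile time. In the C++17 sketch below, `literal_view` is a hypothetical literal-only parameter type written for this example. It still accepts any char array of known size, so it narrows the problem rather than eliminating it.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <type_traits>

// Hypothetical view type whose only converting constructor takes a char
// array of known size, which is what a string literal is.
struct literal_view {
    template <std::size_t N>
    constexpr literal_view(const char (&s)[N]) noexcept : text(s, N - 1) {}
    std::string_view text;
};

// string_view silently accepts any const char*, including &c for a lone char.
static_assert(std::is_convertible_v<const char*, std::string_view>);
// The literal-only type rejects a bare pointer at compile time...
static_assert(!std::is_convertible_v<const char*, literal_view>);
// ...but still accepts the array type of a two-character literal.
static_assert(std::is_convertible_v<const char (&)[3], literal_view>);

// A function declared this way cannot be called with a pointer to a char.
inline std::size_t length_of(literal_view v) { return v.text.size(); }

inline void demo_literal_only_parameter() {
    assert(length_of("ok") == 2);
}
```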

switch-blade-stuff commented 2 years ago

While, to my knowledge, there are no CVEs directly caused by string literals themselves, discouraging the use of raw pointers is a good idea and would force developers to avoid pointer operations and C-style string APIs, which can avoid sudden bugs as @neumannt has mentioned.

Another advantage of using a custom literal type is that it would be an actual contiguous_range, and as such can be used with templates as a range.

marioarbras commented 2 years ago

Lifetime and null safety: The pointers to them are always non-null and safe to dereference.

Concurrency safety: They have pretty good immutability protection from modification.

Bounds safety: The strings themselves are guaranteed to be null-terminated so correctly written C-style functions will never fall off the end. The problematic part is navigating them using pointers... but Cpp2 already bans safe code from doing that and so people are forced to use them safely such as via string_view or conversion to string.

I agree with all this. However, do we really want to be forced to wrap string literals with string_view just to be able to navigate them?

So given that Cpp2 bans pointer arithmetic in safe code, is there a remaining problem with string literals that forcing them to be strongly typed will solve, and reduce a class of CVEs or reduce a class of things we have to teach? I'm not against it, I'm interested, I just want to be sure it's solving a problem not already solved... What do you think?

If all Cpp2 functions that require a string parameter are forced to take it as a std::string or std::string_view, then I think string literals don't need to be strongly typed. But Cpp2 doesn't ban functions from taking a const char* parameter to represent a string literal. In that case we are always forced to wrap it in a string_view inside the function because navigation is banned on const char*.

The same is true if we create a string literal and immediately need to navigate it, without passing it to a function. We have no alternative but to wrap it in a string_view. This reminds me of your CppCon 2022 talk at 59:49. I know you were talking about bounds checking, but it's an extra variable we have to introduce just to be able to perform navigation. Another thing to consider is code with mixed Cpp1-Cpp2 syntax, where pointer arithmetic is allowed. It will take a while before all code is rewritten in Cpp2 syntax.

From your recent ABI design note in the wiki you mention that

(...) Cpp2 presents (...) the opportunity to make that code mean what we want it to mean without worrying about breaking changes...

Whenever we write the following

str:= "I want a string literal, please and thank you";

we are asking for a string literal, but in return we get a pointer 😢. It seems like the code we wrote doesn't mean what we wanted it to mean.

Suppose we didn't have a std::array in C++. We would have to create a C-style array and always wrap it in std::span just to safely navigate it, but having std::array is much more convenient. I always felt that a strongly-typed string literal was a missed opportunity in C++.

@switch-blade-stuff, I was actually thinking of a slightly different implementation. cpp2::string_literal's size should be a non-type template parameter. This way we only store a reference to the string literal it represents, so there's really no overhead. We also don't need the make_string_literal helper. All we need is a deduction guide to help us construct from a C string literal without having to specify the size as a template argument.

In my opinion this doesn't have to be a special type like initializer_list. The only special thing about cpp2::string_literal is that, in Cpp2, string literals in the form "..." would be deduced as cpp2::string_literal's. For example:

str:= "a cpp2 string literal"; // deduced as cpp2::string_literal
c_str: *const char = "a C-style string literal"; // explicit type provided

Alternatively, you could instead completely ban C-style strings.

Also, I don't think we should have the conversion operator to const char*. We have .c_str() already and it would probably be good to keep the interface consistent with std::string.

I have an implementation of basic_string_literal<Char, N, Traits = ...> that I use personally. The only thing I don't like about it is the user-defined string literal I had to create and the fact I need to call using namespace ...::string_literals everywhere, which adds a lot of noise. With Cpp2, this wouldn't be an issue because you wouldn't need a user-defined string literal.

Here's a very simplified implementation on godbolt with a very incomplete example. I'd be happy to submit a PR with an implementation of cpp2::basic_string_literal at some point if you think this would be worth adding to your experiment! 😃

willwray commented 2 years ago

This suggestion, that the language generate a library type for a problematic language object, follows a pattern in recent C++ evolution. The case of std::initializer_list has already been invoked. This is not a great precedent. It increases complexity without fixing underlying safety issues, so goes against stated goals of cpp2 to simplify the language and make it safer overall.

Now, cpp2 can change the semantics of syntax2 so is free to break from the expeditious ISO approach (library proposals are easier to land than language proposals and not so constrained by compatibility).

Why not instead improve the language array that is built into cpp2, making all arrays better and safer, string literals included? I believe this can be done without any new syntax, i.e. with only semantic changes to existing syntax.

First, disallow the implicit eager array decay to pointer. Second, add array-array copy semantics as proposed in P1997 (cpp2 can go further to address formal array parameters). Third, allow array-array comparisons (lexicographical as usual, following element comparability). Note that string literals are already special-cased with copy-init semantics, in C and C++.

These semantic changes make the cpp2 builtin array a regular type, copyable and equality comparable, and get rid of C array decay (if needed, a pointer can be explicitly extracted as begin("hello") or &"hello"[0]).
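As a rough illustration of the semantics being asked for, the C++17 sketch below uses std::array as a stand-in for a "regular" built-in array (an assumption of this example, not something the proposal itself relies on), and shows the two explicit ways of extracting a pointer from a literal.

```cpp
#include <array>
#include <cassert>
#include <iterator>

inline void demo_regular_array() {
    // std::array already behaves like the proposed built-in array:
    // it is copyable and comparable, and it never decays.
    std::array<char, 6> a = {'h', 'e', 'l', 'l', 'o', '\0'};
    std::array<char, 6> b = a;  // element-wise copy
    assert(a == b);
    b[0] = 'j';
    assert(a < b);              // lexicographical: 'h' < 'j'

    // Explicit pointer extraction instead of implicit decay. The literal is
    // bound to one reference first, because two separate "hello" literals
    // are not guaranteed to share storage.
    const auto& lit = "hello";  // const char (&)[6]
    const char* p = std::begin(lit);
    assert(p == &lit[0]);
    assert(std::end(lit) - p == 6);  // five letters plus the terminator
}
```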

Language-provided string literals are special. They have important implementation defined behavior such as possible 'interning' - sharing of storage - that means their ids cannot be guaranteed unique. Cpp2 could make this language defined behavior. On the other hand, it's probably best left as is for implementers to decide.

Thornier is the issue of what to do about null termination. It's needed for compatibility yet also still useful and a good choice in many cases. This might argue for a bifurcation. One way to do this would be with a language span type that only binds to constexpr null terminated char arrays.

Let's strive to fix underlying language issues in C++ and its C subset first.

hsutter commented 1 year ago

Related to the part of #159 that proposes a prefix (e.g., F) to opt into interpolation.

feature-engineer commented 1 year ago

@marioarbras

The same is true if we create a string literal and immediately need to navigate it, without passing it to a function.

Have you ever seen anyone do this? Why would you need to navigate an immutable literal? You already know at compile time every detail about it... If you need to take certain parts of it and examine them, why not save them in separate variables? It'd make the code much cleaner to give these parts names anyway than to have the code navigate the single variable... Are you thinking of some sort of compile-time meta-compiler which takes strings and processes them? Sounds like a bad idea to me.

switch-blade-stuff commented 1 year ago

Are you thinking of some sort of compile-time meta-compiler which takes strings and processes them? Sounds like a bad idea to me.

This is an implementation-defined trick, but what you can do is pass __PRETTY_FUNCTION__ (GCC, Clang) or __FUNCSIG__ (MSVC) as a non-type template parameter, then parse and trim that string at compile time to generate a pretty type name for arbitrary types.

The resulting type name constant will also be completely stripped of the rest of the function signature fluff.
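A simplified C++17 sketch of one common form of this trick. Instead of passing the string as a template parameter, it reads the macro inside a function template and trims the result with string_view. The function names are made up for this example, the signature layouts noted in the comments are compiler-specific assumptions, and types whose names contain `;` or `]` (such as arrays) are not handled.

```cpp
#include <cassert>
#include <string_view>

// Returns the compiler-specific "pretty" signature of this instantiation.
template <typename T>
constexpr std::string_view raw_signature() {
#if defined(_MSC_VER) && !defined(__clang__)
    return __FUNCSIG__;
#else
    return __PRETTY_FUNCTION__;
#endif
}

// Trims the signature down to just the spelling of T.
template <typename T>
constexpr std::string_view type_name() {
    const std::string_view sig = raw_signature<T>();
#if defined(_MSC_VER) && !defined(__clang__)
    // Assumed MSVC layout: "... raw_signature<TYPE>(void)"
    const std::string_view prefix = "raw_signature<";
    const auto start = sig.find(prefix) + prefix.size();
    const auto stop = sig.rfind(">(");
#else
    // Assumed layouts: GCC "... [with T = TYPE; ...]", Clang "... [T = TYPE]"
    const std::string_view prefix = "T = ";
    const auto start = sig.find(prefix) + prefix.size();
    const auto stop = sig.find_first_of(";]", start);
#endif
    return sig.substr(start, stop - start);
}

inline void demo_type_name() {
    assert(type_name<int>() == "int");
}
```

All of the navigation goes through string_view member functions, so no pointer arithmetic is needed for the parsing step.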

jcanizales commented 1 year ago

Just to note, compile-time code execution is indeed part of the roadmap of Cpp2.

feature-engineer commented 1 year ago

Compile-time code execution is great; what I have doubts about is the need for compile-time string literal slicing. If you've got several parts of interest in a single string literal such that you need to parse it and slice it to get the relevant parts, it'd just be easier and more readable to put these parts in separate variables to begin with... Unless there's overlap between the slices, but that's a very niche scenario, which doesn't justify adding a language feature IMO.

feature-engineer commented 1 year ago

This is an implementation-defined trick, but what you can do is pass __PRETTY_FUNCTION__/__FUNCSIG__ as a non-type template parameter

Does this trick require the use of pointer arithmetic? If so, how bad would it be if we wrapped this parameter in a string_view in order to parse and trim it? How often would we encounter such code?

switch-blade-stuff commented 1 year ago

You would wrap it in a string_view-like container for parsing; parsing raw pointer strings isn't the most convenient thing to do. As for frequency, it definitely isn't something you'll see every day.

hsutter commented 1 year ago

Thanks again for this suggestion. After re-reading through the thread, and trying out a small cpp2::string_literal wrapper, I'm going to defer this for now.

When I started to implement it, I found that I wanted to do it right, including supporting u8"literal" and char16_t* and such, which gets into areas (Unicode) where I'm far from expert. It seemed like more work than I can justify for the benefit I can see at the moment, given that the other safety features Cpp2 has or is en route to getting eliminate many of the problems of string literals in today's C++. So I backed out the implementation work I had started for this. But I'll keep this in mind for the future, thanks again for the suggestion!

hsutter / cppfront

[SUGGESTION] in cpp2 mode, string literals should be string_literal by default #45