rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.93k stars 12.78k forks source link

Tracking Issue for `c"…"` string literals #105723

Closed tmandry closed 9 months ago

tmandry commented 1 year ago

This is a tracking issue for the RFC "c"…" string literals" (rust-lang/rfcs#3348). The feature gate for the issue is #![feature(c_str_literals)].

Steps / History

Unresolved Questions

CAD97 commented 1 year ago

w.r.t.

c'…' character literals

IIRC, C23 accepted N2653, which defines char8_t as a typedef for unsigned char. (C++20's char8_t is a distinct type.)

Assuming I'm correct that N2653 was accepted, it probably makes sense to define Rust's c"…" as being compatible with C's u"…" (char8_t[]) and Rust's c'…' as being compatible with C's u8'…' (char8_t).

To be explicit, this is not intended to argue that c"…"Rust should be usable as *const c_uchar, just that c"…".as_ptr()Rust should have the same memory content as u8"…"C. It may argue for c"…"Rust requiring UTF-8 well formedness; I do not know whether C's u8"…" allows constructing invalid UTF-8 via escapes.

Note that &CStr's pointer representation uses c_charRust, not c_ucharRust (which would match char8_tC). This serves as a counterargument that c strings in Rust should match bare string/character literals in C, which use unqualified charC, rather than the C17 (u8'…') and C23[^c23] (u8"…") prefixed literals.

[^c23]: To be fully accurate to my understanding, C17 added u8"…" literals of type array of char, and C23 changed their type to array of char8_t. Similarly, C17 added u8'…' literals of type unsigned char, and C23 changed their type to char8_t (for no semantic change, as char8_t is a typedef of unsigned char).

The presence of UTF-8 literals being unsigned char in C may additionally add weak motivation for a UTF-8 CStr type for Rust separate from CStr, which corresponds to the unqualified c_char type.

ojeda commented 1 year ago

Assuming I'm correct that N2653 was accepted

Yeah, char8_t was voted into C23.

It may argue for c"…"Rust requiring UTF-8 well formedness; I do not know whether C's u8"…" allows constructing invalid UTF-8 via escapes.

I think it was possible before, and will still be in C23:

"Any hexadecimal escape sequence or octal escape sequence specified in a u8, u, or U string specifies a single char8_t, char16_t, or char32_t value and may result in the full character sequence not being valid UTF-8, UTF-16, or UTF-32."

dead-claudia commented 1 year ago

I do want to note a couple things in the alternatives:

  1. While I like the idea of a FromStringLiteral, error messages would need at least a character offset to be actually useful beyond a mere compile-time panic!. Also, &CStrs are really more like byte string literals, and that offset would have to be a byte offset, not a char offset, too.

  2. If some platforms use a u8 for c_char while others use a i8, you need the c'...' literal to be assignable to c_char regardless of which it is in order to be portable. Allowing it to coerce to either i8 or u8 is probably better than having it exactly what c_char is, though. (Ideally, c_char should've been #[repr(transparent)] struct c_char(pub i8/u8); to force people to differentiate the two, but the time to change to that is long past.)

tmandry commented 1 year ago

@CAD97 @dead-claudia Would you be willing to open issues for these points and add the F-c_str_literal label (or tag me and I can add it)? Otherwise they are likely to get lost in the discussion. Tracking issues are meant to serve as a hub to point to other threads, not to be a place for technical discussion themselves.

dead-claudia commented 1 year ago

Created https://github.com/rust-lang/rust/issues/106479 for my first point. The second point's not likely to be relevant before a PR, and there's really not much to discuss there.

tgross35 commented 1 year ago

Should we make &CStr a thin pointer before stabilizing this? (If so, how?)

I don't think there would be anything about that change that should block this, it would just have to be updated whenever CStr becomes thin. The change itself is blocked, I think https://github.com/rust-lang/rust/issues/81513 would solve it, but it is still quite young. Seems like there hasn't been much discussion on the topic recently anyway - https://github.com/rust-lang/rust/issues/59905 is the only issue I could find, but it's pretty unofficial and stale

dtolnay commented 1 year ago

c"…" literals don't work right in proc macros — https://github.com/rust-lang/rust/issues/112820.

dtolnay commented 1 year ago

I updated the summary comment with recent history.

jmillikin commented 1 year ago

What work is remaining to stabilize this feature? It looks like the remaining checkboxes are documentation updates and then a stabilization PR -- is that it, or is there a blocker somewhere?

estebank commented 1 year ago

@jmillikin I believe the unresolved questions need to be answered one way or another, even if the answer is "not support this before stabilization and punt for later".

kanashimia commented 1 year ago

Should the (unstable) https://github.com/rust-lang/rust/issues/87555 accept C string literals? (If so, should it evaluate to a C string or byte string?)

I think the answer for that should be similar as with concat! macro, so currently NO. Reasoning is similar to why you can't concat_bytes!(b"foo", "bar") and can't concat!("foo", b"bar"). Not a blocker, but it is a possibility for a discussion in the future. The broader question would be about how to generalise those macros against all different types of strings in the future? Would be nice to not create another macro for each literal. That is better suited to be discussed somewhere else, perhaps in https://github.com/rust-lang/rust/issues/87555 or on Zulip.

Should we make &CStr a thin pointer before stabilizing this? (If so, how?)

Would be nice, but not a blocker.

jmillikin commented 1 year ago

I'll take a crack at offering non-official but opinion-ful answers for the unanswered questions.

Also add c'…' C character literals? (u8, i8, c_char, or something more flexible?)

Answer: no

Byte b'.' literals make sense because a b"..." and &[u8; N] are isomorphic -- there's no requirements about byte content, so b"abc" and &[b'a', b'b', b'c'] are clearly equivalent for all purposes.

C string literals c"..." don't have this property, and it doesn't make sense to try to write something like &[c'a', c'b', c'c'] because that's not a valid C string (no terminal NUL). The only thing c'.' literals could offer over today's b'.' is implicit coercion to c_char on platforms where that type is signed -- which has some value, but it's not really related to C string literals.

Presumably the parser is flexible enough (given b"..." and b'.' co-existing today) that c'.' literals could be added in the future should the need prove sufficient.

Should we make &CStr a thin pointer before stabilizing this? (If so, how?)

Answer: no

It's already possible to create a &CStr in a const context (as of Rust v1.59), so this feature isn't providing capabilities or enabling functionality that would need to be walked back if the internal representation of CStr changes. The following block of code does the same thing as a c"..." literal and it's valid stable Rust:

const fn c(bytes: &[u8]) -> &core::ffi::CStr {
    unsafe { core::ffi::CStr::from_bytes_with_nul_unchecked(bytes) }
}

const HELLO: &core::ffi::CStr = c(b"hello, world!\x00");

Should the (unstable) concat_bytes macro accept C string literals? (If so, should it evaluate to a C string or byte string?)

Answer: no

C strings have two equally valid &[u8] representations (with or without terminal NUL), so allowing them to be used in concat_bytes! would be ambiguous.

If people really want to concatenate c"..." literals for some reason, a concat_cstrs! macro would avoid the ambiguity because there's only one possible answer for "NUL or no?" to each &CStr being concatenated.

Should there be a valid UTF-8 C string type that c"..." string literals support?

Answer: sounds useful but unrelated to this feature

Since the &CStr type represents C strings in an undefined encoding the c"..." literal syntax should do the same. The ability to have values that assert dual properties of (1) valid C string and (2) valid UTF-8 seems useful, but that would need to be its own type (core::ffi::Utf8CStr ?), and a separate literal syntax (cu"..." ?) to avoid type ambiguity.

After all, every &Utf8CStr would also be a valid &CStr, but the opposite would not be true, so if c"..." could be either one then the following code would be a type error:

impl AsRef<CStr> for Utf8CStr;

fn print_c_str(s: &impl AsRef<CStr>) { ... }

print_c_str(c"\xc3\xb5"); // ambiguous AsRef

Alternatively, if the compiler prioritized &CStr when inferring the type of c"..." literals, you'd end up with confusing and spooky behavior:

print_c_str(c"\xc3\xb5"); // does this print "õ" or "õ"? can't tell from call site
tgross35 commented 1 year ago

I'll take a crack at offering non-official but opinion-ful answers for the unanswered questions.

Also add c'…' C character literals? (u8, i8, c_char, or something more flexible?)

Answer: no

I actually wouldn't mind having this since I occasionally come across FFI places where I need to compare a single c_char to a literal, and I can't just use 'y' or b'y' because c_char is signed on most platforms. If c_char: PartialEq<u8> worked than that would be nice, but type aliases....

Anyway, I think this is something that doesn't need a decision at this time.

Other answers seem reasonable to me, at least I don't think there are any blockers.


One minor concern I have is that the reimplementation of #113476 has only been on stable less than a month and #116124 is not yet in stable. But I think that is fine assuming the stabilization of this feature won't land until 1.76 at the earliest.

I do also kind of like the RFC's proposed idea that let x: &Cstr = "foo"; work instead of adding the c"..." syntax, especially so macros can accept a CStr literal where needed and macro users can write an unprefixed string literal (common case for plugin-type things). But that can likely be a separate discussion.

https://github.com/dtolnay/syn/issues/1502 is a big ecosystem thing but it shouldn't block anything

I think that it would be reasonable to open a stabilization PR if you are up for it, should be pretty minimal and the lang team can discuss / FCP there.

jmillikin commented 1 year ago

Concerns about soak time make sense. I'll hope for the best in regard to timing, but if it has to wait an extra release or two to verify ecosystem compatibility then that's just life.

I think that it would be reasonable to open a stabilization PR if you are up for it, should be pretty minimal and the lang team can discuss / FCP there.

OK, PR https://github.com/rust-lang/rust/pull/117472 created -- this is my first time trying to stabilize a rustc feature, so hopefully I didn't mangle it too badly.

madsmtm commented 1 year ago

Just noting that this interacts with the recently accepted RFC 3349 (tracking issue), which modifies the allowed escape codes in strings.

I checked the table in that RFC, and I think the rows for c"..."/cr"..." would be exactly the same as for b"..."/br"...".

WaffleLapkin commented 11 months ago

Closing this as per the merged stabilization PR: https://github.com/rust-lang/rust/pull/117472

GoldsteinE commented 10 months ago

Stabilization is reverted in #119528, so probably this should be open again?

traviscross commented 10 months ago

@rustbot labels +I-lang-nominated

Nominating to discuss and ensure we're OK with the restabilization that will occur automatically if we take no action.

traviscross commented 10 months ago

@rustbot labels -I-lang-nominated

We discussed this in the triage call today and agreed, given that #119172 has landed, that we were happy to see this restabilize in Rust 1.77.

stefson commented 9 months ago

Was this fully stabilized for rust-1.77.0_beta? I am failing to bootstrap the beta from the beta tarball with: error[E0658]: c".." literals are experimental

ehuss commented 9 months ago

@stefson Yes, it was fully stabilized in 1.77. If you are unable to determine your issue, I would suggest opening a new issue with the exact reproduction steps and commands to run.

I'm going to close this issue as the revert only happened on the old beta branch (1.76), and is on track as stabilized in the new beta branch (1.77).