Closed tmandry closed 9 months ago
w.r.t.
c'…'
character literals
IIRC, C23 accepted N2653, which defines char8_t
as a typedef for unsigned char
. (C++20's char8_t
is a distinct type.)
Assuming I'm correct that N2653 was accepted, it probably makes sense to define Rust's c"…"
as being compatible with C's u"…"
(char8_t[]
) and Rust's c'…'
as being compatible with C's u8'…'
(char8_t
).
To be explicit, this is not intended to argue that c"…"
Rust should be usable as *const c_uchar
, just that c"…".as_ptr()
Rust should have the same memory content as u8"…"
C. It may argue for c"…"
Rust requiring UTF-8 well formedness; I do not know whether C's u8"…"
allows constructing invalid UTF-8 via escapes.
Note that &CStr
's pointer representation uses c_char
Rust, not c_uchar
Rust (which would match char8_t
C). This serves as a counterargument that c
strings in Rust should match bare string/character literals in C, which use unqualified char
C, rather than the C17 (u8'…'
) and C23[^c23] (u8"…"
) prefixed literals.
[^c23]: To be fully accurate to my understanding, C17 added u8"…"
literals of type array of char
, and C23 changed their type to array of char8_t
. Similarly, C17 added u8'…'
literals of type unsigned char
, and C23 changed their type to char8_t
(for no semantic change, as char8_t
is a typedef of unsigned char
).
The presence of UTF-8 literals being unsigned char
in C may additionally add weak motivation for a UTF-8 CStr
type for Rust separate from CStr
, which corresponds to the unqualified c_char
type.
Assuming I'm correct that N2653 was accepted
Yeah, char8_t
was voted into C23.
It may argue for c"…"Rust requiring UTF-8 well formedness; I do not know whether C's u8"…" allows constructing invalid UTF-8 via escapes.
I think it was possible before, and will still be in C23:
"Any hexadecimal escape sequence or octal escape sequence specified in a u8
, u
, or U
string specifies a single char8_t
, char16_t
, or char32_t
value and may result in the full character sequence not being valid UTF-8, UTF-16, or UTF-32."
I do want to note a couple things in the alternatives:
While I like the idea of a FromStringLiteral
, error messages would need at least a character offset to be actually useful beyond a mere compile-time panic!
. Also, &CStr
s are really more like byte string literals, and that offset would have to be a byte offset, not a char
offset, too.
If some platforms use a u8
for c_char
while others use a i8
, you need the c'...'
literal to be assignable to c_char
regardless of which it is in order to be portable. Allowing it to coerce to either i8
or u8
is probably better than having it exactly what c_char
is, though. (Ideally, c_char
should've been #[repr(transparent)] struct c_char(pub i8/u8);
to force people to differentiate the two, but the time to change to that is long past.)
@CAD97 @dead-claudia Would you be willing to open issues for these points and add the F-c_str_literal
label (or tag me and I can add it)? Otherwise they are likely to get lost in the discussion. Tracking issues are meant to serve as a hub to point to other threads, not to be a place for technical discussion themselves.
Created https://github.com/rust-lang/rust/issues/106479 for my first point. The second point's not likely to be relevant before a PR, and there's really not much to discuss there.
Should we make &CStr a thin pointer before stabilizing this? (If so, how?)
I don't think there would be anything about that change that should block this, it would just have to be updated whenever CStr becomes thin. The change itself is blocked, I think https://github.com/rust-lang/rust/issues/81513 would solve it, but it is still quite young. Seems like there hasn't been much discussion on the topic recently anyway - https://github.com/rust-lang/rust/issues/59905 is the only issue I could find, but it's pretty unofficial and stale
c"…"
literals don't work right in proc macros — https://github.com/rust-lang/rust/issues/112820.
I updated the summary comment with recent history.
The implementation of c"…"
, which first landed in nightly-2023-05-06, has been reverted in nightly-2023-07-07 by #113334
Breakage found in the 1.71-beta crater runs is described in #113235
Discussion around a reimplementation is happening in #113333
What work is remaining to stabilize this feature? It looks like the remaining checkboxes are documentation updates and then a stabilization PR -- is that it, or is there a blocker somewhere?
@jmillikin I believe the unresolved questions need to be answered one way or another, even if the answer is "not support this before stabilization and punt for later".
Should the (unstable) https://github.com/rust-lang/rust/issues/87555 accept C string literals? (If so, should it evaluate to a C string or byte string?)
I think the answer for that should be similar as with concat!
macro, so currently NO.
Reasoning is similar to why you can't concat_bytes!(b"foo", "bar")
and can't concat!("foo", b"bar")
.
Not a blocker, but it is a possibility for a discussion in the future.
The broader question would be about how to generalise those macros against all different types of strings in the future? Would be nice to not create another macro for each literal. That is better suited to be discussed somewhere else, perhaps in https://github.com/rust-lang/rust/issues/87555 or on Zulip.
Should we make &CStr a thin pointer before stabilizing this? (If so, how?)
Would be nice, but not a blocker.
I'll take a crack at offering non-official but opinion-ful answers for the unanswered questions.
Also add
c'…'
C character literals? (u8
,i8
,c_char
, or something more flexible?)
Answer: no
Byte b'.'
literals make sense because a b"..."
and &[u8; N]
are isomorphic -- there's no requirements about byte content, so b"abc"
and &[b'a', b'b', b'c']
are clearly equivalent for all purposes.
C string literals c"..."
don't have this property, and it doesn't make sense to try to write something like &[c'a', c'b', c'c']
because that's not a valid C string (no terminal NUL
). The only thing c'.'
literals could offer over today's b'.'
is implicit coercion to c_char
on platforms where that type is signed -- which has some value, but it's not really related to C string literals.
Presumably the parser is flexible enough (given b"..."
and b'.'
co-existing today) that c'.'
literals could be added in the future should the need prove sufficient.
Should we make
&CStr
a thin pointer before stabilizing this? (If so, how?)
Answer: no
It's already possible to create a &CStr
in a const
context (as of Rust v1.59), so this feature isn't providing capabilities or enabling functionality that would need to be walked back if the internal representation of CStr
changes. The following block of code does the same thing as a c"..."
literal and it's valid stable Rust:
const fn c(bytes: &[u8]) -> &core::ffi::CStr {
unsafe { core::ffi::CStr::from_bytes_with_nul_unchecked(bytes) }
}
const HELLO: &core::ffi::CStr = c(b"hello, world!\x00");
Should the (unstable)
concat_bytes
macro accept C string literals? (If so, should it evaluate to a C string or byte string?)
Answer: no
C strings have two equally valid &[u8]
representations (with or without terminal NUL
), so allowing them to be used in concat_bytes!
would be ambiguous.
If people really want to concatenate c"..."
literals for some reason, a concat_cstrs!
macro would avoid the ambiguity because there's only one possible answer for "NUL
or no?" to each &CStr
being concatenated.
Should there be a valid UTF-8 C string type that
c"..."
string literals support?
Answer: sounds useful but unrelated to this feature
Since the &CStr
type represents C strings in an undefined encoding the c"..."
literal syntax should do the same. The ability to have values that assert dual properties of (1) valid C string and (2) valid UTF-8 seems useful, but that would need to be its own type (core::ffi::Utf8CStr
?), and a separate literal syntax (cu"..."
?) to avoid type ambiguity.
After all, every &Utf8CStr
would also be a valid &CStr
, but the opposite would not be true, so if c"..."
could be either one then the following code would be a type error:
impl AsRef<CStr> for Utf8CStr;
fn print_c_str(s: &impl AsRef<CStr>) { ... }
print_c_str(c"\xc3\xb5"); // ambiguous AsRef
Alternatively, if the compiler prioritized &CStr
when inferring the type of c"..."
literals, you'd end up with confusing and spooky behavior:
print_c_str(c"\xc3\xb5"); // does this print "õ" or "õ"? can't tell from call site
I'll take a crack at offering non-official but opinion-ful answers for the unanswered questions.
Also add
c'…'
C character literals? (u8
,i8
,c_char
, or something more flexible?)Answer: no
I actually wouldn't mind having this since I occasionally come across FFI places where I need to compare a single c_char
to a literal, and I can't just use 'y'
or b'y'
because c_char
is signed on most platforms. If c_char: PartialEq<u8>
worked than that would be nice, but type aliases....
Anyway, I think this is something that doesn't need a decision at this time.
Other answers seem reasonable to me, at least I don't think there are any blockers.
One minor concern I have is that the reimplementation of #113476 has only been on stable less than a month and #116124 is not yet in stable. But I think that is fine assuming the stabilization of this feature won't land until 1.76 at the earliest.
I do also kind of like the RFC's proposed idea that let x: &Cstr = "foo";
work instead of adding the c"..."
syntax, especially so macros can accept a CStr
literal where needed and macro users can write an unprefixed string literal (common case for plugin-type things). But that can likely be a separate discussion.
https://github.com/dtolnay/syn/issues/1502 is a big ecosystem thing but it shouldn't block anything
I think that it would be reasonable to open a stabilization PR if you are up for it, should be pretty minimal and the lang team can discuss / FCP there.
Concerns about soak time make sense. I'll hope for the best in regard to timing, but if it has to wait an extra release or two to verify ecosystem compatibility then that's just life.
I think that it would be reasonable to open a stabilization PR if you are up for it, should be pretty minimal and the lang team can discuss / FCP there.
OK, PR https://github.com/rust-lang/rust/pull/117472 created -- this is my first time trying to stabilize a rustc feature, so hopefully I didn't mangle it too badly.
Just noting that this interacts with the recently accepted RFC 3349 (tracking issue), which modifies the allowed escape codes in strings.
I checked the table in that RFC, and I think the rows for c"..."
/cr"..."
would be exactly the same as for b"..."
/br"..."
.
Closing this as per the merged stabilization PR: https://github.com/rust-lang/rust/pull/117472
Stabilization is reverted in #119528, so probably this should be open again?
@rustbot labels +I-lang-nominated
Nominating to discuss and ensure we're OK with the restabilization that will occur automatically if we take no action.
@rustbot labels -I-lang-nominated
We discussed this in the triage call today and agreed, given that #119172 has landed, that we were happy to see this restabilize in Rust 1.77.
Was this fully stabilized for rust-1.77.0_beta? I am failing to bootstrap the beta from the beta tarball with: error[E0658]: c".." literals are experimental
@stefson Yes, it was fully stabilized in 1.77. If you are unable to determine your issue, I would suggest opening a new issue with the exact reproduction steps and commands to run.
I'm going to close this issue as the revert only happened on the old beta branch (1.76), and is on track as stabilized in the new beta branch (1.77).
This is a tracking issue for the RFC "
c"…"
string literals" (rust-lang/rfcs#3348). The feature gate for the issue is#![feature(c_str_literals)]
.Steps / History
Unresolved Questions
c'…'
C character literals? (u8
,i8
,c_char
, or something more flexible?)&CStr
a thin pointer before stabilizing this? (If so, how?)concat_bytes
macro accept C string literals? (If so, should it evaluate to a C string or byte string?)c"..."
string literals support? (comment 1, comment 2)