
Tracking Issue for `ascii::Char` (ACP 179) #110998

Open scottmcm opened 1 year ago

scottmcm commented 1 year ago

Feature gate: #![feature(ascii_char)] #![feature(ascii_char_variants)]

This is a tracking issue for the ascii::Char type from https://github.com/rust-lang/libs-team/issues/179

https://doc.rust-lang.org/nightly/std/ascii/enum.Char.html

Public API

// core::ascii

#[repr(u8)]
enum Char {
    Null = 0,
    …
    Tilde = 127,
}

impl Debug for Char { … }
impl Display for Char { … }
impl Default for Char { ... }

impl Step for Char { ... } // so `Range<Char>` is an Iterator

impl Char {
    const fn from_u8(x: u8) -> Option<Self>;
    const unsafe fn from_u8_unchecked(x: u8) -> Self;
    const fn digit(d: u8) -> Option<Self>;
    const unsafe fn digit_unchecked(d: u8) -> Self;
    const fn as_u8(self) -> u8;
    const fn as_char(self) -> char;
    const fn as_str(&self) -> &str;
}

impl [Char] {
    const fn as_str(&self) -> &str;
    const fn as_bytes(&self) -> &[u8];
}

impl From<Char> for u8 { … }
impl From<Char> for char { … }
impl From<&[Char]> for &str { … }

// core::array

impl<const N: usize> [u8; N] {
    const fn as_ascii(&self) -> Option<&[ascii::Char; N]>;
    const unsafe fn as_ascii_unchecked(&self) -> &[ascii::Char; N];
}

// core::char

impl char {
    const fn as_ascii(&self) -> Option<ascii::Char>;
}

// core::num

impl u8 {
    const fn as_ascii(&self) -> Option<ascii::Char>;
}

// core::slice

impl [u8] {
    const fn as_ascii(&self) -> Option<&[ascii::Char]>;
    const unsafe fn as_ascii_unchecked(&self) -> &[ascii::Char];
}

// core::str

impl str {
    const fn as_ascii(&self) -> Option<&[ascii::Char]>;
}
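
A quick usage sketch of the above, on nightly with the feature gates (names as currently listed; several of them are still being discussed below):

#![feature(ascii_char, ascii_char_variants)]

use std::ascii;

fn main() {
    // Fallible conversion from a byte: values 128..=255 give None.
    let a = ascii::Char::from_u8(b'A').unwrap();
    assert_eq!(char::from(a), 'A');
    assert_eq!(u8::from(a), 0x41);

    // A validated byte array can be viewed as ASCII, and ASCII slices can
    // be viewed as &str without any further UTF-8 checking.
    let bytes = *b"hello";
    let chars: &[ascii::Char; 5] = bytes.as_ascii().unwrap();
    assert_eq!(chars.as_str(), "hello");

    // With ascii_char_variants, the named variants participate in matching.
    match a {
        ascii::Char::CapitalA => {}
        _ => unreachable!(),
    }
}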

Steps / History

Unresolved Questions

BurntSushi commented 1 year ago

I guess I'll kick off the discussion about how to actually define the ascii::Char type. Right now, it's an enum like this:

#[repr(u8)]
enum Char {
    Null = 0,
    …
    Tilde = 127,
}

I'd like to first make sure I understand the pros of using this representation. Do I have them right?

  1. It provides a niche optimization such that Option<ascii::Char> is the same size as ascii::Char.
  2. It provides a way to write ch as u8 where ch has type ascii::Char.
  3. For "free," you can get cute and write the names of ASCII characters instead of their visual or numerical representation. (I don't mean to trivialize this by calling it "cute." I think it's a nice benefit, especially for non-printable characters.)

Are there more? Am I mistaken about the above?

And now, the mitigations:

  1. I think the niche optimization can still be obtained in this case because std can use unstable features? And I think there's a way to say, "make this specific value the niche" without any other additional cost. (A rough sketch follows this list.)
  2. We can provide an ascii::Char::as_u8() method. (Which is already done in the implementation PR.) Is there anything else that being able to write ch as u8 buys us?
  3. I don't think there is a mitigation here. It feels like a "nice to have" aspect of using an enum that we probably wouldn't go out of our way to re-create via other means.
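
A rough sketch of mitigation 1 (using the rustc-internal attribute family the NonZero types rely on; the exact attribute name and semantics below are illustrative and unchecked, and only std can use them anyway):

// Hypothetical newtype alternative to the enum. Requires the rustc-internal
// attributes (i.e. being inside std), and construction needs unsafe because
// of the declared valid range.
#[rustc_layout_scalar_valid_range_end(127)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct AsciiChar(u8);

impl AsciiChar {
    pub const fn from_u8(x: u8) -> Option<Self> {
        if x <= 127 {
            // SAFETY: x is within the declared valid range 0..=127.
            Some(unsafe { Self(x) })
        } else {
            None
        }
    }

    pub const fn as_u8(self) -> u8 {
        self.0
    }
}

// Option<AsciiChar> is then still one byte, since 128..=255 act as niches.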

In terms of negatives, why not do the enum in the first place? Good question. I'm not entirely sure. It feels to me like it stresses my weirdness budget. It's also a generally much bigger enum than most, and I wonder whether that will provoke any surprises in places. Maybe not.

ChayimFriedman2 commented 1 year ago

Do we need both u8::as_ascii() and ascii::Char::from_u8()?

scottmcm commented 1 year ago

Regarding the enum, @kupiakos said the following in

The names or properties of ASCII characters will never change and there aren't many of them, and so we might as well expose them as variants that can participate in pattern matching.

And mentioned icu4x's enum,

https://github.com/unicode-org/icu4x/blob/b6c4018a736e79790898c5b91ff3ab25d33192c2/utils/tinystr/src/asciibyte.rs#L8-L11

Though given those names it's possible it's just doing that for the niche and not for direct use.

(Man I wish we had private variants as a possibility.)

To your questions, @BurntSushi ,

  1. Yeah, there are at least three other possibilities here: a) put the enum in a newtype (like ptr::Alignment) to still get the niches without exposing the variants, b) use the rustc-internal magic, or c) just don't get a niche for now, since that's not needed for its core scenario of convertibility to UTF-8 (though it's certainly nice to have)

  2. Given that we're overall not fond of as casts, and it also allows weirder things like as i64, I actually wish the cast could be disabled, and have people use as_u8 as you mention. But that might also just happen by deprecating or warning on these casts, and eventually having ascii::Char: AsRepr<Repr = u8> does seem reasonable.

  3. Now that #105076 has landed, I've been thinking of redoing the internals of those using ascii::Char to try out how it feels using the names. That would mean code like

            '\0' => EscapeDebug::backslash(Null),
            '\r' => EscapeDebug::backslash(CarriageReturn),
            '\n' => EscapeDebug::backslash(LineFeed),

because without the variants it ends up being more like

            '\0' => EscapeDebug::backslash(b'\0'.as_ascii().unwrap()),
            '\r' => EscapeDebug::backslash(b'\r'.as_ascii().unwrap()),
            '\n' => EscapeDebug::backslash(b'\n'.as_ascii().unwrap()),

which is nice in that it keeps the escape sequences lining up, but having to copy-paste .as_ascii().unwrap() all over like that doesn't fill me with joy.

I guess this is yet another situation in which I'd like custom literals. Or something like a'\n', I guess, but that's way too far a jump to propose right now.

EDIT a while later: The PR to update EscapeDebug like that is #111524

clarfonthey commented 1 year ago

I wasn't in the discussion when this type was created (which I fully support), but is there a reason to explicitly use an enum instead of a newtype + internal niche attribute like NonZeroU8?

If we decide to go the pattern types route (#107606) then NonZeroU8 could become u8 is 1.. and this could become u8 is ..128. Alternatively, if we go the generic integers route (#2581) then this could become uint<7> or just u7.

IMHO, explicitly going with an enum kind of prevents these kinds of optimisations later down the road, since it forces an enum as the representation.

BurntSushi commented 1 year ago

@clarfonthey I think the discussion above answers your question about why we went with the enum for now. In summary, I think its principal upside is that it gives nice names to each character. Not particularly compelling, but nice.

My general feeling is that nobody is stuck on using an enum here. My best guess is that unless something compelling comes up, I wouldn't be surprised if we changed to an opaque representation such that we could change to an enum later in a compatible manner. (I think that's possible?) Purely because it's the conservative route.

I don't grok the second two paragraphs of your comment. Probably because I don't know what "pattern types" are. I also don't know what uint<7> means. (Don't have time to follow your links at the moment.)

clarfonthey commented 1 year ago

I mean, having nice names is still achievable with constants, and this still would make match expressions work as you'd expect. Personally, the benefits I'm seeing here are the ability to have more-performant code without the need for extra unsafe code, since any slice of ASCII characters is automatically valid UTF-8. Otherwise, a type invariant would mostly be unnecessary, and you could just have constants for the names.
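
Roughly what I mean, as a sketch (the type name, inner representation, and constant names here are placeholders):

#[derive(Clone, Copy, PartialEq, Eq)]
pub struct AsciiChar(u8); // invariant: 0..=127, upheld by the constructors

impl AsciiChar {
    pub const NULL: Self = Self(0);
    pub const LINE_FEED: Self = Self(0x0A);
    pub const TILDE: Self = Self(0x7E);
}

fn describe(c: AsciiChar) -> &'static str {
    // Constants work in patterns because the type derives PartialEq and Eq
    // (structural equality), so exhaustiveness aside, matching reads fine.
    match c {
        AsciiChar::NULL => "NUL",
        AsciiChar::LINE_FEED => "newline",
        _ => "something else",
    }
}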

Explaining the two proposals:

The pattern type version is very enticing because it very naturally allows exhaustive matching, while still just being a compile-time constraint. And like I said, we could still have constants for the different names for characters without really affecting things.

BurntSushi commented 1 year ago

having nice names is still achievable

Yes, as I said:

I don't think there is a mitigation here. It feels like a "nice to have" aspect of using an enum that we probably wouldn't go out of our way to re-create via other means.

I don't know what the stabilization story is for pattern types or generic integers, but I would prefer we focus on finding a path to stabilization that doesn't require being blocked on hypothetical language features. Or at least, focusing discussion on whether we can utilize them later in a backwards compatible fashion.

I still don't really know what pattern types are (I just don't have time right now to grok it), but generic integer types seem like an implementation detail that's consistent with defining ascii::Char as an opaque type?

clarfonthey commented 1 year ago

Right, the point here is that any opaque type would be equivalent to the primary hypothetical future implementations. So, explicitly not blocking on any particular path forward, but making sure that they're compatible.

scottmcm commented 1 year ago

I mean, having nice names is still achievable with constants, and this still would make match expressions work as you'd expect.

Though note that you can import a variant with use, but can't import an associated constant. Thus use ascii::Char::*; works with an enum, but not with constants for the names.

I agree that this could absolutely be done with the magic attributes. But is there anything in particular that you think is better with that approach?

The enum version still gets the niche, notably:

[src/main.rs:3] std::alloc::Layout::new::<Option<Option<Option<Option<Option<std::ascii::Char>>>>>>() = Layout {
    size: 1,
    align: 1 (1 << 0),
}
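
(That output comes from something like the following on nightly:)

#![feature(ascii_char)]

fn main() {
    // Five nested Options still fit in one byte, because the u8 values
    // 128..=255 are free to encode the None cases.
    dbg!(std::alloc::Layout::new::<Option<Option<Option<Option<Option<std::ascii::Char>>>>>>());
}
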
clarfonthey commented 1 year ago

I will admit that the lack of an ability to use associated constants is an immediate downgrade, although I do think that'll eventually be allowed, so that we can deprecate the std::fN::consts modules.

As far as benefits -- the main ones I see are the ability to unify this type with any one of the potential future features I mentioned. If we stabilise an enum now, we are explicitly locked out of this.

That said, there is precedent for separating out the char type from the other integer types, and I guess that ascii::Char fits right in with that. I'm just not convinced that locking in this as an enum is worth losing the benefits we could get later, especially considering how it could be converted from an opaque type into an enum publicly later if we eventually decide that is better than the alternatives.

scottmcm commented 1 year ago

I think it's important that this stay a separate type. To me, it's an advantage that this not just be a u7, but that it have the semantic intent behind it of actually being ASCII.

For example, if we have pattern types, I don't want to impl Display for &[u8 is ..128], but I think it's possible that we'd add impl Display for &[ascii::Char]. (Like how Debug for char is very different from Debug for a hypothetical u32 is ..=1114111.)

Should it be something else to be able to stabilize a subset sooner? Maybe; I don't know. I think I'll leave questions like that more to libs-api.

clarfonthey commented 1 year ago

That's fair, and in that case I'll concede that the enum works here. I'm still not sure it's the best-performing version (I use the bounded-integer crate for all sorts of things, and it uses enums as its codegen), but I feel convinced enough that we can stabilise this as a separate type.

safinaskar commented 1 year ago

I don't like the name Char. It is too similar to char, even though they are totally different things. I propose ASCII instead.

safinaskar commented 1 year ago

If you don't like the name ASCII, then name it Septet, but, please, not Char.

BurntSushi commented 1 year ago

I don't think ASCII or Septet are going to work. I can't imagine those being the names we end up with personally.

ascii::Char and char are not totally different things. They both represent abstract characters. One is just a more limited definition than the other, but both are in fact quite limited! ascii::Char is namespaced under the ascii module, which provides pretty adequate context for what kind of character it represents IMO.

clarfonthey commented 1 year ago

Honestly, the name Ascii for a type would make sense since it lives in the ascii module, mirroring other existing Rust types. It doesn't seem out of place to refer to an ASCII character as "an Ascii," especially considering how many existing terms use "ASCII" as a placeholder for "text," as in "ASCII art." Nowadays, people even refer to Unicode art as ASCII art.

In terms of precedent for the ascii::Char name, I would point toward the various Result types in modules across libstd and the ecosystem, io::Result and fmt::Result being two common examples. While the base Result is in the prelude, these module-specific Result types are usually referred to through their module qualifier, although in rare cases code will import them directly. I don't think this is a good analogy, however, since ascii::Char and char actually are completely different types with entirely different properties, not just shorthands for each other, even though you can convert one into the other.

The only real downside I can see to using the name Ascii is that it converts an acronym into a word and lowercases most of the letters, which… didn't stop us for stuff like Arc and Rc and OsStr.

Personally, I've thought that the core::ascii module has been mostly useless since AsciiExt was deprecated, until the addition of this character type; revitalising it by introducing a type called Ascii would make the most sense to me. Plus, it would even open a path toward adding this type to the prelude if it were considered useful there, or maybe to the preludes of other crates that rely on it.

BurntSushi commented 1 year ago

Using Ascii as a singular noun sounds very strange to me. "ASCII art" is using "ASCII" as an adjective. I'm not aware of any wide usage of "Ascii" as a singular noun referring to a single character. Its usage as a noun, in my experience, refers to a character set and/or encoding. Not a member of that set/encoding.

clarfonthey commented 1 year ago

I mean, it does feel weird, but then again, so does calling a vector a "vec" until you say it enough times. If you think of "ASCII" as shorthand for "ASCII character", I don't think that's too far-fetched.

Oh, and even "char" is an abbreviation.

kupiakos commented 1 year ago

2. Given that we're overall not fond of as casts, and it also allows weirder things like as i64, I actually wish the cast could be disabled

I see no issue with this; ascii::Char, or whatever it's called, can be fully represented by every numeric type except bool. The issue with as casts changing the value is irrelevant for this case.

I mean, it does feel weird

This message is composed of 420 ASCIIs.

bluebear94 commented 1 year ago

I also support calling this type Ascii because calling it Char is likely to give worse diagnostics if you mix up char and Char. The Char name would also result in confusion to readers unless you qualify it as ascii::Char, which is more verbose than Ascii.

Also, rustdoc currently doesn’t play well with types that are exported under a different name from their declaration; for instance, this method shows the return type as AsciiChar, which is the name used in the declaration of this type.

(Note: I dislike the convention of having multiple types named Error and Result in the standard library and would rather have had FmtError, IoError, and so on, so my opinions might be biased.)

clarfonthey commented 1 year ago

I'm actually kind of fine with io::Result and io::Error since you tend to have lots of different error and result types, whereas something as fundamental as ascii::Char feels more like something you'd want to avoid qualifying on import.

Hence why I think that exporting it as just Ascii is best, since AsciiChar feels "redundant" in some sense. Even though "Ascii" by itself could refer to a string, since the string version is so easy to create ([Ascii]), it feels okay. Plus, it means eventually it could be even included in prelude, if we felt like it.

joshtriplett commented 1 year ago

FWIW, I really don't think we should expose these character names as enum variants to the user. We should encourage just using character constants; if we need a way to write this type directly (rather than b'.' which produces a u8), we should consider giving a special character syntax for that (e.g. a'.') rather than named enum variants.

joshtriplett commented 1 year ago

Using Ascii as a singular noun sounds very strange to me. "ASCII art" is using "ASCII" as an adjective. I'm not aware of any wide usage of "Ascii" as a singular noun referring to a single character. Its usage as a noun, in my experience, refers to a character set and/or encoding. Not a member of that set/encoding.

Agreed. calling it AsciiChar seems preferable, to give clear error messages and give a type that makes sense unqualified.

BurntSushi commented 1 year ago

Using Ascii as a singular noun sounds very strange to me. "ASCII art" is using "ASCII" as an adjective. I'm not aware of any wide usage of "Ascii" as a singular noun referring to a single character. Its usage as a noun, in my experience, refers to a character set and/or encoding. Not a member of that set/encoding.

Agreed. calling it AsciiChar seems preferable, to give clear error messages and give a type that makes sense unqualified.

I still personally like ascii::Char. It avoids the redundancy in naming. I concede that its unqualified use may be confusing, but I think the antidote to that is to use its qualified form.

kupiakos commented 1 year ago

FWIW, I really don't think we should expose these character names as enum variants to the user. We should encourage just using character constants; if we need a way to write this type directly (rather than b'.' which produces a u8), we should consider giving a special character syntax for that (e.g. a'.') rather than named enum variants.

I agree that for eventual stabilization a dedicated a'.' for AsciiChar and a"." for &[AsciiChar; N] would be best, but for now, in nightly, these variants should be exposed until such a syntax is accepted and merged. Otherwise, there's no infallible way to construct an AsciiChar and no easy way to match on one; you have to write AsciiChar::from_u8(b'A').unwrap() (even if it optimizes away) or match on u8 instead of AsciiChar.

kupiakos commented 1 year ago

Agreed. calling it AsciiChar seems preferable, to give clear error messages and give a type that makes sense unqualified.

I'm pro-AsciiChar though not anti-ascii::Char. A name that's easier to understand unqualified seems less risky for the stdlib.

kupiakos commented 1 year ago

    const fn as_u8(self) -> u8;
    const fn as_char(self) -> char;

Hmm, shouldn't these be named to_*? The API guidelines describe as_ as used for borrowed -> borrowed operations, and to_ for owned -> owned operations on Copy types. Also see:

joshtriplett commented 1 year ago

Zero objection to these variants being exposed in nightly, just to the idea of stabilizing them.

scottmcm commented 1 year ago

I see no issue with this; ascii::Char, or whatever it's called, can be fully represented by every numeric type except bool.

This is an interesting point. Using x as ffi::c_char would work properly no matter whether char is signed or unsigned on the target, for example. (Well, barring ebcdic, I guess, if you care about such things.)

ChrisDenton commented 1 year ago

One thing that occurs to me: it might be useful to have encode_utf16 implemented for [AsciiChar], because that's simply a u8 -> u16 widening for each char.
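
Something along these lines, as a sketch (using the as_u8 name from the listing above; not a proposed signature):

#![feature(ascii_char)]

use std::ascii;

// Every ASCII char is a single UTF-16 code unit, so encoding is just a
// widening cast per element; no surrogate handling is needed.
fn encode_utf16_ascii(s: &[ascii::Char]) -> impl Iterator<Item = u16> + '_ {
    s.iter().map(|c| c.as_u8() as u16)
}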

clarfonthey commented 1 year ago

Isn't c.as_u8() as u16 good enough for that?

Ideally we'd be able to just say c as u16 but I'm not sure if that works at the moment.

ChrisDenton commented 1 year ago

For sure, but it occurred to me that people would reach for .as_str().encode_utf16() as the most convenient way unless they knew about UTF-16 encoding and were thinking about perf.

clarfonthey commented 1 year ago

re: #113295, Step is now implemented for ascii::Char as well.

Kimundi commented 1 year ago

About naming: I agree that a short, unique, unqualified name would probably be best, due to the expected usage and the possibility of having it in the prelude at some point. However, seeing how Ascii and Char are two extremes that each have issues, how about we split the difference and call it Achar? It would not be the first type to compromise by using a single-character prefix to differentiate it: iN vs uN, C/C++'s char, wchar_t, char8_t, char16_t, char32_t, etc.

We could also maybe add a type Astr = [Achar]; in the future then.


About the implementation that is currently in std: I see that Debug is derived on the enum, while Display uses the string repr. Would it not make sense to have the Debug output be like char's?

Consider a Debug printout of a [ascii::Char] - which is more useful:
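
Roughly, the derived impl prints variant names, while a char-like impl would print escaped characters (illustrative output):

[CapitalH, SmallI, ExclamationMark]
['H', 'i', '!']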

bluebear94 commented 1 year ago

I disagree with calling it Achar (though I at least prefer it to Char), but I agree with changing the Debug impl to print the string repr.

rben01 commented 1 year ago

If going the Achar route, I think it should instead be spelled AChar because “ASCII” and “Char” are two separate words. Or even achar, all lowercase, as this type feels pretty primitive. That said, I'd prefer AsciiChar (or even Char7 😱). Not a fan of Ascii — it sounds like it refers to an ASCII string, not a single char.

Regarding Debug printing, wrapping the result of char::escape_default in single quotes seems like it'd be optimal (except maybe " should be '"' instead of '\"'), regardless of the underlying implementation of ascii::Char.

One question I have is if it would be possible to match against ranges of ascii::Char. I could see something like match c { Char::CapitalA..=Char::CapitalZ => ..., /* other cases here */ } being useful, but currently range patterns only support char and numeric types. Without this, I could see people falling back to match c.as_u8() { b'A'..=b'Z' => ..., /* other cases here */, 128..=255 => unreachable!() } to satisfy the compiler.

EDIT: Is there a reason that matching against any range pattern a..=b of a PartialOrd type isn't just equivalent to x if (a..=b).contains(&x)?

AngelicosPhosphoros commented 10 months ago

EDIT: Is there a reason that matching against any range pattern a..=b of a PartialOrd type isn't just equivalent to x if (a..=b).contains(&x)?

Yes, range-based matching allows the compiler to check matches for exhaustiveness.

clarfonthey commented 10 months ago

You can't match ranges of any custom-defined type at the moment because, while we do have a mechanism for doing this (see: #74446) with value matches (deriving Eq will let you do this), it does not work for range matches (there's no equivalent for deriving Ord).

I do think that it would be nice to have ranges work for this, though, especially given how the values are effectively integers. It's quite unfortunate that you have to call as_u8 in order to make ranges work, since it means that you always need to add an unreachable branch for 128..=255.

Finomnis commented 10 months ago

Missing for usage as generic string parameter:

llogiq commented 10 months ago

Also missing:

robertbastian commented 10 months ago

I haven't seen these mentioned: for this type to be a general replacement for u8 in ASCII contexts, it would need these methods from u8 as well:

Also, something like

const &[Char; N] -> &[u8; N];
const [Char; N] -> [u8; N];

for dealing with fixed sizes, which often happens in protocols.
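
The reference-taking direction is already expressible with the pieces in the listing, if noisily; a sketch (a dedicated const method could also avoid the runtime length check):

#![feature(ascii_char)]

use std::ascii;

// &[ascii::Char; N] -> &[u8; N] by going through the slice methods above.
fn ascii_array_as_bytes<const N: usize>(a: &[ascii::Char; N]) -> &[u8; N] {
    a.as_slice().as_bytes().try_into().unwrap()
}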

Sky9x commented 10 months ago

Thoughts on ascii string/char literals? eg.

const CAPITAL_A: AsciiChar = a'A';
const HEX_DIGITS: &[AsciiChar; 16] = a"0123456789abcdef";

behavior would match that of byte [string] literals, except:

this could be nice to have to avoid "string".as_ascii().unwrap()

(would this require an RFC?)

EDIT: Pre-RFC: https://internals.rust-lang.org/t/pre-rfc-ascii-type-literals/19960

Lucretiel commented 9 months ago

Nitpick: probably most of the new methods on u8 and char and so on should take self by value rather than &self.

kupiakos commented 9 months ago

Various other valuable conversions that I don't see described:

clarfonthey commented 9 months ago

re: #113295, Step is now implemented for ascii::Char as well.

Just want to point this out again since the original description hasn't been edited.

mina86 commented 9 months ago

How about a from_digit method? See https://github.com/rust-lang/rust/pull/118963. Though admittedly, @Sky9x’s proposal of a prefix would alleviate the need for it, since writing a"0123456789abcdef"[digit] isn’t that bad.
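
For comparison, the table lookup is already expressible today without the literal, just more noisily (sketch; assumes d < 16):

#![feature(ascii_char)]

use std::ascii;

// The as_ascii().unwrap() here is exactly the noise an a"…" literal would remove.
fn hex_digit(d: u8) -> ascii::Char {
    b"0123456789abcdef".as_ascii().unwrap()[usize::from(d)]
}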

kupiakos commented 9 months ago

How about a from_digit method? See #118963. Though admittedly, @Sky9x’s proposal of a prefix would alleviate the need for it, since writing a"0123456789abcdef"[digit] isn’t that bad.

There's currently an ascii::Char::digit method, but its API doesn't match char::from_digit with its radix parameter. ascii::Char also has a digit_unchecked method, which doesn't have an equivalent in char.

Unless there's any opposition, I propose that:

The value of consistency in the stdlib cannot be overstated.

scottmcm commented 8 months ago

For the naming discussion, one of the tests has output like this:

---- [ui] tests/ui/try-trait/bad-interconversion.rs stdout ----
diff of stderr:

12    = help: the following other types implement trait `From<T>`:
13              <u8 as From<bool>>
14              <u8 as From<NonZeroU8>>
+               <u8 as From<Char>>
15    = note: required for `Result<u64, u8>` to implement `FromResidual<Result<Infallible, i32>>`

And so long as the compiler is showing that as just "<u8 as From<Char>>", that's a bit awkward to read, and might be evidence that it would be nice for the type name, not just the namespace, to be clearer here.

Of course, that might also be a "gee, it would be nice to have an attribute on types to suggest showing with the module", which fmt::Result and such might also want to use. Or for the compiler's "should I show more info" disambiguating logic to be case-insensitive...

mina86 commented 8 months ago

And so long as the compiler is showing that as just <u8 as From<Char>>, that's a bit awkward to read, and might be evidence that it would be nice for the type name, not just the namespace, to be clearer here.

Note though that error messages can be changed while APIs cannot. Long-term, a better API is more important than better diagnostics. (I have no strong opinions on naming, though I lean towards ascii::Char since I dislike it when symbols include the module they belong to in their names.)

clarfonthey commented 8 months ago

My main motivation for naming it Ascii instead of Char or AsciiChar is that it doesn't need the module to describe what it is, but also isn't overly long. Most people will import it and use it unqualified, and it should fully describe what it is in those cases.

Generally, the only cases where types use ambiguous names are when they're specific to a module and not useful outside it. For example, io::Result and io::Error are not general-purpose types for any program, but only for I/O. There's also the case of types that are rarely named, like array::IntoIter and slice::Iter, which are almost always used implicitly as the result of calling methods.

Char doesn't really fit either of these cases: ASCII text is a fundamental data type meant to be used outside the ascii module. Many people will be naming this type directly in their code and shouldn't have to distinguish char from Char, or qualify it as ascii::Char.

Maybe another example is types which have the same name but aren't meant to be used together, e.g. types in sync and async modules. Here, you generally want exactly one module's types, and code that uses both will often be a library trying to provide compatibility rather than any one program.

So, it's this reason I support Ascii, AChar, or AsciiChar, in that order.

Maybe even a better example is BTreeMap and HashMap; we could make these btree::Map and hash::Map but we don't because perfectly reasonable programs might use both.