rust-bitcoin / hex-conservative

Hex crate with a conservative MSRV and dependency policy

Return `char` instead of `u8` from `InvalidCharError::invalid_char` #100

Open Kixunil opened 2 months ago

Kixunil commented 2 months ago

If the string contains a multi-byte UTF-8 character then the error returns a confusing value that is unsuitable for showing to the user.

This change is kinda breaking but we can phase it in slowly: add an invalid_char_human(&self) -> char method (please invent a better name!), implement the char -> u8 conversion inside invalid_char by encoding the char and returning its first byte, and deprecate it.

I suspect this may need a bunch of changes to be able to get the entire char though.
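
A rough sketch of how that phased transition could look (the struct internals and the deprecation note are made up for illustration, not the actual crate code):

// Hypothetical sketch of the phased API change; internal details are made up.
pub struct InvalidCharError {
    invalid: char,
}

impl InvalidCharError {
    /// New accessor: returns the full (possibly multi-byte) offending character.
    pub fn invalid_char_human(&self) -> char { self.invalid }

    /// Old accessor, kept for compatibility: encodes the character and returns
    /// its first UTF-8 byte.
    #[deprecated(note = "use invalid_char_human instead")]
    pub fn invalid_char(&self) -> u8 {
        let mut buf = [0u8; 4];
        self.invalid.encode_utf8(&mut buf);
        buf[0]
    }
}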

apoelstra commented 2 months ago

We had a similar bug in rust-bech32 in our new Hrp::from_display API. Fortunately we caught it before releasing.

Agreed with your strategy. Any ideas for the new method name?

Kixunil commented 2 months ago

character sounds least bad to me.

tcharding commented 1 month ago

I had a play with this, which led me to the following view: I hacked up a validate_hex_string function, but it is so trivial and undiscoverable that users will likely just get stuck debugging and then write their own once they think of it.


/// Validates that input `string` contains valid hex characters.
pub fn validate_hex_string(s: &str) -> Result<(), NonHexDigitError> {
    for c in s.chars() {
        if !c.is_ascii_hexdigit() {
            return Err(NonHexDigitError { invalid: c });
        }
    }
    Ok(())
}
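
Hypothetical usage of the above (assumes the error type exposes the character it found):

match validate_hex_string("deadbeefжcafe") {
    Ok(()) => println!("all characters are valid hex"),
    Err(e) => println!("found non-hex character: {:?}", e.invalid),
}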

Perhaps a middle ground would be to add this to crate docs under some sort of # Debugging section?

apoelstra commented 1 month ago

UTF-8 is designed so that as soon as you hit a non-ASCII byte you know it's the start of a non-ASCII character. So we shouldn't need to use the slow chars iterator or anything. It's only when we hit a non-ASCII byte that we may need to do some slow work to extract the full character we've run into. (Taking s[pos..].chars().next().unwrap() would work, for example, if we know the byte position of the bad character, and tracking that position won't slow us down at all.)
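
A minimal sketch of that, assuming we already track the byte position pos of the offending byte (the function and example are illustrative, not crate code):

/// Given a valid UTF-8 string and the byte position of the offending byte,
/// recover the full character. This only runs on the error path, so the extra
/// work does not slow down the happy path.
fn offending_char(s: &str, pos: usize) -> char {
    // `pos` is a char boundary because every byte before it was ASCII.
    s[pos..].chars().next().expect("pos is within the string")
}

fn main() {
    let s = "deadbeefжcafe";
    // Byte position of the first non-hex byte, found by a plain byte scan.
    let pos = s.bytes().position(|b| !b.is_ascii_hexdigit()).unwrap();
    assert_eq!(offending_char(s, pos), 'ж');
}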

tcharding commented 1 month ago

That sounds feasible, I'll have a play with it. Thanks

Kixunil commented 1 month ago

Another possible approach that doesn't require explicitly tracking the position and relies on iterators only:

// We're not using `.bytes()` because it doesn't have the `as_slice` method.
let mut bytes = s.as_bytes().iter();
// clone() is actually a copy and it allows peeking behavior without losing access to the `as_slice` method on the iterator.
while let Some(byte) = bytes.clone().next() {
    match decode_hex_digit(byte) {
        Some(value) => { /* do something with the value here */ },
        None => {
            // SAFETY: all bytes up to this one were ASCII, which implies valid UTF-8, and the input was valid UTF-8 to begin with, so splitting the string here is sound. We also make sure we split it at the correct position by cloning the iterator.
            let remaining_str = unsafe { core::str::from_utf8_unchecked(bytes.as_slice()) };
            return Err(InvalidChar { c: remaining_str.chars().next().unwrap(), pos: s.len() - remaining_str.len() });
        },
    }
    bytes.next();
}

apoelstra commented 1 month ago

:eyes: can you really call .next() directly on a byteslice like that?

Kixunil commented 1 month ago

LOL, I forgot to write .iter(). :D

apoelstra commented 1 month ago

Ah! The .iter() is on bytes. Gotcha. Yep, that would work. I don't feel strongly in favor of either solution.

Kixunil commented 1 month ago

FTR I was also considering having enum InvalidChar { Utf8(char), Other(u8) } to allow parsing raw &[u8] slices without prior UTF-8 validation (e.g. when parsing from stdin). But I think it's better to have separate types for those cases.

apoelstra commented 1 month ago

> But I think it's better to have separate types for those cases.

What do you mean by this?

Kixunil commented 1 month ago

Basically, if we ever add functions that take &[u8] we would make them return separate error types, to indicate that the bytes may not be UTF-8.
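
A rough sketch of what that split could look like (names are hypothetical, not an agreed design):

/// Error for &str input: the input is known to be UTF-8, so we can always
/// report the full offending character.
pub struct InvalidCharError {
    pub invalid: char,
    pub pos: usize,
}

/// Error for &[u8] input: the bytes may not be UTF-8, so all we can report
/// is the offending byte.
pub struct InvalidByteError {
    pub invalid: u8,
    pub pos: usize,
}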

apoelstra commented 1 month ago

Ah, yeah, that's probably best.