Open Kixunil opened 2 months ago
We had a similar bug in rust-bech32 in our new Hrp::from_display
API. Fortunately we caught it before releasing.
Agreed with your strategy. Ideas for the new method name:

* `invalid`
* `invalid_character`
* `character`
* `char_` (lol)

`character` sounds least bad to me.
I had a play with this, which led me to the following view: I hacked up a `validate_hex_string` function, but it is so trivial and undiscoverable that users will likely just get stuck debugging, then write their own once they think of it.
```rust
/// Validates that the input string `s` contains only valid hex characters.
pub fn validate_hex_string(s: &str) -> Result<(), NonHexDigitError> {
    for c in s.chars() {
        if !c.is_ascii_hexdigit() {
            return Err(NonHexDigitError { invalid: c });
        }
    }
    Ok(())
}
```
Perhaps a middle ground would be to add this to the crate docs under some sort of `# Debugging` section?
UTF-8 is designed so that as soon as you hit a non-ASCII byte, you know it's the start of a non-ASCII character (given that everything before it was ASCII). So we shouldn't need to use the slow `chars` iterator or anything. Only when we hit a bad character do we need to do some slow work to extract the full character that we've run into. (Taking `s[pos..].chars().next().unwrap()` would work, for example, if we know the byte position of the bad character, and tracking that position won't slow us down at all.)
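A minimal sketch of that idea. The `InvalidChar` error type and the function name here are hypothetical, purely for illustration; the point is that the hot loop only looks at bytes, and the `char` decoding happens solely on the error path:

```rust
/// Hypothetical error type for illustration.
#[derive(Debug, PartialEq)]
pub struct InvalidChar {
    pub c: char,
    pub pos: usize,
}

/// Validate hex input by scanning bytes; only on failure do we pay the
/// cost of decoding the offending UTF-8 character.
pub fn validate_hex(s: &str) -> Result<(), InvalidChar> {
    for (pos, byte) in s.bytes().enumerate() {
        if !byte.is_ascii_hexdigit() {
            // Every byte before `pos` was ASCII, so `pos` is a char
            // boundary and slicing the original `&str` here is fine.
            let c = s[pos..].chars().next().unwrap();
            return Err(InvalidChar { c, pos });
        }
    }
    Ok(())
}
```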
That sounds feasible, I'll have a play with it. Thanks
Another possible approach that doesn't require explicitly tracking the position and relies on iterators only:
```rust
// We're not using `.bytes()` because it doesn't have the `as_slice` method.
let mut bytes = s.as_bytes().iter();
// `clone()` is actually a copy and it allows peeking behavior without losing
// access to the `as_slice` method on the iterator.
while let Some(byte) = bytes.clone().next() {
    match decode_hex_digit(byte) {
        Some(value) => { /* do something with the value here */ },
        None => {
            // SAFETY: all bytes up to this one were ASCII, which implies valid
            // UTF-8, and the input was valid UTF-8 to begin with, so splitting
            // the string here is sound. We're also making sure we split it at
            // the correct position by cloning the iterator.
            let remaining_str = unsafe { core::str::from_utf8_unchecked(bytes.as_slice()) };
            return Err(InvalidChar {
                c: remaining_str.chars().next().unwrap(),
                pos: s.len() - remaining_str.len(),
            });
        },
    }
    bytes.next();
}
```
:eyes: can you really call `.next()` directly on a byte slice like that?

Lol, I forgot to write `.iter()`. :D
Ah! The `.iter()` is on `bytes`. Gotcha. Yep, that would work. I don't feel strongly in favor of either solution.
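For reference, here is a self-contained, compilable fleshing-out of the iterator-based sketch above. The `decode_hex` wrapper, the stand-in `decode_hex_digit`, and the `InvalidChar` type are all hypothetical, not the crate's real API:

```rust
/// Hypothetical error type for illustration.
#[derive(Debug, PartialEq)]
pub struct InvalidChar {
    pub c: char,
    pub pos: usize,
}

/// Stand-in for the crate's real digit decoder: `Some(value)` for a hex
/// digit, `None` otherwise.
fn decode_hex_digit(byte: u8) -> Option<u8> {
    (byte as char).to_digit(16).map(|d| d as u8)
}

/// Decode each hex digit, reporting the full offending `char` and its byte
/// position on failure, without tracking the position explicitly.
pub fn decode_hex(s: &str) -> Result<Vec<u8>, InvalidChar> {
    let mut out = Vec::new();
    // Not using `.bytes()` because it lacks the `as_slice` method.
    let mut bytes = s.as_bytes().iter();
    // `clone()` is a cheap copy; it gives peek-like behavior while keeping
    // `as_slice` available on the original iterator.
    while let Some(&byte) = bytes.clone().next() {
        match decode_hex_digit(byte) {
            Some(value) => out.push(value),
            None => {
                // SAFETY: all bytes consumed so far were ASCII, which implies
                // valid UTF-8, and the input was valid UTF-8 to begin with,
                // so splitting the string here is sound.
                let remaining = unsafe { core::str::from_utf8_unchecked(bytes.as_slice()) };
                return Err(InvalidChar {
                    c: remaining.chars().next().unwrap(),
                    pos: s.len() - remaining.len(),
                });
            }
        }
        bytes.next();
    }
    Ok(out)
}
```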
FTR I was also considering having `enum InvalidChar { Utf8(char), Other(u8) }` to allow parsing raw `&[u8]` slices without prior UTF-8 validation (e.g. when parsing from `stdin`). But I think it's better to have separate types for those cases.
> But I think it's better to have separate types for those cases.

What do you mean by this?
Basically, if we ever add functions that take `&[u8]`, we would make them return separate error types to indicate that the bytes may not be UTF-8.
Ah, yeah, that's probably best.
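What those separate types could look like, as a rough sketch (all names are hypothetical): the `&str` error can always promise a full `char`, while the `&[u8]` error can only promise a byte, since the input may not be UTF-8 at all:

```rust
/// Error for `&str` input: the input is known-valid UTF-8, so the full
/// offending `char` is always available. (Hypothetical type.)
#[derive(Debug, PartialEq)]
pub struct InvalidChar {
    pub c: char,
    pub pos: usize,
}

/// Error for `&[u8]` input: the offending byte may not be valid UTF-8,
/// so only the raw byte is reported. (Hypothetical type.)
#[derive(Debug, PartialEq)]
pub struct InvalidByte {
    pub byte: u8,
    pub pos: usize,
}

/// Hypothetical byte-slice variant of the validator.
pub fn validate_hex_bytes(bytes: &[u8]) -> Result<(), InvalidByte> {
    for (pos, &byte) in bytes.iter().enumerate() {
        if !byte.is_ascii_hexdigit() {
            return Err(InvalidByte { byte, pos });
        }
    }
    Ok(())
}
```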
If the string contains a multi-byte UTF-8 char then the error returns a confusing value unsuitable for the user. This change is kinda breaking but we can phase it in slowly: add an `invalid_char_human(&self) -> char` method (please invent a better name!), implement the `char` -> `u8` conversion inside `invalid_char` by encoding the char and returning its first byte, and deprecate it. I suspect this may need a bunch of changes to be able to get the entire char though.