Prerequisites
Here are a few things you should provide to help me understand the issue:
Rust version : rustc 1.82.0 (f6e511eec 2024-10-15)
nom version : 7.1.3
nom compilation features used: none
Suggestion
The anychar docs say:
Matches one byte as a character. Note that the input type will accept a str, but not a &[u8], unlike many other nom parsers.
There are a few issues with this:
The documentation is unclear:
The phrasing "Matches one byte as a character" contradicts the fact that the function operates on str, which uses the variable-length UTF-8 encoding. What it actually does is match one character from the string, and that character can be multiple bytes long.
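For illustration, a quick check with nom 7.1.3 showing that anychar on a &str consumes a whole multi-byte character, not a single byte:

use nom::character::complete::anychar;
use nom::IResult;

fn main() {
    // 'é' occupies two bytes in UTF-8, yet anychar consumes it as a single char.
    let result: IResult<&str, char> = anychar("école");
    assert_eq!(result, Ok(("cole", 'é')));
}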
It's almost trivial to support &[u8]:
As the Wikipedia article on UTF-8 explains, the number of bytes a character occupies is encoded in its first byte: it equals the number of leading 1 bits (or the character is just 1 byte if the leading bit is 0). After consuming the correct number of bytes, they can be passed to str::from_utf8 for further validation. This would work for both complete and streaming implementations, because one can tell that bytes are missing based on the first byte of the sequence.
Here's the logic for determining how many bytes to consume from a binary stream, sketched as a small helper (the name utf8_len is mine):

// How many bytes the UTF-8 sequence starting with `leading` occupies, based on
// the leading byte's high bits; None if it is not a valid leading byte.
fn utf8_len(leading: u8) -> Option<usize> {
    if leading & 0b1000_0000 == 0 {
        Some(1) // ASCII (1 byte)
    } else if leading & 0b1110_0000 == 0b1100_0000 {
        Some(2) // 2-byte UTF-8 character
    } else if leading & 0b1111_0000 == 0b1110_0000 {
        Some(3) // 3-byte UTF-8 character
    } else if leading & 0b1111_1000 == 0b1111_0000 {
        Some(4) // 4-byte UTF-8 character
    } else {
        None // continuation byte or invalid value, not a UTF-8 leading byte
    }
}
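A few spot checks of that helper, just to show the mapping (a hypothetical test, not part of nom):

#[test]
fn utf8_len_spot_checks() {
    assert_eq!(utf8_len(b'a'), Some(1)); // ASCII
    assert_eq!(utf8_len(0xC3), Some(2)); // leading byte of 'é' (0xC3 0xA9)
    assert_eq!(utf8_len(0xF0), Some(4)); // leading byte of a 4-byte character
    assert_eq!(utf8_len(0xA9), None);    // continuation byte, not a leading byte
}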
After consuming the correct number of bytes, passing them to str::from_utf8 is still necessary: even though the sequence has the correct length, not every byte sequence of that length maps to a valid character, so it's better to delegate that check to the standard library. There's a bit of waste here, because str::from_utf8 checks the length again as well, so it's better to keep the existing &str implementation as is. I'm not familiar enough with the library to tell whether you have a trait that lets you use the input type as a discriminant between implementations, but I think that would be the ideal approach.
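To make the idea concrete, here is a minimal sketch of what such a parser could look like on &[u8], using the utf8_len helper from above; anychar_utf8 is just an illustrative name, and this is not how nom actually dispatches between input types:

use nom::error::{Error, ErrorKind};
use nom::{Err, IResult, Needed};

// Hypothetical &[u8] counterpart to anychar: read the leading byte, work out the
// sequence length, then let str::from_utf8 validate the bytes it covers.
fn anychar_utf8(input: &[u8]) -> IResult<&[u8], char> {
    let first = match input.first() {
        Some(&b) => b,
        None => return Err(Err::Incomplete(Needed::new(1))),
    };
    let len = match utf8_len(first) {
        Some(len) => len,
        None => return Err(Err::Error(Error::new(input, ErrorKind::Char))),
    };
    if input.len() < len {
        // Streaming case: the leading byte says exactly how many more bytes are needed.
        return Err(Err::Incomplete(Needed::new(len - input.len())));
    }
    // A correct length alone is not enough (continuation bytes, overlong forms,
    // surrogates), so delegate the remaining validation to the standard library.
    match core::str::from_utf8(&input[..len]) {
        Ok(s) => Ok((&input[len..], s.chars().next().expect("non-empty valid UTF-8"))),
        Err(_) => Err(Err::Error(Error::new(input, ErrorKind::Char))),
    }
}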