rust-bakery / nom

Rust parser combinator framework
MIT License

Can nom do UTF8 Unicode? #1559

Open · joelparkerhenderson opened this issue 2 years ago

joelparkerhenderson commented 2 years ago

I'm interested in using nom with various character encodings, especially UTF8 Unicode, UTF16 Unicode, etc.

Example:

nom::character::is_alphabetic states "Tests if byte is ASCII alphabetic: A-Z, a-z"

Is there a goal/plan/timeline to make nom savvy about UTF8 Unicode and/or UTF16 Unicode?

Some of the nom documentation is unclear to me, and I'm seeking guidance, please.

mskorkowski commented 2 years ago

Nom is about bytes, which is lower level than utf8/16/...

I'm using nom to parse utf8 files and it works great. For that, just use &str as input. Iteration over &str yields char. If you need more general character classes than ASCII, Rust's char has is_* methods.
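For instance, a minimal sketch of that approach (the function name is mine, not nom's; it just combines nom's take_while1 with the standard library's char::is_alphabetic):

```rust
use nom::bytes::complete::take_while1;
use nom::IResult;

// With &str input the predicate sees `char`s, so Unicode-aware
// classification comes from the standard library for free.
fn unicode_alpha1(input: &str) -> IResult<&str, &str> {
    take_while1(|c: char| c.is_alphabetic())(input)
}

fn main() {
    // 'í' is alphabetic for char::is_alphabetic, unlike the
    // ASCII-only nom::character::is_alphabetic.
    assert_eq!(unicode_alpha1("día!"), Ok(("!", "día")));
}
```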

For utf16 and other encodings you have two paths:

If you need grapheme clusters then you need to do some extra work for grouping, since Rust doesn't support them out of the box (once again, there are crates for this).
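For example, a small sketch assuming one such crate, unicode-segmentation, which groups chars into extended grapheme clusters:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // 'n' followed by U+0303 COMBINING TILDE: two scalar values,
    // but one grapheme cluster "ñ".
    let s = "an\u{303}o";
    let graphemes: Vec<&str> = s.graphemes(true).collect();
    assert_eq!(graphemes, vec!["a", "n\u{303}", "o"]);
    assert_eq!(s.chars().count(), 4);
}
```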

joelparkerhenderson commented 2 years ago

Thank you, that's good info. Based on what you write ("Nom is about bytes"), what I see (IMHO) is that there seem to be three different ideas in nom currently: bytes, ASCII, and some UTF8 such as your example of &str and char.

What could it possibly look like to have a new crate "nom-utf8" that is always about UTF8 characters, never about ASCII?

For example, "alphanumeric" would always mean UTF8 alphanumeric and never mean ASCII alphanumeric.

I believe this new crate could be very useful. If other people here feel similarly, and the Nom leadership would be open to considering the idea as an eventual Nom crate, then I would be open to working on creating it, testing it, etc.

mskorkowski commented 2 years ago

bytes, ASCII, and some UTF8 such as your example of &str and char.

  1. All three of them are just different interpretations of bytes. ASCII is u8. String and friends are wrappers around Vec<u8>, so it's still about bytes.
  2. What does "utf8 only" mean? There is a great chapter about indexing into a string in the Rust book which shows how you end up with three different interpretations just for utf8 (bytes, scalar values, and grapheme clusters), the simplest one being bytes; see the sketch after this list.
  3. For some languages some concepts are a bit fuzzy. To not look far: an apostrophe can be a punctuation mark, a diacritic, or a letter. If you dig into it, just this one symbol is a nightmare to interpret.
  4. What's more, Unicode changes yearly. C# had a case where you could write valid code under one Unicode version that was still valid under the next version but printed a different result. The issue was one character's class changing to whitespace.
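A sketch of the three interpretations from point 2, reading the same string three ways (grapheme counting assumes the unicode-segmentation crate):

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // 'e' + U+0301 COMBINING ACUTE ACCENT, i.e. "é" in decomposed form.
    let s = "e\u{301}";
    assert_eq!(s.len(), 3);                   // bytes
    assert_eq!(s.chars().count(), 2);         // Unicode scalar values
    assert_eq!(s.graphemes(true).count(), 1); // grapheme clusters
}
```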

If you write a proper parser which is supposed to deal seriously with utf8, you must know much more about the file format, utf8, and the accepted language of the content than whatever a trivial function can provide.


Indexing over String
Apostrophe on wiki
Category change for latin-1 in .net 5

joelparkerhenderson commented 2 years ago

Right you are. Edge cases are plentiful.

What does "utf8 only" mean?

For example, in what I'm suggesting for a "nom-utf8" kind of crate:

  1. The functions and documentation would use Unicode terminology, such as "number" to mean a UTF8 number, covering the many different sets of decimal-digit graphemes. There would be no "number" terminology that means exactly ASCII 0-9.

  2. Parser match lengths would be expressed in UTF8 character length, such as a two-byte UTF8 character having length 1, not 2. If I recall correctly, this makes string length calculations quite different and may need a full scan to calculate the correct UTF8 character length.
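This is indeed how &str behaves today: len() returns the byte length in O(1), while counting characters requires a scan:

```rust
fn main() {
    let s = "ñandú";
    assert_eq!(s.len(), 7);           // byte length: ñ and ú take two bytes each
    assert_eq!(s.chars().count(), 5); // character length, computed by an O(n) scan
}
```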

If you dig into it, just this one symbol is a nightmare to interpret.

Yes, you're exactly right; that's why I want a crate solution. For example, my current code has a lot of these symbols:

```rust
pub fn unicode_any_colon(input: &str) -> nom::IResult<&str, &str> {
    nom::branch::alt((
        nom::bytes::complete::tag(":"),        // U+003A COLON
        nom::bytes::complete::tag("\u{FF1A}"), // U+FF1A FULLWIDTH COLON
    ))(input)
}
```
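Worth noting: since both alternatives are single characters, nom's one_of combinator can express the same set more compactly, at the cost of returning the matched char rather than a &str slice:

```rust
use nom::character::complete::one_of;
use nom::IResult;

// Same character set as above, one combinator instead of an alt.
pub fn unicode_any_colon_char(input: &str) -> IResult<&str, char> {
    one_of(":\u{FF1A}")(input)
}
```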

For some languages some concepts are a bit fuzzy.

Yes. IMHO your point is a strong one for encapsulating the fuzzy areas in one crate, so each developer doesn't need to roll their own. Crate documentation of the fuzzy areas can help too.

What's more, Unicode changes yearly.

Yes, you're expressing why I really want this kind of logic in a crate. Ideally the crate could document the changes and/or have versioning that aligns with the changes.

Ideally the developer could choose which version of the representation is needed. For example: if you need Unicode version x.y.z, then code it as something like nom::configure("x.y.z"). That kind of versioning sounds like a good long-range goal, even if it can't be ready right away.

My primary need in my own code is UTF8 Spanish, i.e. accented letters. Perhaps this could be a good starting point? Meaning, create a crate that can parse current UTF8 (i.e. no Unicode version specification) for current Spanish (i.e. no other languages). Once that's ready, iterate by adding languages, character sets, symbols, etc.
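As a hedged sketch of that starting point (the function name and the accent set are my assumptions, not an agreed design):

```rust
use nom::bytes::complete::take_while1;
use nom::IResult;

// Match a run of Spanish letters, including accented vowels and ñ/ü.
fn spanish_word(input: &str) -> IResult<&str, &str> {
    take_while1(|c: char| {
        c.is_ascii_alphabetic() || "áéíóúüñÁÉÍÓÚÜÑ".contains(c)
    })(input)
}

fn main() {
    assert_eq!(spanish_word("niño pequeño"), Ok((" pequeño", "niño")));
}
```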

Thanks again for your thoughtful comments. I appreciate your help!