rust-bakery / nom

Rust parser combinator framework
MIT License

Is there a good way? #1494

Open jellybobbin opened 2 years ago

jellybobbin commented 2 years ago
enum Tokens<'a> {
    Words(&'a str),
    Spaces(usize),
    Return,
    NewLine,
}

let input = "\"H\\u{65}llo \\u{20} rust\\n\"";

I want this result:

vec![
    Tokens::Words("Hello"),
    Tokens::Spaces(3),
    Tokens::Words("rust"),
    Tokens::NewLine
]

Is this feasible? Here is my simplified code:

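use nom::{
    branch::alt,
    bytes::complete::take_while_m_n,
    character::complete::{alpha1, char},
    combinator::{map, map_opt, map_res, value},
    error::{FromExternalError, ParseError},
    multi::many1,
    sequence::{delimited, preceded},
    IResult,
};

pub fn parse_token(input: &str) -> IResult<&str, Vec<Tokens>>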
{
    many1(get_token)(input)
}

fn get_token(input: &str) -> IResult<&str, Tokens>
{
    alt((
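        // Problem: alpha1 stops at the first escape, so this only returns Tokens::Words("H").
        map(alpha1, Tokens::Words),
        // Problem: parse_escaped_char returns a single char, not a `&str`, so it cannot
        // become part of the surrounding word. (Tokens::CJKString is not declared in the enum above.)
        map(parse_escaped_char, Tokens::CJKString),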
    ))(input)
}

pub fn parse_escaped_char<'a, E>(input: &'a str) -> IResult<&'a str, char, E>
where
  E: ParseError<&'a str> + FromExternalError<&'a str, std::num::ParseIntError>,
{
    preceded(
        char('\\'),
        alt((
            parse_unicode,
            value('\n', char('n')),
        )),
    )(input)
}

fn parse_unicode<'a, E>(input: &'a str) -> IResult<&'a str, char, E>
where
  E: ParseError<&'a str> + FromExternalError<&'a str, std::num::ParseIntError>,
{
    let parse_hex = take_while_m_n(1, 6, |c: char| c.is_ascii_hexdigit());

    let parse_delimited_hex = preceded(
        char('u'),
        delimited(char('{'), parse_hex, char('}')),
    );

    let parse_u32 = map_res(parse_delimited_hex, move |hex| u32::from_str_radix(hex, 16));

    map_opt(parse_u32, |value| std::char::from_u32(value))(input)
}

I'll close it as soon as possible, thx!!!

Xiretza commented 2 years ago

I think the main problem is that your Tokens::Words contains a &str, which means it references a direct slice of the input. That's not what you want, though: you want to apply transformations to the input (unescaping Unicode escapes), so you'll have to copy the data into a String.
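
To illustrate the point, here is a minimal std-only sketch (no nom involved). The Fragment enum and the build_word helper are hypothetical names, not part of the code above. The idea is that the unescaped text does not exist as one contiguous slice of the input, so the word has to be assembled into an owned String from borrowed slices and decoded characters.

```rust
/// One piece of a word. Hypothetical type, used only for illustration.
enum Fragment<'a> {
    /// A run of ordinary characters, borrowed directly from the input.
    Literal(&'a str),
    /// A character decoded from an escape such as \u{65}.
    Escaped(char),
}

/// Concatenates the fragments into a single owned String.
fn build_word(fragments: &[Fragment<'_>]) -> String {
    let mut word = String::new();
    for fragment in fragments {
        match fragment {
            Fragment::Literal(text) => word.push_str(text),
            Fragment::Escaped(c) => word.push(*c),
        }
    }
    word
}

fn main() {
    // Raw string: the backslash is a literal character of the input.
    let input = r"H\u{65}llo";

    // "Hello" is not a substring of the input, so no &str slice can equal it.
    assert!(!input.contains("Hello"));

    let fragments = [
        Fragment::Literal(&input[0..1]), // "H"
        Fragment::Escaped(char::from_u32(0x65).expect("valid scalar value")),
        Fragment::Literal(&input[7..]), // "llo"
    ];
    let word = build_word(&fragments);
    assert_eq!(word, "Hello");
    println!("{word}");
}
```

In a nom parser the same accumulation could probably be done while parsing, for example by folding fragments into a String, but that part is an assumption and not shown here.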

jellybobbin commented 2 years ago

@Xiretza

Even if I don't use &str, the parser still returns a char. When the input contains Unicode escapes, the pieces do not end up as one continuous string.

Given let input = "\"H\\u{65}llo \\u{20} rust\\n\""; I want to get:

vec![
    Tokens::Words(String::from("Hello")),
    Tokens::Spaces(3),
    Tokens::Words(String::from("rust")),
    Tokens::NewLine
]

but not:

vec![
    Tokens::Words(String::from("H")),
    Tokens::Words(String::from("e")),
    Tokens::Words(String::from("llo")),
    Tokens::Spaces(1),
    Tokens::Spaces(1),
    Tokens::Spaces(1),
    Tokens::Words(String::from("rust")),
    Tokens::NewLine
]

knarkzel commented 2 years ago

You can do post-parsing transformations on it, for instance
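
A minimal sketch of what such a post-parsing pass could look like, assuming the parser emits one owned token per fragment (as in the unwanted output above). Token and merge_adjacent are hypothetical names used for illustration, and the code uses only the standard library: adjacent Words are concatenated and adjacent Spaces are summed.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Words(String),
    Spaces(usize),
    NewLine,
}

/// Merges runs of adjacent Words (by concatenation) and adjacent Spaces (by summing).
fn merge_adjacent(tokens: Vec<Token>) -> Vec<Token> {
    let mut merged: Vec<Token> = Vec::with_capacity(tokens.len());
    for token in tokens {
        match token {
            Token::Words(next) => {
                if let Some(Token::Words(prev)) = merged.last_mut() {
                    // Previous token is also a word: extend it in place.
                    prev.push_str(&next);
                } else {
                    merged.push(Token::Words(next));
                }
            }
            Token::Spaces(next) => {
                if let Some(Token::Spaces(prev)) = merged.last_mut() {
                    // Previous token is also whitespace: add the counts.
                    *prev += next;
                } else {
                    merged.push(Token::Spaces(next));
                }
            }
            // Any other token is kept as-is and ends the current run.
            other => merged.push(other),
        }
    }
    merged
}

fn main() {
    let fragments = vec![
        Token::Words("H".into()),
        Token::Words("e".into()),
        Token::Words("llo".into()),
        Token::Spaces(1),
        Token::Spaces(1),
        Token::Spaces(1),
        Token::Words("rust".into()),
        Token::NewLine,
    ];
    let merged = merge_adjacent(fragments);
    assert_eq!(
        merged,
        vec![
            Token::Words("Hello".into()),
            Token::Spaces(3),
            Token::Words("rust".into()),
            Token::NewLine,
        ]
    );
    println!("{:?}", merged);
}
```

The pass runs once over the token list after parsing, so the parser itself can stay simple and keep returning one token per fragment.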