m4rw3r / chomp

A fast monadic-style parser combinator designed to work on stable Rust.
Apache License 2.0
244 stars 19 forks source link

More comprehensive examples #47

Closed josephsavona closed 8 years ago

josephsavona commented 8 years ago

I've looked at a bunch of libraries for parsing in Rust and chomp's API feels most intuitive based on small examples. However, I've found myself struggling to make even a simple parser work in practice. It's pretty easy to mess up a macro and have the compiler complain, for example, and I'm still not sure how to parse something like "1.0" into an f64 efficiently and correctly.

Having a more complete example would be super helpful to people approaching the library. Any format with a reasonably wide variety of data types (strings, ints, floats) would be great - maybe JSON?

I'll submit a PR for something like this if/when I make enough progress.

m4rw3r commented 8 years ago

The f64 and f32 are somewhat easy to parse provided you do not care about adhering to the IEEE standard, but then you will end up with values which can be wildly inaccurate due to how floating-point arithmetic works. There are some pretty complex algorithms which will help with selecting the closest possible IEEE representation of the parsed float (rust uses them internally for parsing floats, see libcore::num::dec2flt, exposed in FromStr implementations on f32 and f64).

To produce an accurate float-parser these algorithms need to be re-implemented since rust does not expose this interface and requires a str. Or one could implement it as parsing the number into a Vec<u8>, then convert this to a String which is then parsed using the FromStr implementation which uses the dec2flt module. The latter will require an allocation on each attempt to parse, sadly, if we wish to be generic over Input types.

Here is an example of the inaccurate version of an f32 parser, written with the impl Trait version due to it being slightly easier to read since there is no Input to keep track of (and it doesn't handle exponents either):

pub fn float<I: Input<Token=u8>>() -> impl Parser<I, Output=f32, Error=Error<u8>> {
    option(satisfy(|c| c == b'-' || c == b'+').map(|s| s == b'+'), false).bind(move |negative|
        decimal().bind(move |integer: u64|
            option(token(b'.').then(take_while1(is_digit)).map(move |fraction_buf: I::Buffer| {
                let (frac, pow): (f32, i32) = fraction_buf.fold((0.0, 0), |(f, p), c| (f * 10.0 + (c - b'0') as f32, p + 1));

                (integer as f32) + frac * (10.0f32).powi(-pow)
            }), integer as f32)).map(move |f| if negative { -f } else { f }))
}

About the macro, any specific pain points? I could try to provide more thorough troubleshooting help in the rustdocs for the parse! macro (sadly rust macro-errors can be pretty cryptic, especially when tt-munching).

As for more complete examples, I plan on implementing some, but I do not know what yet. It will probably be a TOML or JSON parser.

m4rw3r commented 8 years ago

@josephsavona I have now added a basic JSON-parser here: examples/json.rs.

It is not yet complete, missing string-escape parsing (and currently just does some unsafe and assumes correct UTF-8 for strings). The float-parsing also needs to be "upstreamed" to the ascii module of Chomp once some cleanup has been made, it is not very elegant since it parses a float-formatted string and then feeds Rust's FromStr implementation for f64 (which also in this case necessitates an allocation since we cannot rely on the Input implementation being a slice, if specialization is present we can probably specialize for this case though which will avoid an allocation).

josephsavona commented 8 years ago

@m4rw3r awesome! The JSON example is really helpful, and the use of FromStr seems reasonable given that the allocation cost is only paid when a float is encountered (?). Thanks for the follow up and example!

m4rw3r commented 8 years ago

@josephsavona An update on float-parsing, Rust's built-in FromStr is slow for inputs with many significant digits, completely dominating the Chomp-part of parsing a float (ie. make sure it matches the float-format, allocate a temporary Vec<u8> for it):

test float_f32                 ... bench:      30,907 ns/iter (+/- 6,295)
test float_f64                 ... bench:       1,038 ns/iter (+/- 93)
test float_no_conversion       ... bench:          81 ns/iter (+/- 8)
test float_small_f32           ... bench:          59 ns/iter (+/- 20)
test float_small_f64           ... bench:          56 ns/iter (+/- 4)
test float_small_no_conversion ... bench:          28 ns/iter (+/- 5)
test from_str_f32              ... bench:      30,717 ns/iter (+/- 2,787)
test from_str_f64              ... bench:         939 ns/iter (+/- 65)
test from_str_small_f32        ... bench:          26 ns/iter (+/- 4)
test from_str_small_f64        ... bench:          27 ns/iter (+/- 6)
test match_float               ... bench:          59 ns/iter (+/- 7)

A few notes:

This is all if you want to have precise representations of floats, if that is not a concern then you can use the parser I wrote above (or a variant of it) where it will just sum the digits one by one :)