uiri / SEGIMAP

IMAP (and LMTP) Server written in Rust
MIT License
32 stars 6 forks source link

Replace PEG with nom for parsing #18

Closed indiv0 closed 7 years ago

indiv0 commented 7 years ago

This PR replaces the existing PEG parser grammar with a parser written using the nom parser combinator framework. The PR builds upon #17, and should not be merged prior to #17.

In terms of performance, for the limited portion of the parser ported so far, nom appears to be somewhat faster slower (3-5%):

    #[bench]
    fn bench_nom_sequence_set(b: &mut Bencher) {
        b.iter(|| super::sequence_set(b"1,1,*,3,4:8,4:5,4:*,*:3"));
    }

    #[bench]
    fn bench_peg_sequence_set(b: &mut Bencher) {
        b.iter(|| super::grammar::sequence_set("1,1,*,3,4:8,4:5,4:*,*:3"));
    }
test parser::tests::bench_nom_sequence_set ... bench:       4,660 ns/iter (+/- 50)
test parser::tests::bench_peg_sequence_set ... bench:       4,530 ns/iter (+/- 40)

Note that the nom parsing code has not been optimized at all, and may yield significant performance improvements upon optimization.

NOTE: this PR is currently a work-in-progress, and only part of the parser has been ported. DO NOT MERGE. EDIT: I had the benchmark function names mixed up. Updated.


This change is Reviewable

indiv0 commented 7 years ago

PR should now be ready for review.

Module layout

The latest commits strip the PEG parser and any redundant test cases out, and also re-arrange the layout of the parser module. The individual grammar rules are located in the parser::grammar module, which is intended to be a private module.

In the parser::grammar module there are currently two submodules, fetch and sequence. These provide the grammar rules for parsing FETCH commands and the sequence-set rule from RFC3501. The reason these modules are separate is to ensure a logical separation between different groups of grammar rules (e.g. the FETCH rules will never need to be used by other grammar rules, so they can be isolated), keeping only the "base" grammar rules (e.g. string and number parsing) in the parser::grammar module itself.

The reason the grammar rules are located in the parser::grammar submodule is so that the parser module itself can provide a stable API to the grammar rules for when the module gets extracted into a separate crate. This will allow us to modify rules internally (or even change parser frameworks) without having to modify the public API. Currently this just consists of a wrapper function for the FETCH parser, which doesn't even forward error cases correctly yet. This will be fixed in later revisions with a proper error type and additional APIs for other rules as they get added.

Benchmark

For future reference, prior to deleting the PEG parser I compared the performance of the PEG FETCH command parser and its initial nom equivalent:

test parser::tests::bench_fetch_nom ... bench:       2,955 ns/iter (+/- 146)
test parser::tests::bench_fetch_peg ... bench:      10,549 ns/iter (+/- 181)

The nom parser completes in ~30% the time of the PEG parser.

uiri commented 7 years ago

I'm not sure that I understand all of the macro soup that is the nom parsing but most of it is straightforward enough that the hairier ones look like they should be fine.

I'm thinking that we should resolve the TODOs before merging. I've noted them to make it easy for you to find and for me to track in reviewable

A couple nits in command::fetch since the code there might end up being reverted back anyways.

Everything else looks good.


Reviewed 10 of 10 files at r2. Review status: all files reviewed at latest revision, 8 unresolved discussions.


src/command/fetch.rs, line 14 at r2 (raw file):

/// Take the rest of the arguments provided by the client and parse them into a
/// Command object with command::fetch.
pub fn fetch(args: Split<char>) -> Result<Command, ()> {

I take it () should be reversed back to parser::ParseError ?


src/command/fetch.rs, line 21 at r2 (raw file):

    }

    parser::fetch(cmd.as_bytes())

Is there a reason to change the external API instead of just grabbing the bytestream inside of the fetch function?


src/parser/mod.rs, line 10 at r2 (raw file):

    match self::grammar::fetch(input) {
        Done(_, v) => Ok(v),
        // TODO: handle the error and incomplete cases properly (possibly via a

Error handling TODO


src/parser/grammar/fetch.rs, line 180 at r2 (raw file):

);

// TODO: confirm this only needs to work for ASCII

According to the RFC, it MUST be ASCII. I think the way it is right now is fine?


src/parser/grammar/fetch.rs, line 192 at r2 (raw file):

// a non-matching byte. This is useful for streaming parsers which may be
// awaiting further input.
// TODO: decide if this should be streaming-compatible. If not, add a

Streaming(?) TODO


src/parser/grammar/mod.rs, line 13 at r2 (raw file):


fn is_astring_char(chr: u8) -> bool {
    // TODO: perhaps take `char` instead, avoiding this cast?

char / u8 TODO


src/parser/grammar/sequence.rs, line 16 at r2 (raw file):

        ({
            let mut seq = vec![a];
            // TODO: implement this with iterator combinators instead.

idk what iterator combinators are but you left a TODO to resolve here.


src/parser/grammar/sequence.rs, line 69 at r2 (raw file):

        assert_eq!(sequence_set(b"*"), Done(&b""[..], vec![Wildcard]));
        assert_eq!(sequence_set(b"1"), Done(&b""[..], vec![Number(1)]));
        // TODO: is this handled correctly by the parser?

Invalid sequences TODO


Comments from Reviewable

indiv0 commented 7 years ago

Review status: all files reviewed at latest revision, 8 unresolved discussions.


src/command/fetch.rs, line 21 at r2 (raw file):

Previously, uiri (Will Pearson) wrote…
Is there a reason to change the *external* API instead of just grabbing the bytestream inside of the fetch function?

I might be misunderstanding you here, but the external API (i.e., parser::fetch) should accept a bytestream instead of a string because we have no guarantee that the FETCH command being parsed is UTF-8 compatible.


src/parser/grammar/fetch.rs, line 180 at r2 (raw file):

Previously, uiri (Will Pearson) wrote…
According to the RFC, it MUST be ASCII. I think the way it is right now is fine?

Alright. I'll remove this TODO.


src/parser/grammar/fetch.rs, line 192 at r2 (raw file):

Previously, uiri (Will Pearson) wrote…
Streaming(?) TODO

Since we've decided to not provide a streaming-compatible API for now, I'll just remove these TODOs (and all other streaming-related ones).


Comments from Reviewable

uiri commented 7 years ago
:lgtm:

Reviewed 6 of 6 files at r3. Review status: 10 of 11 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

uiri commented 7 years ago

Reviewed 1 of 1 files at r4. Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable