m4rw3r / chomp

A fast monadic-style parser combinator designed to work on stable Rust.
Apache License 2.0
244 stars 19 forks source link

Is there a way to get current position? #38

Closed readysetmark closed 8 years ago

readysetmark commented 8 years ago

Hi! I'm wondering if it would be possible to add a function that could provide the current position in the file (or stream)?

In my case, I'm parsing from a file and would like to capture the line number in particular.

I haven't had a chance to dig through the code much yet, but if I were to take a stab at adding it, I'd definitely appreciate a few pointers! I'm guessing it would have to bubble up from the buffer...

m4rw3r commented 8 years ago

It depends on the intended use and how your parser is structured. If you can specify more precisely what you need the line-number for (and when and in what context) it will be easier to find a decent solution.

There are some pointers to be found if you look at what people do with Attoparsec, eg. Conduit uses a wrapper which counts the number of lines (and columns) which are passed to the wrapped parser so that it can provide offsets for errors and parsed tokens. This works since Conduit is often used to parse a single token at a time with the parser and not actually the whole file in a single pass.

In the case of Attoparsec it also has some internals which can be exposed to obtain the position directly, but Chomp does actually not store any position while parsing, it is all slices. (Currently the buffers work by calculating the difference of the remainder with the original, which is not optimal at all times.)

If you want to arbitrarily obtain the offset of the parser (eg. to store start and end offset of a parsed token within a parser itself) it somehow needs to be propagated downwards. This will probably require modifications to the Input and ParseResult types to store some kind of counter from previously parsed data and also an offset into the slice currently being parsed. I am not so sure that is a good idea since it will entail a performance penalty even when you do not need the position.

Parsers could also be made to be generic over the type (ie. make one monadic type which just facilitates parsing and then another extension of this to allow the parsers to obtain the position of the parser) but I fear that the function-signatures of parsers would be a bit too complicated in that case since generics are needed (and to make it simple we would most likely need higher kinded types).

readysetmark commented 8 years ago

I'm writing a parser for a text-based file of financial transactions -- ledger files, actually. I'm looking to capture the line in the source file where a transaction begins. The file would look something like this:

2016-02-01 * Rent
  Expenses:Home:Rent        $750.00
  Assets:Chequing

2016-02-09 * SuperMart
  Expenses:Food:Groceries   $89.54
  Expenses:Pet:Supplies     $58.51
  Liabilities:CreditCard    $-148.06

This would end up breaking down into two transactions, one starting on line 1, and another starting on line 5.

The line numbers end up being helpful for directing the user to any transactions that fail validation (are out of balance, for example, like the second transaction here). I certainly don't need the line numbers, but wasn't sure if this was going to be trivial to do or somewhat complex. Sounds like it's not trivial, though!

m4rw3r commented 8 years ago

If it is possible to create a main parser where you parse each transaction by itself, something like this:

fn whitespace(i: Input<u8>) -> U8Result<&[u8]> { take_while1(i, is_horizontal_space) }
fn parse_transaction(i: Input<u8>) -> U8Result<Transaction> {
    parse!{i;
        let date = parse_date() <* whitespace() <* token(b'*');
        // ...
        let items = many(parse_transaction_line);
        ret Transaction { date, items }
    }
}
fn parse_transaction_line(i: Input<u8>) -> U8Result<...> {
    parse!{i;
        whitespace();  // to make sure we get an indented line, fail on nonindent
        // ...
    }
}

and then apply that one repeatedly you could use the trick used by Conduit and use a wrapper. This wrapper can count how many line-endings are present in the slices fed to the parser (on success). That way you can associate a starting and ending line (or byte-offset if that is desired) with the Transaction structs.

This would probably be a good addition to Chomp itself, so that a parser yielding ParseResult<T, E> could be wrapped to yield ParseResult<(T, Position), (E, Position)> or something similar.

m4rw3r commented 8 years ago

Something like this could work for your use case: https://gist.github.com/m4rw3r/1f43559dcd73bf46e845

I am not too sure that is the right direction, have been thinking of making the Input type generic over a position carrying element internally (so that the normal chomp::Input is only generic over the input type and does not carry position) but a specific chomp::InputPosition<LineNumber> could be constructed or something similar. This would require the same to be true for the ParseResult and to make them generic is will probably be very cumbersome to use in a generic way (as the position-aware parsers will not be composable inside of a non position-aware parser).

dashed commented 8 years ago

By coincidence, I'm shopping around for rust parser libraries to interpret ledger-like files and am running into the same issue.


@m4rw3r thanks for composing that gist, I've yet to grok it; the wrapper strategy may be the way to go for now.

dashed commented 8 years ago

There seems to be inconsistent behaviour in line counting from the provided gist. Just in case anyone needs it, modified it as follows for my needs: https://gist.github.com/dashed/9d18b7e4cc351a7feabc89897a58baff

m4rw3r commented 8 years ago

Got some nice trait-based generic counting going in the feature/input_trait branch:

use chomp::types::{Input, ParseResult};
use chomp::types::numbering::{InputPosition, LineNumber, Numbering};
use chomp::combinators::many;
use chomp::parsers::{Error, any, take_while1, string};

// Let's count some lines
let i = InputPosition::new(&b"test a\ntest b\n\ntest c\n"[..], LineNumber::new());

// We could use a concrete type P too if we want to restrict to a
// certain position-aware implementation
fn parser<I: Input<Token=u8>, P: Numbering<Token=u8>>(i: InputPosition<I, P>)
  -> ParseResult<InputPosition<I, P>, (char, P), Error<u8>> {
    parse!{i;
                     string(b"test");
                     take_while1(|c| c == b' ' || c == b'\t');
        let t_name = any();
        i -> {
            // Obtain current parser position
            let p = i.position();

            i.ret((t_name as char, p))
            // We skip the take while below, because we want to determine the line right
            // after reading the t_name
        } <* take_while1(|c| c == b'\n');
    }
}

let r = run_parser(i, |i| many(i, parser));

assert_eq!(r, Ok(vec![('a', LineNumber(0)),
                      ('b', LineNumber(1)),
                      // Note the two linebreaks in a row
                      ('c', LineNumber(3))]));

I need to look at the ergonomics of using this, since the syntax is pretty heavy, especially when using generic position-aware code. Unless position-information is required by the specific code inside of the parser then it should not be required to annotate the functions with all these types, a plain Input generic should suffice for most cases.

Edit: See https://github.com/m4rw3r/chomp/blob/feature/input_trait/src/types/numbering.rs

m4rw3r commented 8 years ago

Version 0.3.0 is out with support for creating parsers which are position-aware: http://m4rw3r.github.io/chomp/chomp/types/numbering/index.html

m4rw3r commented 8 years ago

Closing this since the feature is in, just needs more examples and probably a line+column counter.