rust-bakery / nom

Rust parser combinator framework
MIT License

Allow types to be both stream and complete parsers #1535

Open · epage opened this issue 2 years ago

epage commented 2 years ago

Right now, nom has distinct types for complete and streaming parsers.

Problems with this

What if a type could implement both types of parsing?

epage commented 2 years ago

combine has a single trait with two distinct parse methods. This doesn't work well with automatically implementing the parse trait for functions, because you don't know which method will be invoked.
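A reduced sketch of that shape (hypothetical names, not combine's actual API) shows the problem: a blanket impl of the trait for plain functions has to give both methods the same body, so a function-based parser can't distinguish the two modes.

```rust
// Hypothetical single trait with two parse methods (illustrative names,
// not combine's real API).
pub trait Parse<I> {
    type Output;
    type Error;
    fn parse_complete(&mut self, input: I) -> Result<(I, Self::Output), Self::Error>;
    fn parse_streaming(&mut self, input: I) -> Result<(I, Self::Output), Self::Error>;
}

// A blanket impl for plain functions can only supply one behavior for
// both methods: the function has no way to know which mode it was
// written for.
impl<I, O, E, F> Parse<I> for F
where
    F: FnMut(I) -> Result<(I, O), E>,
{
    type Output = O;
    type Error = E;
    fn parse_complete(&mut self, input: I) -> Result<(I, O), E> {
        self(input)
    }
    fn parse_streaming(&mut self, input: I) -> Result<(I, O), E> {
        self(input) // same body as parse_complete
    }
}

// A tiny function-based parser for demonstration.
fn first(input: &str) -> Result<(&str, char), ()> {
    let mut chars = input.chars();
    match chars.next() {
        Some(c) => Ok((chars.as_str(), c)),
        None => Err(()),
    }
}

fn main() {
    let mut p = first;
    // Both entry points run the exact same code.
    assert_eq!(p.parse_complete("ab"), Ok(("b", 'a')));
    assert_eq!(p.parse_streaming("ab"), Ok(("b", 'a')));
    println!("ok");
}
```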

epage commented 2 years ago

What if we had

This would

We'd need to_completed and to_streamed conversion functions on what is currently the Finish trait to allow mixing parsers.

Stargateur commented 2 years ago

The following is an idea I had while writing my own parser. It may not be possible in nom, or it might require a lot of changes, but I don't see why not share ideas.

The idea was to not have separate streaming and complete parsers like nom does. I wanted to remove nom's duplicate-code problem, but I also wanted a way to write "streaming" parsers like nom's. Currently nom has a major problem: you can't "resume" from where you stopped. I saw two solutions:

For example:

use std::cell::RefCell;
use std::ops::{Range, RangeFrom};
use std::rc::Rc;

struct Buf<Reader, const N: usize> {
  buf: Vec<u8>,
  reader: Reader,
}

#[derive(Clone)]
pub(crate) enum Position {
  RangeFrom(RangeFrom<usize>),
  Range(Range<usize>),
}

#[derive(Clone)]
pub struct ReaderStream<Reader, const N: usize> {
  buf: Rc<RefCell<Buf<Reader, N>>>,
  position: Position,
}

ReaderStream shares the buffer through an Rc (an Arc version could exist too, I guess); a ReaderStream is easily clonable and has a position that represents a range into the buffer.

Buf is responsible for growing the buffer when needed (an implementation detail).

This design is quite MUTABLE, but that's hidden from the user, and it has a lot of advantages: you don't resume when you need more data, you just read more from the Reader. The Reader itself could be implemented to allow "sort of" async code; for example, with tokio you could spawn a blocking thread for the parsing that waits for data read by an async task. This is not perfect, but compared to restarting the parse from the beginning it's quite nice. It also removes the need for duplicate code and removes one branch from a parser's possible outcomes.
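A self-contained sketch of that idea (names and methods illustrative, not from any crate): when a handle runs out of buffered bytes it pulls more from the Reader through the shared, internally mutable buffer, instead of returning Incomplete and restarting later. Clones share the buffer but keep their own position.

```rust
use std::cell::RefCell;
use std::io::{Cursor, Read};
use std::rc::Rc;

// Shared, growable buffer over a Reader (implementation detail).
struct Buf<R> {
    buf: Vec<u8>,
    reader: R,
}

impl<R: Read> Buf<R> {
    // Grow the buffer by reading up to `n` more bytes from the reader;
    // returns how many bytes were actually read (0 at end of input).
    fn fill(&mut self, n: usize) -> std::io::Result<usize> {
        let old_len = self.buf.len();
        self.buf.resize(old_len + n, 0);
        let read = self.reader.read(&mut self.buf[old_len..])?;
        self.buf.truncate(old_len + read);
        Ok(read)
    }
}

#[derive(Clone)]
struct ReaderStream<R> {
    buf: Rc<RefCell<Buf<R>>>,
    start: usize, // this handle's position into the shared buffer
}

impl<R: Read> ReaderStream<R> {
    fn new(reader: R) -> Self {
        let buf = Buf { buf: Vec::new(), reader };
        ReaderStream { buf: Rc::new(RefCell::new(buf)), start: 0 }
    }

    // Yield the next byte, transparently reading from the Reader when
    // the shared buffer is exhausted.
    fn next_byte(&mut self) -> std::io::Result<Option<u8>> {
        let mut shared = self.buf.borrow_mut();
        if self.start >= shared.buf.len() && shared.fill(4096)? == 0 {
            return Ok(None); // true end of input
        }
        let byte = shared.buf.get(self.start).copied();
        if byte.is_some() {
            self.start += 1;
        }
        Ok(byte)
    }
}

fn main() -> std::io::Result<()> {
    let mut stream = ReaderStream::new(Cursor::new(vec![1u8, 2, 3]));
    assert_eq!(stream.next_byte()?, Some(1));
    // A clone shares the buffer but keeps its own position.
    let mut fork = stream.clone();
    assert_eq!(stream.next_byte()?, Some(2));
    assert_eq!(fork.next_byte()?, Some(2));
    assert_eq!(stream.next_byte()?, Some(3));
    assert_eq!(stream.next_byte()?, None);
    println!("ok");
    Ok(())
}
```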

There is another problem, present in nom too: this buffer could become very large. That's why I'm also thinking about a trait. While implementations of this trait must be safe, it could allow runtime bugs if used wrongly; the purpose is simple: call the method once you have "finished" with the previous data. I still haven't worked much on it:

pub trait Consume {
  /// Flush already-parsed data from the buffer; any later use of
  /// previously consumed data becomes an error.
  fn consume(self) -> Self;
}

For now it's basic; I haven't tried this yet. The point would be to flush the buffer, making any use of previously consumed data an error. Because of that, I don't think many parsers could make use of such a trait.

epage commented 1 year ago

I've been playing with the Parser / StreamParser idea.

Currently, almost all of the parsers are either FnMut(I) -> IResult<I, O, E> or return an impl of that function type, rather than being types that implement Parser<I, O, E>. This ties them to one specific parser trait. For this to work, we'll need to switch the parsers provided by nom to return a type that impls Parser, like Parser::map does. The parsers themselves will need no knowledge of the Parser trait; instead, the returned type will impl the Parser trait.

To put this in more concrete terms, we need to turn

pub fn tag<I, O, C, E: ParseError<(I, usize)>>(
  pattern: O,
  count: C,
) -> impl Fn((I, usize)) -> IResult<(I, usize), O, E>
where
  I: Slice<RangeFrom<usize>> + InputIter<Item = u8> + InputLength + Clone,
  C: ToUsize,
  O: From<u8> + AddAssign + Shl<usize, Output = O> + Shr<usize, Output = O> + PartialEq,
{
  let count = count.to_usize();
  move |input: (I, usize)| {
    let inp = input.clone();

    take(count)(input).and_then(|(i, o)| {
      if pattern == o {
        Ok((i, o))
      } else {
        Err(Err::Error(error_position!(inp, ErrorKind::TagBits)))
      }
    })
  }
}

into

pub fn tag<O, C>(pattern: O, count: C) -> Tag<O, C> {
  Tag { pattern, count }
}

/// Implementation of [`tag`]
pub struct Tag<O, C> {
  pattern: O,
  count: C,
}

impl<I, O, C, E> Parser<(I, usize), O, E> for Tag<O, C>
where
  I: Slice<RangeFrom<usize>> + InputIter<Item = u8> + InputLength + Clone,
  C: ToUsize,
  O: From<u8> + AddAssign + Shl<usize, Output = O> + Shr<usize, Output = O> + PartialEq,
  E: ParseError<(I, usize)>,
{
  fn parse(&mut self, input: (I, usize)) -> IResult<(I, usize), O, E> {
    let count = self.count.to_usize();
    let inp = input.clone();

    take(count).parse(input).and_then(|(i, o)| {
      if self.pattern == o {
        Ok((i, o))
      } else {
        Err(Err::Error(error_position!(inp, ErrorKind::TagBits)))
      }
    })
  }
}

With this change, Tag is free to also impl StreamParser, and you have a single tag that works for both complete and streaming parsing.

epage commented 1 year ago

The compiler is not always happy with the above conversion

Take bits:

pub fn bits<I, O, E1, E2, P>(mut parser: P) -> impl FnMut(I) -> IResult<I, O, E2>
where
  E1: ParseError<(I, usize)> + ErrorConvert<E2>,
  E2: ParseError<I>,
  I: Slice<RangeFrom<usize>>,
  P: FnMut((I, usize)) -> IResult<(I, usize), O, E1>,
{
  move |input: I| match parser((input, 0)) {
    Ok(((rest, offset), result)) => {
      // If the next byte has been partially read, it will be sliced away as well.
      // The parser functions might already slice away all fully read bytes.
      // That's why `offset / 8` isn't necessarily needed at all times.
      let remaining_bytes_index = offset / 8 + if offset % 8 == 0 { 0 } else { 1 };
      Ok((rest.slice(remaining_bytes_index..), result))
    }
    Err(Err::Incomplete(n)) => Err(Err::Incomplete(n.map(|u| u.get() / 8 + 1))),
    Err(Err::Error(e)) => Err(Err::Error(e.convert())),
    Err(Err::Failure(e)) => Err(Err::Failure(e.convert())),
  }
}

When I transform this as described above, you get:

pub fn bits<P>(parser: P) -> Bits<P> {
  Bits { parser }
}

/// Implementation of [`bits`]
pub struct Bits<P> {
  parser: P,
}

impl<I, O, E1, E2, P> Parser<I, O, E2> for Bits<P>
where
  E1: ParseError<(I, usize)> + ErrorConvert<E2>,
  E2: ParseError<I>,
  I: Slice<RangeFrom<usize>>,
  P: Parser<(I, usize), O, E1>,
{
  fn parse(&mut self, input: I) -> IResult<I, O, E2> {
    match self.parser.parse((input, 0)) {
      Ok(((rest, offset), result)) => {
        // If the next byte has been partially read, it will be sliced away as well.
        // The parser functions might already slice away all fully read bytes.
        // That's why `offset / 8` isn't necessarily needed at all times.
        let remaining_bytes_index = offset / 8 + if offset % 8 == 0 { 0 } else { 1 };
        Ok((rest.slice(remaining_bytes_index..), result))
      }
      Err(Err::Incomplete(n)) => Err(Err::Incomplete(n.map(|u| u.get() / 8 + 1))),
      Err(Err::Error(e)) => Err(Err::Error(e.convert())),
      Err(Err::Failure(e)) => Err(Err::Failure(e.convert())),
    }
  }
}

but this fails because of

$ cargo check
    Checking nom v7.1.1 (/home/epage/src/personal/nom-experimental)
error[E0207]: the type parameter `E1` is not constrained by the impl trait, self type, or predicates
  --> src/bits/mod.rs:50:12
   |
50 | impl<I, O, E1, E2, P> Parser<I, O, E2> for Bits<P>
   |            ^^ unconstrained type parameter

For more information about this error, try `rustc --explain E0207`.
error: could not compile `nom` due to previous error

We cannot constrain E1 through the P constraint.

Normally, this would be handled by adding PhantomData to Bits, but the only way I see to do that is by making Bits aware of the Parser trait, which defeats our goal of keeping it independent of the trait so that multiple *Parser traits can be implemented.

Running rustc --explain E0207, we get this:

Any type parameter of an `impl` must meet at least one of
the following criteria:

 - it appears in the _implementing type_ of the impl, e.g. `impl<T> Foo<T>`
 - for a trait impl, it appears in the _implemented trait_, e.g.
   `impl<T> SomeTrait<T> for Foo`
 - it is bound as an associated type, e.g. `impl<T, U> SomeTrait for T
   where T: AnotherTrait<AssocType=U>`

While Parser<_, _, E1> doesn't seem to be sufficient, it sounds like Parser<I = _, O = _, E = E1> might be. So I guess the next step is to test the feasibility of switching the generic parameters of Parser to associated types. I believe I at least needs to remain generic.
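A reduced illustration (with a hypothetical trait, not nom's actual one) of why associated types sidestep E0207: the inner parser's error type is reached as `P::Error` rather than being a free impl parameter, so nothing is left unconstrained.

```rust
// Hypothetical Parser-like trait using associated types
// (illustrative only; not nom's real trait).
pub trait Parse<I> {
    type Output;
    type Error;
    fn parse(&mut self, input: I) -> Result<(I, Self::Output), Self::Error>;
}

// A wrapper that converts the inner parser's error, the way `bits`
// converts E1 into E2. With plain generic parameters this impl would hit
// E0207; here the inner error type is named through `P::Error`, so there
// is no unconstrained parameter.
pub struct MapErr<P, F> {
    parser: P,
    f: F,
}

impl<I, P, F, E2> Parse<I> for MapErr<P, F>
where
    P: Parse<I>,
    F: FnMut(P::Error) -> E2,
{
    type Output = P::Output;
    type Error = E2;
    fn parse(&mut self, input: I) -> Result<(I, Self::Output), E2> {
        self.parser.parse(input).map_err(&mut self.f)
    }
}

// A tiny concrete parser for demonstration.
struct AnyChar;
impl<'a> Parse<&'a str> for AnyChar {
    type Output = char;
    type Error = ();
    fn parse(&mut self, input: &'a str) -> Result<(&'a str, char), ()> {
        let mut chars = input.chars();
        match chars.next() {
            Some(c) => Ok((chars.as_str(), c)),
            None => Err(()),
        }
    }
}

fn main() {
    let mut p = MapErr { parser: AnyChar, f: |_| "unexpected end of input" };
    assert_eq!(p.parse("ab"), Ok(("b", 'a')));
    assert_eq!(p.parse(""), Err("unexpected end of input"));
    println!("ok");
}
```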

epage commented 1 year ago

Switching to associated types didn't cause any problems: epage/nom-experimental#5

Switching tag (the working case) didn't cause too many problems: epage/nom-experimental#6

Still fighting bits, the problematic case for which I switched to associated types

epage commented 1 year ago

Thinking back through the latest problems I'm having and the first one, I think I'm going to capture O and E as phantom data, and I don't think I'll need associated types anymore.
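Roughly, that could look like this (a sketch with a stand-in trait, not nom's real one): carrying O and E1 as PhantomData puts them in the implementing type, which satisfies E0207 while keeping the struct itself ignorant of any particular parser trait.

```rust
use std::marker::PhantomData;

// Stand-in for a Parser trait with generic parameters (illustrative).
pub trait Parser<I, O, E> {
    fn parse(&mut self, input: I) -> Result<(I, O), E>;
}

// `Bits` captures O and E1 as phantom parameters. They now appear in the
// implementing type `Bits<P, O, E1>`, so the impl below has no
// unconstrained parameter, and `Bits` needs no knowledge of the trait.
pub struct Bits<P, O, E1> {
    parser: P,
    _output: PhantomData<O>,
    _error: PhantomData<E1>,
}

pub fn bits<P, O, E1>(parser: P) -> Bits<P, O, E1> {
    Bits { parser, _output: PhantomData, _error: PhantomData }
}

impl<I, O, E1, E2, P> Parser<I, O, E2> for Bits<P, O, E1>
where
    P: Parser<(I, usize), O, E1>,
    E2: From<E1>,
{
    fn parse(&mut self, input: I) -> Result<(I, O), E2> {
        // Simplified: real `bits` also slices off fully-consumed bytes.
        match self.parser.parse((input, 0)) {
            Ok(((rest, _offset), output)) => Ok((rest, output)),
            Err(e) => Err(E2::from(e)),
        }
    }
}

// A trivial inner parser for demonstration: take the first byte.
struct FirstByte;
impl<'a> Parser<(&'a [u8], usize), u8, ()> for FirstByte {
    fn parse(&mut self, (input, _offset): (&'a [u8], usize)) -> Result<((&'a [u8], usize), u8), ()> {
        match input.split_first() {
            Some((byte, rest)) => Ok(((rest, 0), *byte)),
            None => Err(()),
        }
    }
}

fn main() {
    let mut p = bits::<_, u8, ()>(FirstByte);
    let result: Result<(&[u8], u8), ()> = p.parse(&[1u8, 2][..]);
    assert_eq!(result, Ok((&[2u8][..], 1u8)));
    println!("ok");
}
```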

epage commented 1 year ago

Got it past the constraint issue: https://github.com/epage/nom-experimental/pull/6

epage commented 1 year ago

So I've gotten further with epage/nom-experimental#6 and am looking at going in a different direction.

As a recap, my goals are

With epage/nom-experimental#6, I'm trying to solve that by forking the Parser trait into Parser and StreamParser. All of the existing "parsers" are switching from directly implementing Parser (or returning an opaque type that does) to returning a hand-implemented type that will implement both Parser and StreamParser. The trait you call into decides which approach is taken, simplifying the hierarchy, catching mix-ups at compile time, and allowing &str to implement both styles of parser.

Problems I've run into include:

✅ The goals are being met.

Some things I think I like (not fully decided on all of them)

Overall, I think I should step back and evaluate this compared to what I think makes nom great

In considering alternative designs, something I keep coming back to is that complete vs streaming is an attribute of the input. For example, if we could annotate the input with an enum Stream<Input> { Complete(Input), Incomplete(Input) }, then we could react accordingly.

As an alternative, what if we had:

pub trait Stream {
    fn is_complete() -> bool;
}

pub struct Complete<Input>(Input);

// TODO: Also implement all relevant input traits
impl<Input> Stream for Complete<Input> {
    #[inline]
    fn is_complete() -> bool {
        true
    }
}

pub struct Incomplete<Input>(Input);

// TODO: Also implement all relevant input traits
impl<Input> Stream for Incomplete<Input> {
    #[inline]
    fn is_complete() -> bool {
        false
    }
}

// TODO: Also implement all relevant input traits
impl<Token> Stream for &[Token] {
    #[inline]
    fn is_complete() -> bool {
        false
    }
}

Now, each parser that can be either complete or streaming will add an additional constraint (Input: Stream) and switch on Input::is_complete(). I am assuming that is_complete will be inlined and the compiler will optimize this so it is as if the code were only complete or only streaming, with no runtime overhead.
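A self-contained sketch of this dispatch (hypothetical names, simplified far below nom's real input traits): a take-like parser that reports a hard error on short complete input but asks for the missing bytes on short streaming input.

```rust
// Sketch of dispatching on the input type (illustrative names).
pub trait Stream {
    fn is_complete() -> bool;
    fn bytes(&self) -> &[u8];
}

pub struct Complete<'a>(pub &'a [u8]);
pub struct Incomplete<'a>(pub &'a [u8]);

impl<'a> Stream for Complete<'a> {
    #[inline]
    fn is_complete() -> bool { true }
    fn bytes(&self) -> &[u8] { self.0 }
}

impl<'a> Stream for Incomplete<'a> {
    #[inline]
    fn is_complete() -> bool { false }
    fn bytes(&self) -> &[u8] { self.0 }
}

#[derive(Debug, PartialEq)]
pub enum ParseErr {
    Error,
    NeedMore(usize),
}

// A take-like parser: on short input, complete mode reports a hard error
// while streaming mode asks for the missing bytes. After inlining, the
// branch on `I::is_complete()` is expected to constant-fold away.
pub fn take<I: Stream>(count: usize, input: &I) -> Result<Vec<u8>, ParseErr> {
    let bytes = input.bytes();
    if bytes.len() < count {
        if I::is_complete() {
            Err(ParseErr::Error)
        } else {
            Err(ParseErr::NeedMore(count - bytes.len()))
        }
    } else {
        Ok(bytes[..count].to_vec())
    }
}

fn main() {
    assert_eq!(take(4, &Complete(b"ab")), Err(ParseErr::Error));
    assert_eq!(take(4, &Incomplete(b"ab")), Err(ParseErr::NeedMore(2)));
    assert_eq!(take(2, &Complete(b"abc")), Ok(b"ab".to_vec()));
    println!("ok");
}
```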

Evaluating this compared to what makes nom great:

Compared to problems with epage/nom-experimental#6

Compared to my goals

I think this strikes a balance worth exploring, and I'll take a crack at it next.

epage commented 1 year ago

epage/nom-experimental#28 seems to be working well! I think it can be used to resolve this Issue and unblock other work (like implementing Parser for types like &str)

epage commented 1 year ago

btw, since I did my write-up, I ended up going in the direction of making the InputIsStreaming trait take a const bool parameter. This allows some extra error checking (e.g. finish only working with complete parsers) and has potential for some more interesting cases.

The most obvious downside is that it requires an extra trait parameter.
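Roughly (a sketch with hypothetical names, not the actual nom-experimental code): a marker trait with a const bool parameter lets finish-style APIs demand complete input at compile time.

```rust
// Sketch of a marker trait with a const bool parameter
// (hypothetical names).
pub trait InputIsStreaming<const STREAMING: bool> {}

// A wrapper marking an input as streaming.
pub struct Streaming<I>(pub I);

impl<'a> InputIsStreaming<false> for &'a str {}
impl<I> InputIsStreaming<true> for Streaming<I> {}

// `finish` only accepts complete input; passing `Streaming<&str>` here
// would be rejected at compile time, which is the extra error checking
// mentioned above.
pub fn finish<I: InputIsStreaming<false>>(input: I) -> I {
    input
}

fn main() {
    assert_eq!(finish("abc"), "abc");
    // finish(Streaming("abc")); // would fail to compile
    println!("ok");
}
```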

When implementing Parser for char, etc., I also could only support complete parsing. I think the non-const approach would allow char to be treated as both a complete and a streaming parser.