rust-bakery / nom

Rust parser combinator framework
MIT License

Allow types to be both stream and complete parsers #1535

Open · epage opened this issue 2 years ago

epage commented 2 years ago

Right now, nom has distinct types for complete and streaming parsers.

Problems with this

What if a type could implement both types of parsing?

epage commented 2 years ago

combine has a single trait with two distinct parse methods. This doesn't work well with automatically implementing the parse trait for functions, because you don't know which method will be invoked.
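A reduced sketch of that shape (hypothetical names, not combine's actual API) shows the problem: a blanket impl of the trait for plain functions has to give both methods the same body, so a function-based parser can't distinguish the two modes.

```rust
// Hypothetical single trait with two parse methods (illustrative names,
// not combine's real API).
pub trait Parse<I> {
    type Output;
    type Error;
    fn parse_complete(&mut self, input: I) -> Result<(I, Self::Output), Self::Error>;
    fn parse_streaming(&mut self, input: I) -> Result<(I, Self::Output), Self::Error>;
}

// A blanket impl for plain functions can only supply one behavior for
// both methods: the function has no way to know which mode it was
// written for.
impl<I, O, E, F> Parse<I> for F
where
    F: FnMut(I) -> Result<(I, O), E>,
{
    type Output = O;
    type Error = E;
    fn parse_complete(&mut self, input: I) -> Result<(I, O), E> {
        self(input)
    }
    fn parse_streaming(&mut self, input: I) -> Result<(I, O), E> {
        self(input) // same body as parse_complete
    }
}

// A tiny function-based parser for demonstration.
fn first(input: &str) -> Result<(&str, char), ()> {
    let mut chars = input.chars();
    match chars.next() {
        Some(c) => Ok((chars.as_str(), c)),
        None => Err(()),
    }
}

fn main() {
    let mut p = first;
    // Both entry points run the exact same code.
    assert_eq!(p.parse_complete("ab"), Ok(("b", 'a')));
    assert_eq!(p.parse_streaming("ab"), Ok(("b", 'a')));
    println!("ok");
}
```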

epage commented 2 years ago

What if we had

This would

We'd need to_completed and to_streamed conversion functions on what is currently the Finish trait to allow mixing parsers.

Stargateur commented 2 years ago

The following is an idea I had while writing my own parser. It may not be possible in nom, or it might require a lot of changes, but I don't see why not share ideas.

The idea was to not have separate streaming and complete parsers like nom does. I wanted to remove nom's duplicate-code problem, but I also wanted a way to write "streaming" parsers like nom's. Currently nom has a major problem: you can't "resume" from where you stopped. I saw two solutions:

For example:

use std::cell::RefCell;
use std::ops::{Range, RangeFrom};
use std::rc::Rc;

struct Buf<Reader, const N: usize> {
  buf: Vec<u8>,
  reader: Reader,
}

#[derive(Clone)]
pub(crate) enum Position {
  RangeFrom(RangeFrom<usize>),
  Range(Range<usize>),
}

#[derive(Clone)]
pub struct ReaderStream<Reader, const N: usize> {
  buf: Rc<RefCell<Buf<Reader, N>>>,
  position: Position,
}

ReaderStream shares the buffer through an Rc (an Arc version could exist too, I guess); a ReaderStream is easily clonable and has a position that represents a range into the buffer.

Buf is responsible for growing the buffer when needed (an implementation detail).

This design is quite MUTABLE, but that's hidden from the user, and it has a lot of advantages: you don't resume when you need more data, you just read more from the Reader. The Reader itself could be implemented to allow "sort of" async code; for example, with tokio you could spawn a blocking thread for the parsing that waits for data read by an async task. This is not perfect, but compared to restarting the parse from the beginning it's quite nice. It also removes the need for duplicate code and removes one branch from a parser's possible outcomes.
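A self-contained sketch of that idea (names and methods illustrative, not from any crate): when a handle runs out of buffered bytes it pulls more from the Reader through the shared, internally mutable buffer, instead of returning Incomplete and restarting later. Clones share the buffer but keep their own position.

```rust
use std::cell::RefCell;
use std::io::{Cursor, Read};
use std::rc::Rc;

// Shared, growable buffer over a Reader (implementation detail).
struct Buf<R> {
    buf: Vec<u8>,
    reader: R,
}

impl<R: Read> Buf<R> {
    // Grow the buffer by reading up to `n` more bytes from the reader;
    // returns how many bytes were actually read (0 at end of input).
    fn fill(&mut self, n: usize) -> std::io::Result<usize> {
        let old_len = self.buf.len();
        self.buf.resize(old_len + n, 0);
        let read = self.reader.read(&mut self.buf[old_len..])?;
        self.buf.truncate(old_len + read);
        Ok(read)
    }
}

#[derive(Clone)]
struct ReaderStream<R> {
    buf: Rc<RefCell<Buf<R>>>,
    start: usize, // this handle's position into the shared buffer
}

impl<R: Read> ReaderStream<R> {
    fn new(reader: R) -> Self {
        let buf = Buf { buf: Vec::new(), reader };
        ReaderStream { buf: Rc::new(RefCell::new(buf)), start: 0 }
    }

    // Yield the next byte, transparently reading from the Reader when
    // the shared buffer is exhausted.
    fn next_byte(&mut self) -> std::io::Result<Option<u8>> {
        let mut shared = self.buf.borrow_mut();
        if self.start >= shared.buf.len() && shared.fill(4096)? == 0 {
            return Ok(None); // true end of input
        }
        let byte = shared.buf.get(self.start).copied();
        if byte.is_some() {
            self.start += 1;
        }
        Ok(byte)
    }
}

fn main() -> std::io::Result<()> {
    let mut stream = ReaderStream::new(Cursor::new(vec![1u8, 2, 3]));
    assert_eq!(stream.next_byte()?, Some(1));
    // A clone shares the buffer but keeps its own position.
    let mut fork = stream.clone();
    assert_eq!(stream.next_byte()?, Some(2));
    assert_eq!(fork.next_byte()?, Some(2));
    assert_eq!(stream.next_byte()?, Some(3));
    assert_eq!(stream.next_byte()?, None);
    println!("ok");
    Ok(())
}
```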

There is another problem, present in nom too: this buffer could become very large. That's why I'm also thinking about a trait. While implementations of this trait must be safe, it could allow runtime bugs if used wrongly; the purpose is simple: call the method once you have "finished" with the previous data. I still haven't worked much on it:

pub trait Consume {
  /// Flush already-parsed data from the buffer; any later use of
  /// previously consumed data becomes an error.
  fn consume(self) -> Self;
}

For now it's basic; I haven't tried this yet. The point would be to flush the buffer, making any use of previously consumed data an error. Because of that, I don't think many parsers could make use of such a trait.

epage commented 1 year ago

I've been playing with the Parser / StreamParser idea.

Currently, almost all of the parsers are either FnMut(I) -> IResult<I, O, E> or return an impl of that function type, rather than being types that implement Parser<I, O, E>. This ties them to one specific parser trait. For this to work, we'll need to switch the parsers provided by nom to return a type that impls Parser, like Parser::map does. The parsers themselves will need no knowledge of the Parser trait; instead, the returned type will impl the Parser trait.

To put this in more concrete terms, we need to turn

pub fn tag<I, O, C, E: ParseError<(I, usize)>>(
  pattern: O,
  count: C,
) -> impl Fn((I, usize)) -> IResult<(I, usize), O, E>
where
  I: Slice<RangeFrom<usize>> + InputIter<Item = u8> + InputLength + Clone,
  C: ToUsize,
  O: From<u8> + AddAssign + Shl<usize, Output = O> + Shr<usize, Output = O> + PartialEq,
{
  let count = count.to_usize();
  move |input: (I, usize)| {
    let inp = input.clone();

    take(count)(input).and_then(|(i, o)| {
      if pattern == o {
        Ok((i, o))
      } else {
        Err(Err::Error(error_position!(inp, ErrorKind::TagBits)))
      }
    })
  }
}

into

pub fn tag<O, C>(pattern: O, count: C) -> Tag<O, C> {
  Tag { pattern, count }
}

/// Implementation of [`tag`]
pub struct Tag<O, C> {
  pattern: O,
  count: C,
}

impl<I, O, C, E> Parser<(I, usize), O, E> for Tag<O, C>
where
  I: Slice<RangeFrom<usize>> + InputIter<Item = u8> + InputLength + Clone,
  C: ToUsize,
  O: From<u8> + AddAssign + Shl<usize, Output = O> + Shr<usize, Output = O> + PartialEq,
  E: ParseError<(I, usize)>,
{
  fn parse(&mut self, input: (I, usize)) -> IResult<(I, usize), O, E> {
    let count = self.count.to_usize();
    let inp = input.clone();

    take(count).parse(input).and_then(|(i, o)| {
      if self.pattern == o {
        Ok((i, o))
      } else {
        Err(Err::Error(error_position!(inp, ErrorKind::TagBits)))
      }
    })
  }
}

With this change, Tag is free to also impl StreamParser, and you have a single tag that works for both complete and streaming parsing.

epage commented 1 year ago

The compiler is not always happy with the above conversion

Take bits:

pub fn bits<I, O, E1, E2, P>(mut parser: P) -> impl FnMut(I) -> IResult<I, O, E2>
where
  E1: ParseError<(I, usize)> + ErrorConvert<E2>,
  E2: ParseError<I>,
  I: Slice<RangeFrom<usize>>,
  P: FnMut((I, usize)) -> IResult<(I, usize), O, E1>,
{
  move |input: I| match parser((input, 0)) {
    Ok(((rest, offset), result)) => {
      // If the next byte has been partially read, it will be sliced away as well.
      // The parser functions might already slice away all fully read bytes.
      // That's why `offset / 8` isn't necessarily needed at all times.
      let remaining_bytes_index = offset / 8 + if offset % 8 == 0 { 0 } else { 1 };
      Ok((rest.slice(remaining_bytes_index..), result))
    }
    Err(Err::Incomplete(n)) => Err(Err::Incomplete(n.map(|u| u.get() / 8 + 1))),
    Err(Err::Error(e)) => Err(Err::Error(e.convert())),
    Err(Err::Failure(e)) => Err(Err::Failure(e.convert())),
  }
}

When I transform this as described above, you get:

pub fn bits<P>(parser: P) -> Bits<P> {
  Bits { parser }
}

/// Implementation of [`bits`]
pub struct Bits<P> {
  parser: P,
}

impl<I, O, E1, E2, P> Parser<I, O, E2> for Bits<P>
where
  E1: ParseError<(I, usize)> + ErrorConvert<E2>,
  E2: ParseError<I>,
  I: Slice<RangeFrom<usize>>,
  P: Parser<(I, usize), O, E1>,
{
  fn parse(&mut self, input: I) -> IResult<I, O, E2> {
    match self.parser.parse((input, 0)) {
      Ok(((rest, offset), result)) => {
        // If the next byte has been partially read, it will be sliced away as well.
        // The parser functions might already slice away all fully read bytes.
        // That's why `offset / 8` isn't necessarily needed at all times.
        let remaining_bytes_index = offset / 8 + if offset % 8 == 0 { 0 } else { 1 };
        Ok((rest.slice(remaining_bytes_index..), result))
      }
      Err(Err::Incomplete(n)) => Err(Err::Incomplete(n.map(|u| u.get() / 8 + 1))),
      Err(Err::Error(e)) => Err(Err::Error(e.convert())),
      Err(Err::Failure(e)) => Err(Err::Failure(e.convert())),
    }
  }
}

but this fails because of

$ cargo check
    Checking nom v7.1.1 (/home/epage/src/personal/nom-experimental)
error[E0207]: the type parameter `E1` is not constrained by the impl trait, self type, or predicates
  --> src/bits/mod.rs:50:12
   |
50 | impl<I, O, E1, E2, P> Parser<I, O, E2> for Bits<P>
   |            ^^ unconstrained type parameter

For more information about this error, try `rustc --explain E0207`.
error: could not compile `nom` due to previous error

We cannot constrain E1 through the P constraint.

Normally, this would be handled by adding PhantomData to Bits, but the only way I see to do that is by making Bits aware of the Parser trait, which defeats our goal of keeping it independent of the trait so that multiple *Parser traits can be implemented.

Running rustc --explain E0207, we get this:

Any type parameter of an `impl` must meet at least one of
the following criteria:

 - it appears in the _implementing type_ of the impl, e.g. `impl<T> Foo<T>`
 - for a trait impl, it appears in the _implemented trait_, e.g.
   `impl<T> SomeTrait<T> for Foo`
 - it is bound as an associated type, e.g. `impl<T, U> SomeTrait for T
   where T: AnotherTrait<AssocType=U>`

While Parser<_, _, E1> doesn't seem to be sufficient, it sounds like Parser<I = _, O = _, E = E1> might be. So I guess the next step is to test the feasibility of switching the generic parameters of Parser to associated types. I believe I at least needs to remain generic.
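A reduced illustration (with a hypothetical trait, not nom's actual one) of why associated types sidestep E0207: the inner parser's error type is reached as `P::Error` rather than being a free impl parameter, so nothing is left unconstrained.

```rust
// Hypothetical Parser-like trait using associated types
// (illustrative only; not nom's real trait).
pub trait Parse<I> {
    type Output;
    type Error;
    fn parse(&mut self, input: I) -> Result<(I, Self::Output), Self::Error>;
}

// A wrapper that converts the inner parser's error, the way `bits`
// converts E1 into E2. With plain generic parameters this impl would hit
// E0207; here the inner error type is named through `P::Error`, so there
// is no unconstrained parameter.
pub struct MapErr<P, F> {
    parser: P,
    f: F,
}

impl<I, P, F, E2> Parse<I> for MapErr<P, F>
where
    P: Parse<I>,
    F: FnMut(P::Error) -> E2,
{
    type Output = P::Output;
    type Error = E2;
    fn parse(&mut self, input: I) -> Result<(I, Self::Output), E2> {
        self.parser.parse(input).map_err(&mut self.f)
    }
}

// A tiny concrete parser for demonstration.
struct AnyChar;
impl<'a> Parse<&'a str> for AnyChar {
    type Output = char;
    type Error = ();
    fn parse(&mut self, input: &'a str) -> Result<(&'a str, char), ()> {
        let mut chars = input.chars();
        match chars.next() {
            Some(c) => Ok((chars.as_str(), c)),
            None => Err(()),
        }
    }
}

fn main() {
    let mut p = MapErr { parser: AnyChar, f: |_| "unexpected end of input" };
    assert_eq!(p.parse("ab"), Ok(("b", 'a')));
    assert_eq!(p.parse(""), Err("unexpected end of input"));
    println!("ok");
}
```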

epage commented 1 year ago

Switching to associated types didn't cause any problems: epage/nom-experimental#5

Switching tag (the working case) didn't cause too many problems: epage/nom-experimental#6

Still fighting bits, the problematic case for which I switched to associated types

epage commented 1 year ago

Thinking back through the latest problems I'm having and the first one, I think I'm going to capture O and E as phantom data, and I don't think I'll need associated types anymore.
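Roughly, that could look like this (a sketch with a stand-in trait, not nom's real one): carrying O and E1 as PhantomData puts them in the implementing type, which satisfies E0207 while keeping the struct itself ignorant of any particular parser trait.

```rust
use std::marker::PhantomData;

// Stand-in for a Parser trait with generic parameters (illustrative).
pub trait Parser<I, O, E> {
    fn parse(&mut self, input: I) -> Result<(I, O), E>;
}

// `Bits` captures O and E1 as phantom parameters. They now appear in the
// implementing type `Bits<P, O, E1>`, so the impl below has no
// unconstrained parameter, and `Bits` needs no knowledge of the trait.
pub struct Bits<P, O, E1> {
    parser: P,
    _output: PhantomData<O>,
    _error: PhantomData<E1>,
}

pub fn bits<P, O, E1>(parser: P) -> Bits<P, O, E1> {
    Bits { parser, _output: PhantomData, _error: PhantomData }
}

impl<I, O, E1, E2, P> Parser<I, O, E2> for Bits<P, O, E1>
where
    P: Parser<(I, usize), O, E1>,
    E2: From<E1>,
{
    fn parse(&mut self, input: I) -> Result<(I, O), E2> {
        // Simplified: real `bits` also slices off fully-consumed bytes.
        match self.parser.parse((input, 0)) {
            Ok(((rest, _offset), output)) => Ok((rest, output)),
            Err(e) => Err(E2::from(e)),
        }
    }
}

// A trivial inner parser for demonstration: take the first byte.
struct FirstByte;
impl<'a> Parser<(&'a [u8], usize), u8, ()> for FirstByte {
    fn parse(&mut self, (input, _offset): (&'a [u8], usize)) -> Result<((&'a [u8], usize), u8), ()> {
        match input.split_first() {
            Some((byte, rest)) => Ok(((rest, 0), *byte)),
            None => Err(()),
        }
    }
}

fn main() {
    let mut p = bits::<_, u8, ()>(FirstByte);
    let result: Result<(&[u8], u8), ()> = p.parse(&[1u8, 2][..]);
    assert_eq!(result, Ok((&[2u8][..], 1u8)));
    println!("ok");
}
```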

epage commented 1 year ago

Got it past the constraint issue: https://github.com/epage/nom-experimental/pull/6

epage commented 1 year ago

So I've gotten further with epage/nom-experimental#6 and am looking at going in a different direction.

As a recap, my goals are

With epage/nom-experimental#6, I'm trying to solve that by forking the Parser trait into Parser and StreamParser. All of the existing "parsers" are switching from directly implementing Parser (or returning an opaque type that does) to returning a hand-implemented type that will implement both Parser and StreamParser. The trait you call into decides which approach is taken, simplifying the hierarchy, catching mix-ups at compile time, and allowing &str to implement both styles of parser.

Problems I've run into include:

✅ The goals are being met.

Some things I think I like (not fully decided on all of them)

Overall, I think I should step back and evaluate this compared to what I think makes nom great

In considering alternative designs, something I keep coming back to is that complete vs streaming is an attribute of the input. For example, if we could annotate the input with an enum Stream<Input> { Complete(Input), Incomplete(Input) }, then we could react accordingly.

As an alternative, what if we had:

pub trait Stream {
    fn is_complete() -> bool;
}

pub struct Complete<Input>(Input);

// TODO: Also implement all relevant input traits
impl<Input> Stream for Complete<Input> {
    #[inline]
    fn is_complete() -> bool {
        true
    }
}

pub struct Incomplete<Input>(Input);

// TODO: Also implement all relevant input traits
impl<Input> Stream for Incomplete<Input> {
    #[inline]
    fn is_complete() -> bool {
        false
    }
}

// TODO: Also implement all relevant input traits
impl<Token> Stream for &[Token] {
    #[inline]
    fn is_complete() -> bool {
        false
    }
}

Now, each parser that can be either complete or streaming will add an additional constraint (Input: Stream) and switch on Input::is_complete(). I am assuming that is_complete will be inlined and the compiler will optimize this so it is as if the code were only complete or only streaming, with no runtime overhead.
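A self-contained sketch of this dispatch (hypothetical names, simplified far below nom's real input traits): a take-like parser that reports a hard error on short complete input but asks for the missing bytes on short streaming input.

```rust
// Sketch of dispatching on the input type (illustrative names).
pub trait Stream {
    fn is_complete() -> bool;
    fn bytes(&self) -> &[u8];
}

pub struct Complete<'a>(pub &'a [u8]);
pub struct Incomplete<'a>(pub &'a [u8]);

impl<'a> Stream for Complete<'a> {
    #[inline]
    fn is_complete() -> bool { true }
    fn bytes(&self) -> &[u8] { self.0 }
}

impl<'a> Stream for Incomplete<'a> {
    #[inline]
    fn is_complete() -> bool { false }
    fn bytes(&self) -> &[u8] { self.0 }
}

#[derive(Debug, PartialEq)]
pub enum ParseErr {
    Error,
    NeedMore(usize),
}

// A take-like parser: on short input, complete mode reports a hard error
// while streaming mode asks for the missing bytes. After inlining, the
// branch on `I::is_complete()` is expected to constant-fold away.
pub fn take<I: Stream>(count: usize, input: &I) -> Result<Vec<u8>, ParseErr> {
    let bytes = input.bytes();
    if bytes.len() < count {
        if I::is_complete() {
            Err(ParseErr::Error)
        } else {
            Err(ParseErr::NeedMore(count - bytes.len()))
        }
    } else {
        Ok(bytes[..count].to_vec())
    }
}

fn main() {
    assert_eq!(take(4, &Complete(b"ab")), Err(ParseErr::Error));
    assert_eq!(take(4, &Incomplete(b"ab")), Err(ParseErr::NeedMore(2)));
    assert_eq!(take(2, &Complete(b"abc")), Ok(b"ab".to_vec()));
    println!("ok");
}
```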

Evaluating this compared to what makes nom great:

Compared to problems with epage/nom-experimental#6

Compared to my goals

I think this strikes a balance worth exploring, and I'll take a crack at it next.

epage commented 1 year ago

epage/nom-experimental#28 seems to be working well! I think it can be used to resolve this Issue and unblock other work (like implementing Parser for types like &str)

epage commented 1 year ago

btw, since I did my write-up, I ended up going in the direction of making the InputIsStreaming trait take a const bool parameter. This allows some extra error checking (e.g. finish only working with complete parsers) and has potential for some more interesting cases.

The most obvious downside is that it requires an extra trait parameter.
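Roughly (a sketch with hypothetical names, not the actual nom-experimental code): a marker trait with a const bool parameter lets finish-style APIs demand complete input at compile time.

```rust
// Sketch of a marker trait with a const bool parameter
// (hypothetical names).
pub trait InputIsStreaming<const STREAMING: bool> {}

// A wrapper marking an input as streaming.
pub struct Streaming<I>(pub I);

impl<'a> InputIsStreaming<false> for &'a str {}
impl<I> InputIsStreaming<true> for Streaming<I> {}

// `finish` only accepts complete input; passing `Streaming<&str>` here
// would be rejected at compile time, which is the extra error checking
// mentioned above.
pub fn finish<I: InputIsStreaming<false>>(input: I) -> I {
    input
}

fn main() {
    assert_eq!(finish("abc"), "abc");
    // finish(Streaming("abc")); // would fail to compile
    println!("ok");
}
```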

When implementing Parser for char, etc., I also could only support complete parsing. I think the non-const approach would allow char to be treated as both a complete and a streaming parser.