kursjan / petitparser2

A high-performance top-down parser
MIT License

Parsing binary data #68

Open udoschneider opened 3 years ago

udoschneider commented 3 years ago

I basically have a parser ready to parse text input. However, the source also has a binary representation with exactly the same AST structure. It seems that adding these two methods allows me to treat a ByteArray as a byte sequence and Integers as bytes:

ByteArray>>#asPParser
    ^ PP2LiteralSequenceNode on: self

Integer>>#asPParser
    ^ PP2LiteralObjectNode on: self

Thinking about it, doing something like

SequenceableCollection>>#asPParser
    ^ PP2LiteralSequenceNode on: self

would even allow parsing "numeric collections" in general ...

Is this the way to go?
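
A minimal usage sketch, assuming the two extension methods above are loaded. They are not part of PetitParser2 itself, and whether `parse:` accepts a ByteArray directly depends on the input being convertible to a PetitParser2 stream, which is assumed here rather than verified:

```smalltalk
| header byte |
"Assumes the proposed ByteArray>>#asPParser and Integer>>#asPParser extensions."
header := #[1 2 3] asPParser.    "intended to match the byte sequence 1 2 3"
byte := 16r7F asPParser.         "intended to match the single byte 16r7F"

"Sequence both parsers; isPetit2Failure distinguishes failure from success."
((header , byte) parse: #[1 2 3 16r7F]) isPetit2Failure.
(header parse: #[9 9 9]) isPetit2Failure.
```

Evaluating the last two expressions in a playground shows whether the binary input is accepted at all before any grammar is built on top of it.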

kursjan commented 3 years ago

Hi Udo,

I am not sure what your goal is. The asPParser methods you mention allow you to do:

For ByteArray: 'foobar' asPParser parse: 'foobar'

I am not sure what exactly Integer>>asPParser does. Can it be used as follows? 'a' asInteger asPParser parse: 'a'

What kind of use case would you like to add?

udoschneider commented 3 years ago

Hi Kursjan,

The general idea is to be able to parse binary data (given as a ByteArray) where each element is an Integer (byte). I can't disclose the protocol I work on (NDA), but I think WebAssembly is a good example.

E.g. the text format (p. 132) defines

For example, the textual grammar for value types is given as follows:

valtype ::= ‘i32’ ⇒ i32
          | ‘i64’ ⇒ i64
          | ‘f32’ ⇒ f32
          | ‘f64’ ⇒ f64

E.g. the binary format (p. 114) defines

For example, the binary grammar for value types is given as follows:

valtype ::= 0x7F ⇒ i32
          | 0x7E ⇒ i64
          | 0x7D ⇒ f32
          | 0x7C ⇒ f64

However, once the valtype token has been parsed, all the higher-level combination rules work exactly the same.

So the basic idea would be for PP to be able to parse binary literals by adding ByteArray>>#asPParser and Integer>>#asPParser. This would allow defining a WASMTextParser as a subclass of PP2CompositeNode with

valtype
    ^ ('i32' asPParser / 'i64' asPParser / 'f32' asPParser / 'f64' asPParser) ==> [:type | WASMValtypeNode type: type]

WASMTextParser would then define all the production rules on top of this valtype definition.

WASMBinaryParser (as a subclass of WASMTextParser) would then simply override valtype as

valtype
    ^ (16r7F asPParser / 16r7E asPParser / 16r7D asPParser / 16r7C asPParser) ==> [:type | WASMValtypeNode type: type]

However, all the higher-level production rules in the superclass would still work.
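
One detail to watch in this layout: as far as I can tell, a literal parser returns the literal it matched, so the action block would receive the string 'i32' in the text parser but the integer 16r7F in the binary parser. A sketch of one way to keep the resulting AST identical, assuming the proposed Integer>>#asPParser extension and the WASMValtypeNode class from the example above (neither exists in PetitParser2), is to map each byte to its text name first:

```smalltalk
WASMBinaryParser>>valtype
    "Sketch only: translate each opcode byte into the name the text
     parser would yield, then build the node exactly as the superclass does."
    ^ ((16r7F asPParser ==> [:byte | 'i32'])
        / (16r7E asPParser ==> [:byte | 'i64'])
        / (16r7D asPParser ==> [:byte | 'f32'])
        / (16r7C asPParser ==> [:byte | 'f64']))
            ==> [:type | WASMValtypeNode type: type]
```

The parentheses around each alternative matter, because Smalltalk evaluates binary selectors such as `/` and `==>` strictly left to right.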

So the only difference would be how literals are parsed: as strings on the one hand (as usual), and as binary on the other (what I proposed).

Does that help?

kursjan commented 3 years ago

Hi Udo,

did I get it right that WASMTextParser should already work?

valtype
    ^ ('i32' asPParser / 'i64' asPParser / 'f32' asPParser / 'f64' asPParser) ==> [:type | WASMValtypeNode type: type]

String>>asPParser is already defined and would create a LiteralSequence parser.

Your proposal of extending ByteArray and Integer with asPParser sounds pretty much OK and is aligned with the current PetitParser design. What would the extension look like? Something along these lines?

Integer>>asPParser
    ^ PP2LiteralObjectNode on: (Character value: self)