kach / nearley

πŸ“œπŸ”œπŸŒ² Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License
3.57k stars 232 forks source link

Parsing RLP, varint, and other variable-length encoded chunks #545

Open mr-word opened 3 years ago

mr-word commented 3 years ago

I'm writing two parsers for a simple forth-like, one for the 'assembly' and the other for the bytecode. Some ops have a variable-length argument in-code, encoded as varints or RLP.

Here is what a generic op that has an RLP-encoded argument looks like in the assembler: RLP_OP [1, 2, [10, 20, 30]] NOP NOP NOP This is straightforward to parse.

The RLP encoding of [1,2,[10,20,30]] is c60102c30a141e. Suppose RLP_OP encodes as 07 and NOP as 00. The javascript RLP lib gives the helpful option to parse only a prefix of a given bytestring if it is valid RLP. In other words we could use rlp.decode(encoded, skipRemainderCheck=true) to extract c60102c30a141e from c60102c30a141e00000000.

So basically the challenge is to parse 07c60102c30a141e000000 into the same AST as RLP_OP [1,2,[10,20,30]] NOP NOP NOP.

I see nearley's custom token matcher functionality. How could I combine this with rlp.decode(encoded, skipRemainderCheck=true) to tokenize RLP-encoded chunks? To be clear, I just need the lexer to give me an RLP chunk, I don't need to parse the recursive lists themselves into the AST (this can be done in a later pass).

The same technique could be used for varints and whatever else.

mr-word commented 3 years ago

Since writing this issue I'm understanding it better, I think it's quite simple, I'm going to try something like this and report back,

isRLP = (str) => {
   try {
       rlp.decode(str, skipRemainderCheck=false)
       return true
   } catch(..) {
     return false
   }
}
rlpMatcher = (x) => isRLP(x)

In particular I don't use skipRemainderCheck.