Parsing RLP, varint, and other variable-length encoded chunks

I'm writing two parsers for a simple forth-like, one for the 'assembly' and the other for the bytecode. Some ops have a variable-length argument in-code, encoded as varints or RLP.

Here is what a generic op that has an RLP-encoded argument looks like in the assembler: RLP_OP [1, 2, [10, 20, 30]] NOP NOP NOP This is straightforward to parse.

The RLP encoding of [1,2,[10,20,30]] is c60102c30a141e. Suppose RLP_OP encodes as 07 and NOP as 00. The javascript RLP lib gives the helpful option to parse only a prefix of a given bytestring if it is valid RLP. In other words we could use rlp.decode(encoded, skipRemainderCheck=true) to extract c60102c30a141e from c60102c30a141e00000000.

So basically the challenge is to parse 07c60102c30a141e000000 into the same AST as RLP_OP [1,2,[10,20,30]] NOP NOP NOP.

I see nearley's custom token matcher functionality. How could I combine this with rlp.decode(encoded, skipRemainderCheck=true) to tokenize RLP-encoded chunks? To be clear, I just need the lexer to give me an RLP chunk, I don't need to parse the recursive lists themselves into the AST (this can be done in a later pass).

The same technique could be used for varints and whatever else.

kach / nearley

Parsing RLP, varint, and other variable-length encoded chunks #545