kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License
3.6k stars 232 forks source link

Using Nearley to parse fixed lengthed hex-string #511

Closed siowzhenhan closed 4 years ago

siowzhenhan commented 4 years ago

Hi,

I fairly new to Nearley, and I am writing a parser to split up a hex-string into separate parts by defining the fixed lengths of each part. Seemingly easy question but I've been stuck here.

Consider a full hex-string: "010e002dfb0c061006c800000000270f030400030000040402ebff390506017417ac2206"

I would like to parse it into 4 non-terminals, with the below length restrictions: Part1 -> (first 32 characters) Part2 -> (next 12 characters) Part3 -> (next 12 characters) Part4 -> (last 16 characters)

So that my full string can be split like so: Part1 -> [010e002dfb0c061006c800000000270f] Part2 -> [030400030000] Part3 -> [040402ebff39] Part4 -> [0506017417ac2206]

I am aware that Nearley only supports regex-like syntax for charsets, as well as EBNF for quantifiers; so I cannot use something like Part1 -> [a-fA-F0-9]{32} to get my desired parse results.

I have tried using the above regex in a lexer to tokenise based on the specified lengths: let lexer = moo.compile({ Part1: /[a-fA-F0-9]{32}/, Part2: /[a-fA-F0-9]{12}/, Part3: /[a-fA-F0-9]{12}/, Part4: /[a-fA-F0-9]{16}/, }); However, I get the same repeated tokens, i.e. Part1 followed by another Part1, as it just tokenises every set-number of hexadecimals. Struggling to regex effectively as the hex-string doesn't have a 'pattern' other than the pre-defined lengths.

Is this something I can do within the grammar/lexer itself?

Can anyone please shed some light on this? Thanks in advance.

dabbott commented 4 years ago

It's a bit low-tech, but howabout:

HexString -> Part1 Part2 Part3 Part4
Part1 -> C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C {% d => d.join('') %}
Part2 -> C C C C C C C C C C C C {% d => d.join('') %}
Part3 -> C C C C C C C C C C C C {% d => d.join('') %}
Part4 -> C C C C C C C C C C C C C C C C {% d => d.join('') %}
C -> [a-fA-F0-9] {% id %}

Screen Shot 2020-05-20 at 7 17 38 PM

It's slow I'm sure, but seems to work! I think if you want to do it in Moo, you'll need to transition to a different state after parsing each part. Parsing it as a single string in the grammar and splitting it into 4 parts in a postprocessor function could also work. I'm new to Nearley also, so there might be a better way!

siowzhenhan commented 4 years ago

Thanks for the response @dabbott !

Managed to get it working! And it's the only way I could think of using Nearley too. The only worry with this method is maintainability/readability as the grammar grows more complex - hence I wondered if this could be written in a cleaner way with regex somehow. Thanks for the tips on Moo as well!

I've also found some binary-parsers that may handle hex parsing in a more concise manner - something like https://github.com/keichi/binary-parser could be used in combination with Nearley as well!

tjvr commented 4 years ago

You can do actually do this with a single RegExp by using capturing groups:

const regexp = /([0-9a-fA-F]{32})([0-9a-fA-F]{12})([0-9a-fA-F]{12})([0-9a-fA-F]{16})/
const source = "010e002dfb0c061006c800000000270f030400030000040402ebff390506017417ac2206"
const match = regexp.exec(source)
match[1] // "010e002dfb0c061006c800000000270f"
match[2] // "030400030000"
match[3] // "040402ebff39"
match[4] // "0506017417ac2206"

(Moo doesn't support capturing groups, but it's not clear from your question whether this is part of a larger Nearley grammar.)