kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Regular expressions matching #283

Open KOLANICH opened 6 years ago

KOLANICH commented 6 years ago

I propose implementing an operator that matches against a common subset of regular expressions (the ECMAScript flavor) supported by every programming language targeted by KS, either through the standard library (JS, C++, Python, PHP) or through a separate well-known library (PCRE for C). The operator should accept an expression returning a string and return an array of strings. The first argument is the regex, the second is the flags, and the third is the string expression to match.

The result is an indexable object that gives access to the results (groups) by index. _is_success allows checking whether the match was successful. _length (or whatever we have for that) gives the number of match results. to_str (or whatever we have for that) gives the whole regex match. Usage of any groups if there was no success

The look and feel is the following:

instances:
  a:
    value: "10:a, b:test ololo"
  parsed:
    value: "regExp('((\w+)\:([ab]|test)(?:, )?)+', 'ig', a)"
    # ["10", "a", "b", "test"]
  is_succesfuly_parsed:
    value: parsed._is_success # true
  whole:
    value: parsed.to_str #10:a, b:test
  length:
    value: parsed._length #4
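
For comparison, here is a minimal Python sketch of how the values in the comments above could be produced. It is an illustration only, not an implementation of the proposed operator: `RegexResult` and `reg_exp` are hypothetical names. It also assumes the repeated outer group is meant to be applied item by item. Most regex engines, including Python's `re`, keep only the last capture of a repeated group, so a single match of the original pattern would not yield all four strings. The `g` flag is assumed to mean "find all matches" and is always on here.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegexResult:
    """Hypothetical result object mirroring _is_success, _length and to_str."""
    groups: List[str] = field(default_factory=list)
    whole: str = ""
    match_count: int = 0

    @property
    def is_success(self) -> bool:
        # Successful if the pattern matched at least once.
        return self.match_count > 0

    def __len__(self) -> int:
        return len(self.groups)

    def __getitem__(self, index: int) -> str:
        return self.groups[index]


def reg_exp(item_pattern: str, flags: str, text: str) -> RegexResult:
    """Apply item_pattern repeatedly and collect every captured group."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    result = RegexResult()
    for m in re.finditer(item_pattern, text, re_flags):
        result.groups.extend(m.groups())  # captures of this item
        result.whole += m.group(0)        # text covered by this item
        result.match_count += 1
    return result


# The inner part of the pattern from the example, without the outer "(...)+".
parsed = reg_exp(r"(\w+):([ab]|test)(?:, )?", "ig", "10:a, b:test ololo")
print(parsed.groups, parsed.is_success, parsed.whole, len(parsed))
```

How a failed match should behave when a group is accessed is left open in this sketch: indexing an empty result simply raises Python's `IndexError`.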

GreyCat commented 6 years ago

Generally, regular expressions are a pretty non-trivial subject, especially if we're discussing the integration of the different regexp flavors available in different languages.

Can you provide any examples where this could be useful, i.e. for parsing purposes?

KOLANICH commented 6 years ago

There is the .mod format, a tracker music format. It contains a 4-byte string that stores an identifier. Some of the identifiers contain digits giving the number of channels. The number of sequences also varies depending on the identifier: for example, M.K. means 31 sequences instead of 15. So we need to match against them with a series of regexes, or alternatively create a huge and ugly expression of slices and conversions.
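
A rough Python sketch of the kind of matching described here. The two patterns and the example IDs in the comments are assumptions made for illustration, since the actual list of identifiers is not given in this thread. `channels_from_id` and `sequences_from_id` are hypothetical helper names. Only the M.K. value (31 instead of 15) comes from the comment above.

```python
import re
from typing import Optional

# Hypothetical ID patterns, for illustration only.
CHANNEL_ID_PATTERNS = [
    re.compile(r"(\d)CHN"),   # one digit followed by "CHN", e.g. "6CHN"
    re.compile(r"(\d\d)CH"),  # two digits followed by "CH", e.g. "16CH"
]

# Sequence counts keyed by ID; only the value mentioned in the thread is listed.
# Other IDs may need their own entries.
SEQUENCES_BY_ID = {"M.K.": 31}
DEFAULT_SEQUENCES = 15


def channels_from_id(ident: str) -> Optional[int]:
    """Return the channel count encoded as digits in the ID, if any."""
    for pattern in CHANNEL_ID_PATTERNS:
        m = pattern.fullmatch(ident)
        if m:
            return int(m.group(1))
    return None


def sequences_from_id(ident: str) -> int:
    """Return the number of sequences implied by the ID."""
    return SEQUENCES_BY_ID.get(ident, DEFAULT_SEQUENCES)
```

Without regexes, each pattern would turn into a combination of substring slices, digit checks and string-to-integer conversions, which is the "huge and ugly expression" mentioned above.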

GreyCat commented 6 years ago
KOLANICH commented 6 years ago

1. In real-life trackers it is assumed to be a string. They would usually have reported an error on encountering an unknown or invalid ID, and all known and valid IDs are valid ASCII strings.
2. In some real-life trackers the numbers really are parsed.
3. Yes, it should be faster to match it as an integer, but IMHO, since our specs are not only code but also docs written in a formal language, it's a bit ugly to write it that way, as it obfuscates the meaning and the intentions. I mean, since we think of them as strings and consider the numbers meaningful, we should put those thoughts into the code.
4. I have 13 regular expressions; 6 of them have \d templates and another 3 have alternating characters.
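
To illustrate point 3, here is a small Python sketch comparing the two ways of checking the M.K. identifier. It assumes the 4 bytes would be read as a big-endian unsigned integer. The byte order is an assumption, since it is not stated in this thread.

```python
raw = b"M.K."

# Integer comparison: a single numeric check, but the constant hides the meaning.
as_int = int.from_bytes(raw, "big")
matches_int = as_int == 0x4D2E4B2E

# String comparison: the intent is readable directly from the spec.
as_str = raw.decode("ascii")
matches_str = as_str == "M.K."
```

Both checks accept the same bytes. The difference is only in how much a reader has to decode in their head.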

jocelynke commented 4 years ago

Hello guys, I was interested in regex support for checking whether fields are valid. My requirement was not the same as the one @KOLANICH expressed: I did not need to extract groups, just to check the validity of the whole field.

The proposal is to introduce this keyword:

- id: str_test
  type: str
  size: 2
  valid:
    regex: '[a-z]\d'

I have already implemented this feature for C++, Python, Java, JS, C# and Go, but it has not been extensively tested. So far I have encountered only one major drawback: there is no native regex support in C++98, which I have not solved yet.
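
As a rough idea of what such a check amounts to at parse time, here is a minimal Python sketch. It is not the code from the fork: `validate_regex` and `ValidationRegexMismatchError` are hypothetical names, and the use of a whole-string match (`re.fullmatch`) is an assumption based on the stated goal of validating the whole field.

```python
import re


class ValidationRegexMismatchError(Exception):
    """Hypothetical error type raised when a field fails its regex check."""


def validate_regex(value: str, pattern: str) -> str:
    """Return value unchanged if the whole string matches pattern, else raise."""
    if re.fullmatch(pattern, value) is None:
        raise ValidationRegexMismatchError(
            f"{value!r} does not match {pattern!r}"
        )
    return value


# A 2-character field checked against the pattern from the example above.
str_test = validate_regex("a7", r"[a-z]\d")
```

A partial match (for example `re.search`) would accept strings that merely contain a valid fragment, so the choice between the two matters for a validity check.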

If you are interested in this feature, the commit is available in my fork: https://github.com/jocelynke/kaitai_struct_compiler. Any feedback is valuable. I could keep working on this feature to get it into the Kaitai master branch.