katef / libfsm

DFA regular expression library & friends
BSD 2-Clause "Simplified" License

Feature request: Add support for embedded NUL bytes in PCRE patterns #468

Open VictorSCushmanFastly opened 3 months ago

VictorSCushmanFastly commented 3 months ago

Patterns containing embedded NUL bytes compile successfully with pcre2_compile when its length argument is an explicit byte count rather than PCRE2_ZERO_TERMINATED (e.g. for length-counted binary strings).

These same patterns do not compile with libfsm, where re_comp currently returns RE_EXEOF.

This behavior can be tested from the command line with:

$ echo -ne 'a\x00b' | re -l c -k pair -r pcre -y /dev/stdin
/dev/stdin:1: Syntax error: expected EOF

or by invoking re_comp with a custom byte-string iterator that does not return EOF when \0 is encountered in an input pattern.
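As a minimal sketch of such an iterator, the following getc-style callback walks a length-counted buffer and signals EOF only when the byte count is exhausted, so a 0x00 byte comes back as the value 0. The names `byte_span` and `span_getc` are hypothetical and not part of libfsm; the `int (*)(void *)` shape is an assumption about the kind of callback re_comp accepts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical length-counted pattern buffer; not a libfsm type. */
struct byte_span {
	const unsigned char *p; /* start of the pattern bytes */
	size_t len;             /* total number of bytes, NULs included */
	size_t pos;             /* index of the next byte to return */
};

/*
 * getc-style callback: returns the next byte as a non-negative int,
 * or EOF once pos reaches len. A 0x00 byte is returned as 0, not EOF.
 */
static int
span_getc(void *opaque)
{
	struct byte_span *s = opaque;

	assert(s != NULL);

	if (s->pos >= s->len) {
		return EOF;
	}

	return s->p[s->pos++];
}
```

Assuming re_comp takes a callback of this shape together with an opaque pointer, passing `span_getc` and a `struct byte_span` covering the three bytes 'a', 0x00, 'b' should exercise the same code path as the command above.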

It would be nice to have a way to compile byte strings with embedded NUL bytes, either by matching PCRE2's behaviour verbatim or via an additional fsm_options flag indicating that binary strings are accepted in PCRE patterns.

katef commented 3 months ago

Current behaviour introduced in https://github.com/katef/libfsm/commit/6b1a76998362bb6ac07d60cedd01f1ee1bdd637e

We'd expose this as a compile-time flag for libre's API, and conditionally map \0 to TOK_CHAR in the terminal extraction section for sid.
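A rough illustration of that conditional mapping, reading "compile-time flag" as a flag supplied when the pattern is compiled. Everything here is hypothetical: `SKETCH_NUL_IS_CHAR`, `TOK_INVALID` and `classify_byte` are stand-ins, and libfsm's real lexer, token set and flag names may look quite different. Only TOK_CHAR is taken from the comment above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical token kinds; only TOK_CHAR is named in the thread. */
enum tok { TOK_EOF, TOK_CHAR, TOK_INVALID };

/* Hypothetical flag bit; not an existing libre flag. */
enum { SKETCH_NUL_IS_CHAR = 1 << 0 };

/*
 * Classify one input byte (or EOF) for the lexer.
 * Without the flag, NUL is rejected; with it, NUL is an ordinary character.
 */
static enum tok
classify_byte(int c, unsigned flags)
{
	assert(c == EOF || (c >= 0 && c <= 255));

	if (c == EOF) {
		return TOK_EOF;
	}

	if (c == '\0' && !(flags & SKETCH_NUL_IS_CHAR)) {
		return TOK_INVALID; /* stand-in for today's rejection of NUL */
	}

	return TOK_CHAR;
}
```

Keeping the default unchanged means existing callers would see no difference unless they opt in.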