intel / hyperscan

High-performance regular expression matching library
https://www.hyperscan.io
Other
4.71k stars 705 forks source link

Literals shouldn't need to be NULL-terminated. #386

Open rmharris opened 1 year ago

rmharris commented 1 year ago

The documentation for hs_compile_lit() states that an expression must be NULL-terminated even though its length must be supplied. Looking though hs_compile_lit_multi_int() I see that nothing after the declared length is used by https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hs.cpp#L362-L364 However, there follows a call to https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hs.cpp#L373-L374 then https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/parser/logical_combination.cpp#L143-L144 and finally https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hs.cpp#L530-L536 Given that the documentation explicitly allows for \0 to appear in the expression it seems like there is some further bug here, perhaps the absence of a hs_expression_lit_info().

See also #205.

hongyang7 commented 1 year ago

@rmharris Thanks for pointing this. hs_expression_lit_info() is absent. Will fix that.

elmart-devo commented 1 year ago

I've also been bitten by this, as before it was possible to search for nul bytes using nul literals (\u0000) in literal expressions (compile_lit & compile_lit_multi), and now that is not possible anymore, and you have to resort to escaped nuls (\\x{00}) and normal expressions (compile & compile_multi) to be able to do that.

So, I'm glad to know this will be fixed.

BTW, @hongyang7, maybe the chance can be taken to clarify compile_lit and compile_lit_multi doc (https://intel.github.io/hyperscan/dev-reference/api_files.html#c.hs_compile_lit). As it currently stands, it first says that expression will be NULL-terminated:

 * @param expression
 *      The NULL-terminated expression to parse.
 [...]

But laters says even more explicitly that NULs are allowed within expression:

* @param len
*      The length of the text content of the pure literal expression. As the
*      text content indicated by @p expression is treated as single character
*      one by one, the special terminating character `\0` should be allowed
*      to appear in expression, and not treated as a terminator for a string.
*      Thus, the end of a pure literal expression cannot be indicated by
*      identifying `\0`, but by counting to the expression length.

That seems contradictory, and IMO it would be best to change expression doc to make it clear that NULL-termination is not needed.

Thx everybody involved for this great project.

hongyang7 commented 1 year ago

@elmart-devo Good point. Thanks.