intel / hyperscan

High-performance regular expression matching library
https://www.hyperscan.io

Hyperscan database compilation problems #306

Open pzhang714 opened 3 years ago

pzhang714 commented 3 years ago

First question: With a large number of patterns, compilation takes a long time. How can we speed it up?

Second question: With a large number of patterns, compiling the Hyperscan database needs a lot of temporary memory. Can the temporary memory used during compilation be reduced? If so, how?

xiangwang1 commented 3 years ago

We designed Hyperscan as a performance-oriented library, so it performs comprehensive analysis at compile time to achieve the best scanning performance. This means compilation may take longer and use more memory than in other regex matching libraries. In general, we don't provide options for compile-time tuning.

What's your test setup? How many patterns do you use, and what do they look like? It would help to know the compile-time hotspots for your patterns.

pzhang714 commented 3 years ago

  1. The first list contains 100000 patterns. Every pattern has the flag HS_FLAG_QUIET, and all 100000 patterns share the same ID, say ID 1.
  2. The second list also contains 100000 patterns, each with the flag HS_FLAG_QUIET, but every pattern has its own unique ID, say 2 through 100001.
  3. The shared ID from list 1 is then logically combined with the ID of each pattern in list 2 to generate 100000 new patterns: 1 & 2, 1 & 3, 1 & 4, 1 & 5, ..., 1 & 100001. These 100000 patterns use the flag HS_FLAG_COMBINATION.
  4. In this case, compiling the Hyperscan database takes a long time, and in some cases the database cannot be created at all.
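
For reference, steps 1 to 3 can be sketched roughly as below. This is a minimal Python sketch, not the actual test code: it only builds the parallel expression/flag/ID arrays of the kind hs_compile_multi takes and does not call Hyperscan itself. The helper name `build_pattern_set`, the placeholder strings, and the choice to number the combination patterns right after list 2 are assumptions made for illustration; the two flag values are the ones defined in hs_compile.h.

```python
# Flag values as defined in Hyperscan's hs_compile.h.
HS_FLAG_COMBINATION = 512
HS_FLAG_QUIET = 1024


def build_pattern_set(list1, list2, shared_id=1):
    """Build parallel (expressions, flags, ids) lists for the described setup.

    Hypothetical helper: the names and the ID numbering of the combination
    patterns are assumptions, not taken from the original test.
    """
    expressions, flags, ids = [], [], []

    # Step 1: every pattern in list 1 is quiet and shares one ID.
    for pattern in list1:
        expressions.append(pattern)
        flags.append(HS_FLAG_QUIET)
        ids.append(shared_id)

    # Step 2: every pattern in list 2 is quiet and has its own ID.
    first_unique_id = shared_id + 1
    for offset, pattern in enumerate(list2):
        expressions.append(pattern)
        flags.append(HS_FLAG_QUIET)
        ids.append(first_unique_id + offset)

    # Step 3: one combination per list 2 pattern, e.g. "1 & 2", "1 & 3", ...
    # Assumption: combination IDs continue after the last list 2 ID.
    first_combo_id = first_unique_id + len(list2)
    for offset in range(len(list2)):
        expressions.append(f"{shared_id} & {first_unique_id + offset}")
        flags.append(HS_FLAG_COMBINATION)
        ids.append(first_combo_id + offset)

    return expressions, flags, ids


if __name__ == "__main__":
    # Tiny stand-in for the two 100000-entry lists (placeholder strings).
    exprs, flags, ids = build_pattern_set(
        ["a.example", "b.example"],
        ["c.example", "d.example", "e.example"],
    )
    for expr, flag, pattern_id in zip(exprs, flags, ids):
        print(pattern_id, flag, expr)
```

With the real lists this produces 300000 entries in total: 200000 quiet literal patterns plus 100000 combination patterns.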

pzhang714 commented 3 years ago

Supplement:

  1. The patterns in list 1 and list 2 above are all plain literal strings with no regular-expression syntax, for example domain names.

  2. When the patterns in list 2 are all the same string, a large amount of memory is consumed during compilation.