simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

Change lightgrep scanners to use lookaround assertions #163

Open jonstewart opened 3 years ago

jonstewart commented 3 years ago

Lightgrep has some limited support for lookaround assertions now, which would be very useful in almost all of the lightgrep-based scanners.

simsong commented 3 years ago

What is the advantage of doing this? And how do we test it? And don’t these things dramatically increase processing time?

jonstewart commented 3 years ago

  1. It eliminates a bunch of fiddly post-search code.
  2. With unit tests first, then on data.
  3. It depends on the expression, the input data, and the regex engine. We can measure, but I doubt that it'll have a significant impact. Lightgrep uses a two-byte filter at the tightest rank within the automaton that's before the minimum length of a hit, and a one-character forward assertion shouldn't affect that much, if at all.
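
To make point 1 concrete, here is a minimal sketch of the general idea. It uses Python's `re` module, not Lightgrep, and the SSN-like pattern and function names are hypothetical illustrations rather than anything taken from an actual bulk_extractor scanner. It assumes the "fiddly post-search code" is of the kind that inspects the byte after a hit and discards the hit if that byte is a digit; a one-character negative lookahead expresses the same check inside the pattern.

```python
import re

# Hypothetical pattern for illustration only: an SSN-like digit group.
BASE = re.compile(rb"\d{3}-\d{2}-\d{4}")

# The same pattern with a one-character negative lookahead appended.
WITH_LOOKAHEAD = re.compile(rb"\d{3}-\d{2}-\d{4}(?!\d)")


def hits_with_post_filter(data: bytes) -> list:
    """Search with the bare pattern, then validate each hit by hand."""
    hits = []
    for m in BASE.finditer(data):
        # Post-search check: look at the single byte after the hit.
        # At end of input the slice is empty, and b"".isdigit() is False.
        following = data[m.end():m.end() + 1]
        if following.isdigit():
            continue  # hit runs into more digits, so discard it
        hits.append(m.group())
    return hits


def hits_with_lookahead(data: bytes) -> list:
    """Let the pattern itself reject hits that are followed by a digit."""
    return [m.group() for m in WITH_LOOKAHEAD.finditer(data)]


SAMPLE = b"ssn 123-45-6789 ok, id 123-45-67890 bad, end 987-65-4321"

if __name__ == "__main__":
    print(hits_with_post_filter(SAMPLE))
    print(hits_with_lookahead(SAMPLE))
```

On this sample both functions return the same hits, and the second needs no validation code after the search. Whether Lightgrep's limited lookaround support covers a given scanner's checks would have to be confirmed per scanner, for example with the unit tests mentioned in point 2.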

simsong commented 3 years ago

Makes sense. I would really like to have some performance comparisons between the flex scanners and lightgrep. We should be sure to put that in the paper.