Open starius opened 2 months ago
thanks @starius
@javierron are you able to code-review this one?
Hi @javierron !
Thank you very much for the review! I addressed the feedback, see my comments.
Maybe the solution is to determine at initialization if a pattern is either literal or regex. I believe this would simplify analizePattern
Can you elaborate on this, please? analizePattern
already determines automatically if a pattern is a literal:
prefix, complete := re.LiteralPrefix()
if complete {
return []string{prefix}, nil
}
(I put this code after checking presence in pattern2literals
table, because it is faster than compiling the regexp, so regexp compilation is not done for cases from pattern2literals
table.)
We still need both tables pattern2literals
and pattern2mainLiteral
, because we need to do something with non-literal regexps. Maybe pattern2literals
could be generated automatically, but such a code would be quite complicated, so I think it is not worth it, to generate such a small table. And for pattern2mainLiteral
I don't know how to automate its creation even in theory: we need to find a long enough literal in regexp which is present in any match. It requires in-depth analysis of the structure of regexp. And the hardcoded table is even smaller...
Do you have ideas how to do it in an elegant way?
@starius thanks for the updates
@javierron let me know how we should proceed thanks.
Hi @starius Thanks for the response.
I mean, we could have two sets: one set for literals, which are checked against using the trie solution; and another set for actual patterns, which are checked against by using a regex chain (regex1|regex2|...|regexn). This would avoid the issue of having to update the program every time a new crawler with a pattern is added.
@monperrus Does that make any sense?
Hi @javierron @monperrus !
I think it is possible to implement func analyzePattern
without hardcoded tables. Go has package regexp/syntax which provides introspection of regular expressions. I think it is possible to use it to convert a pattern to the list of all possible matching strings (what pattern2literals
does) and to extract longest common sub-string from a regexp (what pattern2mainLiteral
does). I'll try to implement this in few days and update the PR.
Most patterns are simple search strings (not special Regexp symbols). Some utilize ^ and $, which can be emulated in plaintext search by appending these characters to the text itself for matching as regular characters. Additionally, some patterns involve (xx|yy) or [xY] structures, which expand to several plaintexts. Rare patterns require real regexp matching.
I've applied these simplifications and modifications. There are two tables: one replaces a pattern with a list of possible search strings, while the other matches rare patterns requiring regexp with specific strings indicating their possible presence in text. The specific string is needed to know when to run the regexp.
Search strings are substituted with a random hex string of length 16 (to prevent spontaneous or intentional matching with anything), followed by a label ("-" for simple search strings, "*" for rare cases requiring regexp, and a number encoded as "%05d" format).
All replacements are performed using strings.Replacer, which utilizes TRIE and is therefore very fast. The random hex string is searched within the output of the replacement. If it's not found, it indicates a mismatch. If found, it's either a match (for simple search string labels) or a potential match (for regexp patterns). In the latter case, the corresponding regexp is executed on the text to verify the match.
Benchmark comparison:
New implementation is 40 times faster!