Open remyschwab opened 1 year ago
With the right encoding (i.e. a limit on the k-mer length for amino acids), we can encode anchors and optimize the search at the IBF stage.
Operations such as explicit count ranges for wildcards are common in biological motifs but not in any POSIX regex flavour.
After the user inputs a query, we should internally split it into the part used to generate a kNFA and the part used during verification by RE2.
Some motifs contain anchors such as < and > (i.e. ^ and $) for motifs that occur at the N- and C-terminus. The IBF can ignore these anchors, but RE2 cannot.
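A rough sketch of what that split could look like, written in Python only for illustration. The function name `split_query` and the two return values are hypothetical and not part of the current code base. It assumes the anchors only ever appear as the first and last character of the motif.

```python
def split_query(motif: str) -> tuple[str, str]:
    """Split a motif into (ibf_pattern, verify_pattern).

    Hypothetical helper: assumes '<' may only appear as the first character
    (N-terminus anchor) and '>' only as the last (C-terminus anchor).
    """
    at_n_terminus = motif.startswith("<")
    core = motif[1:] if at_n_terminus else motif
    at_c_terminus = core.endswith(">")
    core = core[:-1] if at_c_terminus else core

    # The IBF / kNFA side ignores the anchors entirely.
    ibf_pattern = core
    # The verification side gets them back as regular regex anchors.
    verify_pattern = ("^" if at_n_terminus else "") + core + ("$" if at_c_terminus else "")
    return ibf_pattern, verify_pattern
```

For example, `split_query("<MKV")` would give `"MKV"` for the kNFA and `"^MKV"` for verification, while a motif without anchors passes through unchanged on both sides.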
Also, the entire regex should be wrapped in a capture group:
(REGEX)
so that the match can be reported as a whole
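To illustrate the idea, here is a small sketch that uses Python's `re` module as a stand-in for RE2. The helper names `wrap_in_capture_group` and `find_whole_matches` are made up for this example. The point is that group 1 always holds the full match, no matter how many groups the user's regex already contains.

```python
import re


def wrap_in_capture_group(regex: str) -> str:
    """Wrap the user's regex so that capture group 1 spans the whole match."""
    return f"({regex})"


def find_whole_matches(regex: str, sequence: str) -> list[str]:
    """Return every match of `regex` in `sequence`, read from group 1.

    Python's `re` is used here only as a stand-in for RE2.
    """
    pattern = re.compile(wrap_in_capture_group(regex))
    # Any groups inside the user's regex shift to 2, 3, ...; group 1 is the whole match.
    return [m.group(1) for m in pattern.finditer(sequence)]
```

With a regex that already has its own group, such as `A(C|G)T`, the inner group becomes group 2 and the reported hit is still the complete match.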