Internal Query Preprocessing

remyschwab / TetRex

Efficient search of complex motifs in kmer space

BSD 3-Clause "New" or "Revised" License

Open remyschwab opened 1 year ago

remyschwab commented 1 year ago

After the user inputs a query, we should internally split it into the part used to generate the kNFA and the part used during verification by RE2.

Some motifs contain the anchors < and > (i.e. ^ and $ in regex syntax) for motifs that occur at the N- and C-terminus. The IBF can ignore these anchors, but RE2 cannot.

Also, the entire regex should be wrapped in a capture group, (REGEX), so that the match can be reported as a whole.
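
A minimal sketch of that split, for illustration only. The function name `split_query` is hypothetical, Python's `re` module stands in for RE2, and it assumes an anchor can only appear as the very first or very last character of the query (escaped characters such as `\$` are not handled):

```python
import re


def split_query(query: str) -> tuple[str, str]:
    """Split a user query into (knfa_pattern, verification_pattern).

    Hypothetical helper: anchors are assumed to be the first/last character.
    """
    anchored_start = query.startswith(("^", "<"))
    anchored_end = query.endswith(("$", ">"))

    # Strip the anchors: the kNFA / IBF stage sees only the motif itself.
    core = query[1:] if anchored_start else query
    core = core[:-1] if anchored_end else core

    # Verification keeps the anchors (normalised to ^ and $) and wraps the
    # whole motif in one capture group so the full match can be reported.
    verification = (
        ("^" if anchored_start else "")
        + "(" + core + ")"
        + ("$" if anchored_end else "")
    )
    return core, verification


knfa_pattern, verify_pattern = split_query("<MA(K|R)")
match = re.search(verify_pattern, "MAKLV")  # stand-in for the RE2 call
```

Here `knfa_pattern` is `MA(K|R)`, `verify_pattern` is `^(MA(K|R))`, and `match.group(1)` holds the whole matched motif rather than only the inner alternation.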

remyschwab commented 1 year ago

With the right encoding (i.e. a limit on the kmer length for amino acids), we could encode the anchors and optimize the search at the IBF stage.

remyschwab commented 1 year ago

Explicit count ranges for wildcards are common in biological motifs but not in any POSIX regex flavour.
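
One way such count ranges could be handled during preprocessing is to rewrite them into brace intervals before verification. This is a sketch under assumptions: it presumes PROSITE-style `x(n)` and `x(n,m)` notation with a lowercase `x` as the wildcard (not confirmed as the input format here), and the helper name `expand_count_ranges` is hypothetical.

```python
import re

# Matches x(3) or x(2,4); the second number is optional.
_COUNT_RANGE = re.compile(r"x\((\d+)(?:,(\d+))?\)")


def expand_count_ranges(motif: str) -> str:
    """Rewrite x(n) / x(n,m) wildcard ranges as .{n} / .{n,m}."""

    def to_interval(m: re.Match) -> str:
        low, high = m.group(1), m.group(2)
        if high is None:
            return ".{" + low + "}"
        return ".{" + low + "," + high + "}"

    return _COUNT_RANGE.sub(to_interval, motif)


pattern = expand_count_ranges("Cx(2,4)C")
```

With this input, `pattern` becomes `C.{2,4}C`, which uses the brace repetition syntax that RE2 accepts.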