RobDickinson closed 2 weeks ago
@jzonthemtn I'm looking at a few regex variations that show better performance, but I need to do some more testing to see how accuracy is affected in the data I have available.
One interesting bit though -- the email address filter currently does not use the \b...\b fencing that many of the regex-based filters use. Wrapping the current email address regex in \b...\b roughly doubles performance on its own. I think that makes sense since it reduces how greedy some of those matches will be.
👆 Since we're also discussing use of \b from a confidence standpoint (in #120), I thought this was kinda neat to see how much the \b...\b fencing plays into performance too.
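For anyone who wants to try the \b...\b fencing idea outside of Phileas, here is a rough sketch in Python. The `EMAIL` pattern below is a deliberately simplified, hypothetical regex and not the one the email address filter actually uses. The helpers `find_emails` and `time_pattern` are made up for illustration. Python's `re` engine also differs from Java's, so any timings will not carry over to phileas-benchmark.

```python
import re
import timeit

# Hypothetical, simplified email pattern for illustration only.
# This is NOT the regex used by the Phileas email address filter.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# Same pattern with and without word-boundary fencing.
unfenced = re.compile(EMAIL)
fenced = re.compile(r"\b" + EMAIL + r"\b")


def find_emails(pattern, text):
    """Return every substring of text matched by the compiled pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


def time_pattern(pattern, text, repeats=20):
    """Return total seconds spent scanning text `repeats` times."""
    return timeit.timeit(lambda: find_emails(pattern, text), number=repeats)


if __name__ == "__main__":
    # Synthetic input: mostly filler words with an occasional address.
    filler = "lorem ipsum dolor sit amet " * 200
    text = (filler + "contact jane.doe@example.com today ") * 20

    print("unfenced seconds:", time_pattern(unfenced, text))
    print("fenced seconds:  ", time_pattern(fenced, text))
    print("same matches:    ", find_emails(unfenced, text) == find_emails(fenced, text))
```

Fencing can also change what matches, which ties into the confidence discussion: with this sketch pattern, the input `x@example.com1` yields `x@example.com` unfenced but nothing when fenced, because there is no word boundary between `com` and `1`. Accuracy should be checked alongside any speedup.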
phileas-benchmark results show that email address detection is more CPU intensive (and requires more memory & stack space) than other regex-based filters.
The current regex is known to be pretty intense -- so it might make sense to have a "relaxed" version that performs better without trading off too much accuracy?