philterd / phileas

The open source PII and PHI redaction engine
https://www.philterd.ai
Apache License 2.0

Email address filtering needs optimization or relaxed mode #121

Closed RobDickinson closed 2 weeks ago

RobDickinson commented 1 month ago

phileas-benchmark results show that email address detection is more CPU-intensive (and requires more memory and stack space) than the other regex-based filters.

Performance of single identifiers with 4k values:
mask_credit_cards - 35k calls/sec
mask_bitcoin_addresses - 31k calls/sec
mask_iban_codes - 26k calls/sec
mask_bank_routing_numbers - 27k calls/sec
mask_ssns - 16k calls/sec
mask_phone_numbers - 14k calls/sec
mask_email_addresses - 5k calls/sec 🔥

The current regex is known to be pretty expensive, so it might make sense to have a "relaxed" version that performs better without trading off too much accuracy.

RobDickinson commented 3 weeks ago

@jzonthemtn I'm looking at a few regex variations that show better performance, but I need to do some more testing to see how accuracy is affected in the data I have available.

One interesting bit though -- the email address filter currently does not use the \b...\b fencing that many of the regex-based filters use. Wrapping the current email address regex in \b...\b roughly doubles performance on its own. I think that makes sense since it reduces how greedy some of those matches will be.

👆 Since we're also discussing use of \b from a confidence standpoint (in #120), I thought this was kinda neat to see how much the \b...\b fencing plays into performance too.
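
To make the \b...\b idea concrete, here is a minimal Java sketch that compares an unfenced and a fenced pattern. The regex is a simplified stand-in and not the one Phileas actually uses, and the class and method names (EmailFenceSketch, findAll, timeNanos) are hypothetical. One plausible reason for the speedup is that a leading \b lets the engine skip start positions in the middle of a word instead of retrying the local-part match from every character. The timings are a rough illustration only; a proper comparison would use phileas-benchmark or JMH.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailFenceSketch {

    // Simplified, hypothetical email pattern; NOT the regex Phileas uses.
    public static final String EMAIL =
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}";

    // Same pattern without and with word-boundary fencing.
    public static final Pattern UNFENCED = Pattern.compile(EMAIL);
    public static final Pattern FENCED = Pattern.compile("\\b" + EMAIL + "\\b");

    // Collects every match of the pattern in the text, in order.
    public static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Crude wall-clock timing of repeated scans; not a real benchmark.
    public static long timeNanos(Pattern pattern, String text, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            findAll(pattern, text);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Build a few KB of mostly non-matching text with some addresses mixed in.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("Lorem ipsum dolor sit amet user").append(i)
              .append("@example.com consectetur adipiscing elit. ");
        }
        String text = sb.toString();

        // Warm up the JIT before timing.
        timeNanos(UNFENCED, text, 2_000);
        timeNanos(FENCED, text, 2_000);

        System.out.println("unfenced ns: " + timeNanos(UNFENCED, text, 10_000));
        System.out.println("fenced   ns: " + timeNanos(FENCED, text, 10_000));
    }
}
```

Fencing can also change what gets matched, which matters for the accuracy question. With this simplified pattern, the input "x -foo@example.com" yields "-foo@example.com" unfenced but "foo@example.com" fenced, because there is no word boundary between the space and the hyphen.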