philterd / phileas

The open source PII and PHI redaction engine
https://www.philterd.ai
Apache License 2.0

Improved performance for email address detection #132

Closed: robfromboulder closed this 2 months ago

robfromboulder commented 3 months ago

Changes related to #121

There are only three identified cases where matching differs between strict and relaxed modes, which seems a worthwhile trade-off for the large difference in performance.

Phileas also uses less stack as a result of these changes. With the previous implementation, I was seeing a lot of StackOverflowErrors on large strings, even when configuring a stack size larger than the default.
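For readers unfamiliar with this failure mode: a common cause of StackOverflowError in java.util.regex is a repeated group (for example one containing alternation), which the engine matches recursively, so stack depth grows with input length. The sketch below illustrates the general remedy of using plain character classes with possessive quantifiers, which are matched in a loop. The pattern and the names EmailScanSketch and findEmails are illustrative assumptions, not the pattern or API that Phileas uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailScanSketch {

    // Illustrative pattern only, NOT the one used by Phileas.
    // The local part uses a possessive quantifier (++), so the engine never
    // backtracks into it. No repeated groups are used, which keeps matching
    // iterative rather than recursive.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]++@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns every substring of text that the sketch pattern matches.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Build a large input to show the scan completes without deep recursion.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("lorem ipsum ");
        }
        sb.append("rob@example.com");
        System.out.println(findEmails(sb.toString()));
    }
}
```

This simplified pattern is far looser than a standards-compliant address grammar; it only demonstrates the stack-friendly shape of the expression.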

robfromboulder commented 3 months ago

Here are some phileas-benchmark results showing the performance improvement on my reference system.

java -server -Xmx512M -XX:+AlwaysPreTouch -XX:PerBytecodeRecompilationCutoff=10000 -XX:PerMethodRecompilationCutoff=10000 -jar phileas-benchmark-cmd.jar i_have_a_dream mask_email_addresses 1 15000

CURRENT                         WITH PR CHANGES
===========================     ===========================
string_length,calls_per_sec     string_length,calls_per_sec
0,1305740                       0,1304860
1,1304026                       1,1290220
2,1196293                       2,1215300
4,1130073                       4,1117353
8,961333                        8,1000493
16,746520                       16,851080
32,516473                       32,683493
64,317826                       64,483360
128,148720                      128,309466
256,73420                       256,174953
512,39121                       512,92220
768,26249                       768,62980
1024,19802                      1024,47763
1280,15510                      1280,39106
1536,13032                      1536,32700
1792,11209                      1792,28186
2048,9898                       2048,24760
3072,6450                       3072,16211
4096,4885                       4096,12219

👆 This benchmark uses onlyStrictMatches = true (the default mode). In relaxed mode, performance is significantly higher, at the cost of some accuracy.
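To put the table in relative terms, the speedup for a row is simply calls_per_sec with the PR divided by calls_per_sec before it. The small sketch below does that arithmetic for four rows copied from the table; the class name BenchmarkSpeedup is made up for illustration and is not part of phileas-benchmark.

```java
class BenchmarkSpeedup {

    // Rows copied from the benchmark table above:
    // {string_length, calls_per_sec before, calls_per_sec with PR changes}
    static final int[][] ROWS = {
        {0, 1305740, 1304860},
        {128, 148720, 309466},
        {1024, 19802, 47763},
        {4096, 4885, 12219},
    };

    // Ratio of throughput after the change to throughput before it.
    static double speedup(int before, int after) {
        return (double) after / before;
    }

    public static void main(String[] args) {
        for (int[] row : ROWS) {
            System.out.printf("%d chars: %.2fx%n", row[0], speedup(row[1], row[2]));
        }
    }
}
```

By this measure throughput is essentially unchanged for empty strings, roughly 2.1x at 128 characters, and roughly 2.5x at 4096 characters.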

jzonthemtn commented 2 months ago

That's fantastic! Thanks @RobDickinson!

Do you think onlyStrictMatches = false should give a match a lower confidence? Say, strict is 0.9 and relaxed is something lower? I could see the case being made both ways. As the email filter is configured now, the probability is 0.9.
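As a rough sketch of the first option, the confidence could be chosen from the mode that produced the match. Everything here is hypothetical: the class and method names are invented for illustration, and the 0.7 relaxed value is an arbitrary placeholder, since the thread only says "something lower" than 0.9.

```java
class ConfidenceSketch {

    // Value mentioned in this thread for the email filter.
    static final double STRICT_CONFIDENCE = 0.9;

    // Placeholder only; no relaxed-mode value is proposed in the thread.
    static final double RELAXED_CONFIDENCE = 0.7;

    // Picks a confidence based on whether strict matching was in effect.
    static double confidenceFor(boolean onlyStrictMatches) {
        return onlyStrictMatches ? STRICT_CONFIDENCE : RELAXED_CONFIDENCE;
    }

    public static void main(String[] args) {
        System.out.println("strict:  " + confidenceFor(true));
        System.out.println("relaxed: " + confidenceFor(false));
    }
}
```

The alternative raised above is to keep 0.9 in both modes, in which case no such branch is needed.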