Closed robfromboulder closed 2 months ago
Here are some phileas-benchmark results to show the performance improvement on my reference system.
java -server -Xmx512M -XX:+AlwaysPreTouch -XX:PerBytecodeRecompilationCutoff=10000 -XX:PerMethodRecompilationCutoff=10000 -jar phileas-benchmark-cmd.jar i_have_a_dream mask_email_addresses 1 15000
CURRENT                      WITH PR CHANGES
===========================  ===========================
string_length,calls_per_sec  string_length,calls_per_sec
0,1305740                    0,1304860
1,1304026                    1,1290220
2,1196293                    2,1215300
4,1130073                    4,1117353
8,961333                     8,1000493
16,746520                    16,851080
32,516473                    32,683493
64,317826                    64,483360
128,148720                   128,309466
256,73420                    256,174953
512,39121                    512,92220
768,26249                    768,62980
1024,19802                   1024,47763
1280,15510                   1280,39106
1536,13032                   1536,32700
1792,11209                   1792,28186
2048,9898                    2048,24760
3072,6450                    3072,16211
4096,4885                    4096,12219
👆 This benchmark uses onlyStrictMatches = true (the default mode). With relaxed mode, performance is significantly higher but with some loss in accuracy.
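To make the table easier to read at a glance, here is a small sketch that computes the per-row speedup (PR calls_per_sec divided by current calls_per_sec) for a few rows copied from the results above. The class and method names are made up for this comment and are not part of phileas-benchmark. By this measure the PR is roughly 1.5x faster at 64 characters and about 2.5x faster at 4096 characters.

```java
class SpeedupTable {
    // Ratio of PR throughput to current throughput for one table row.
    static double speedup(long currentCallsPerSec, long prCallsPerSec) {
        return (double) prCallsPerSec / currentCallsPerSec;
    }

    public static void main(String[] args) {
        // {string_length, current calls_per_sec, PR calls_per_sec}, copied from the table above
        long[][] rows = {
            {64, 317826, 483360},
            {256, 73420, 174953},
            {1024, 19802, 47763},
            {4096, 4885, 12219},
        };
        for (long[] row : rows) {
            System.out.printf("%d chars: %.2fx%n", row[0], speedup(row[1], row[2]));
        }
    }
}
```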
That's fantastic! Thanks @RobDickinson!
Do you think onlyStrictMatches = false should make a match have a lower confidence? Say, strict is 0.9 and otherwise it's something lower? I think I could see the case being made both ways - as the email filter is configured, the probability is 0.9.
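For the sake of discussion, the idea could look something like the sketch below. Only the 0.9 figure comes from this thread; the relaxed value, the class, and the method are hypothetical and do not reflect how Phileas assigns confidence today.

```java
class EmailConfidence {
    static final double STRICT_CONFIDENCE = 0.9;   // value mentioned in this thread
    static final double RELAXED_CONFIDENCE = 0.75; // hypothetical, chosen only for illustration

    // Pick a confidence for an email match depending on the matching mode.
    static double confidenceFor(boolean onlyStrictMatches) {
        return onlyStrictMatches ? STRICT_CONFIDENCE : RELAXED_CONFIDENCE;
    }

    public static void main(String[] args) {
        System.out.println("strict:  " + confidenceFor(true));
        System.out.println("relaxed: " + confidenceFor(false));
    }
}
```

The alternative is to keep a single value for both modes and let the filter configuration decide, which is the "both ways" trade-off.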
Changes related to #121
EmailAddress.onlyStrictMatches (true by default, uses original regex)
EmailAddress.onlyStrictMatches = false ([…] \b...\b fencing around the regex)
With EmailAddress.onlyStrictMatches = false, email address detection is 4x faster than before
There are only three identified cases where matching differs between strict and relaxed modes, which seems an acceptable trade-off for the large difference in performance.
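As a rough illustration of what \b...\b fencing does to an email regex, here is a minimal sketch using java.util.regex. The two patterns are simplified stand-ins written for this comment, not the regexes Phileas uses, and the sketch makes no claim about which mode they correspond to. On the sample text, the unfenced pattern picks up the leading hyphen while the fenced one starts at the first word character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FencingDemo {
    // Simplified, hypothetical email pattern (NOT the Phileas regex).
    static final String CORE = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}";
    static final Pattern UNFENCED = Pattern.compile(CORE);
    // Same pattern, but it must start and end on a word boundary.
    static final Pattern FENCED = Pattern.compile("\\b" + CORE + "\\b");

    // Collect every non-overlapping match of the pattern in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "write to -bob@example.com now";
        System.out.println("unfenced: " + findAll(UNFENCED, text));
        System.out.println("fenced:   " + findAll(FENCED, text));
    }
}
```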
Phileas also uses less stack as a result of these changes -- with the previous implementation, I was seeing a lot of StackOverflowErrors with large strings even when configuring a larger stack size than default.
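For context, one common way java.util.regex runs out of stack is an alternation inside a repeated group, which the engine matches recursively, so stack depth grows with input length. Whether that is the exact cause in the previous Phileas implementation is an assumption on my part. The sketch below uses toy patterns and needs Java 11+ for String.repeat; the alternation may or may not overflow depending on the JVM and its stack size, so the code just reports what happened.

```java
import java.util.regex.Pattern;

class RegexStackDemo {
    // Alternation inside a repeated group: matched recursively, one level per repetition.
    static final Pattern ALTERNATION = Pattern.compile("(?:a|b)*");
    // Equivalent character class: matched in a loop, without deep recursion.
    static final Pattern CHAR_CLASS = Pattern.compile("[ab]*");

    // Run a full match and report the outcome instead of letting the error escape.
    static String tryMatch(Pattern pattern, String input) {
        try {
            return pattern.matcher(input).matches() ? "matched" : "no match";
        } catch (StackOverflowError e) {
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        String input = "ab".repeat(500_000); // 1,000,000 characters
        System.out.println("alternation: " + tryMatch(ALTERNATION, input));
        System.out.println("char class:  " + tryMatch(CHAR_CLASS, input));
    }
}
```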