jtmoon79 / super-speedy-syslog-searcher

Speedily search and merge log messages by datetime
MIT License
38 stars 1 forks source link

consolidate regular expression matching for numeric timezones #288

Open jtmoon79 opened 5 months ago

jtmoon79 commented 5 months ago

Summary

Three regular expressions are made to match the differing numeric timezone formats +00, +0000, +00:00. Combining these three regular expressions into one regular expression would reduce the total number of DTPD!, i.e. DateTimeParseInstr regular expressions.

Current behavior

Currently, three DateTimeParseInstr entries (each holding a compiled regular expression) are created for the three timezone numeric formats; +00, +0000, +00:00. For example, datetime.rs:#L3438-L3471. There are currently 153 regular expressions that might be compiled during a program run.

A run of s4 to process ./logs/other/tests/numbers3.log does not successfully parse any log messages. This means all 153 regular expressions were compiled in unsuccessful attempts to match a datetimestamp. According to the attached flamegraph, 37.80% of that runtime was spent in regex::regex::bytes::Regex::new.

flamegraph-numbers3

Suggested behavior

Create one DateTimeParseInstr (regular expression) to match the three types of numeric timezone formats. This would greatly reduce the maximum number of regular expressions that may be compiled at runtime, and reduce the number "zero block" match attempts made for some log files (match attempts made before a DateTimeParseInstr is chosen for that file).

Tradeoff

There is an advantage to have multiple more precise DateTimeParseInstr (regular expressions); errant matches are less likely.

Additionally, the maximum cost for compiling regular expressions is a fixed amount. As the amount of data processed increases (larger files, more files) then that fixed cost shrinks relative to the total runtime.

... so I'm thinking the multiple timezone matches should be left in-place... 🤔🤔🤔

A good compromise would be allowing regex with "high confidence" matches to consolidate numeric timezones. Whereas matches that are generalized, "low confidence", would have distinguished timezones.

A "high confidence" might be:

[INFO] [20220609T004532+0000] Hello | goodbye

Matched [INFO] [20220609T004532+0000]

Notice the matched brackets and [INFO] and anchored from the line beginning give higher confidence this is the correct datetimestamp.

a "low confidence" might be:

Hello 20220609T004532+0000 goodbye

Matched 20220609T004532+0000

Notice the match is not "anchored" to the line beginning. Notice there are no brackets hinting the datetimestamp is or is not part of the surrounding substrings, e.g. 20220609T004532+0000 might be a substring in the logged message and not a datetimestamp.