microsoft / sarif-pattern-matcher

Quality domain agnostic regular expression pattern matcher that persists results to SARIF
MIT License

Performance fix to match Legacy speed #672

Closed HulonJenkins closed 2 years ago

HulonJenkins commented 2 years ago

Changes

This PR was created off the same branch as https://github.com/microsoft/sarif-pattern-matcher/pull/665

Update caching so that as little work as possible is repeated, which improves the tool's performance.

The cache now stores all relevant information in a dictionary. With this update, performance is 1:1 with Legacy when running the same ruleset.

Started by caching only the int[], then added caching for the String8 and the byte[] buffer as well after noticing a non-negligible amount of time spent in the String8 conversion function.

From there, tests started failing because some files produce both raw and base64-decoded text to be scanned. The cached String8/byte[] weren't being updated in that case, since the input text was different but came from the same file. So the caching was updated to map each input text (raw, base64-decoded, or anything else) to its own String8 and byte[] tuple.
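The stale-cache failure described above can be illustrated with a minimal, language-neutral sketch in Python. It is not the project's C# code: `convert`, `PerFileCache`, and `PerTextCache` are hypothetical names, and plain `bytes`/`bytearray` values stand in for String8 and byte[].

```python
import base64
from typing import Dict, Tuple

# Hypothetical stand-in for the String8/byte[] pair built from one input text.
Converted = Tuple[bytes, bytearray]


def convert(text: str) -> Converted:
    """Assumed conversion step: encode once, keep both derived forms."""
    utf8 = text.encode("utf-8")
    return utf8, bytearray(utf8)


class PerFileCache:
    """One slot per file. Goes stale when a file yields several texts."""

    def __init__(self) -> None:
        self._slots: Dict[str, Converted] = {}

    def get(self, path: str, text: str) -> Converted:
        if path not in self._slots:
            self._slots[path] = convert(text)
        # Returned even if `text` differs from the text that filled the slot.
        return self._slots[path]


class PerTextCache:
    """One slot per (file, input text), so each text gets its own pair."""

    def __init__(self) -> None:
        self._slots: Dict[str, Dict[str, Converted]] = {}

    def get(self, path: str, text: str) -> Converted:
        per_file = self._slots.setdefault(path, {})
        if text not in per_file:
            per_file[text] = convert(text)
        return per_file[text]


raw = "c2VjcmV0"                                  # raw file content
decoded = base64.b64decode(raw).decode("utf-8")   # same file, decoded view

stale = PerFileCache()
stale.get("a.txt", raw)
stale_result = stale.get("a.txt", decoded)[0]     # still the raw bytes

fixed = PerTextCache()
fixed.get("a.txt", raw)
fixed_result = fixed.get("a.txt", decoded)[0]     # bytes of the decoded text
```

Here `stale_result` holds the bytes of the raw text even though the decoded text was requested, while `fixed_result` holds the bytes of the decoded text.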

From there, noticed String8-to-string comparison failures and a lot of time spent comparing incoming text against the cache to see whether it was already stored. The incoming text was a basic string while the dictionary key was a String8, so comparing the two was very slow. Swapped to using the basic string as the dictionary key to speed up the comparison, and began storing the String8, int[], and byte[] together as a tuple.
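A rough Python sketch of the final cache shape follows, under stated assumptions. It is not the project's C# implementation: `TextCache`, `to_utf8`, and `line_starts` are hypothetical names, `bytes` stands in for String8, and the int[] is assumed here to hold line-start offsets, which the PR description does not specify. The `conversions` counter exists only to show that a repeated lookup with the same plain string does no conversion work.

```python
from typing import Dict, List, Tuple

conversions = 0  # counts how often the (assumed) expensive conversion runs


def to_utf8(text: str) -> bytes:
    """Hypothetical stand-in for the String8 conversion."""
    global conversions
    conversions += 1
    return text.encode("utf-8")


def line_starts(buffer: bytes) -> List[int]:
    """Assumed meaning of the int[]: byte offset where each line begins."""
    starts = [0]
    for index, byte in enumerate(buffer):
        if byte == 0x0A:  # newline
            starts.append(index + 1)
    return starts


# (String8 stand-in, int[] stand-in, byte[] stand-in)
Entry = Tuple[bytes, List[int], bytearray]


class TextCache:
    """Keyed by the plain incoming string, so lookups need no conversion."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def get(self, text: str) -> Entry:
        entry = self._entries.get(text)
        if entry is None:
            utf8 = to_utf8(text)
            entry = (utf8, line_starts(utf8), bytearray(utf8))
            self._entries[text] = entry
        return entry


cache = TextCache()
first = cache.get("a\nbc\n")
second = cache.get("a\nbc\n")  # served from the cache, no new conversion
```

Keying on the type that arrives at the lookup site means the key never has to be converted or compared across types; the derived values are built once and travel together in the tuple.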