hanover-computing / canonicize-url

Get a stable, canonical version of any URL, with DNS and HTTPS checks, redirects, tracker stripping, and canonical link extraction!
GNU Lesser General Public License v3.0

Benchmark built-in RegExp vs. RE2 #4

Closed JaneJeon closed 1 year ago

JaneJeon commented 2 years ago

I tried switching from the built-in RegExp to RE2 (basically just https://github.com/JaneJeon/normalize-url-plus/blob/5c4e7b85193a9fe00c25b0c9f47c61a586022d87/utils/strip-trackers.js#L9). The tests ran fine, and I also ran the regexes from the entire clearURLs ruleset against a couple of URLs with the CLI. Both seemed to be fine, so it looks like none of the regexes (so far) contain backtracking-only patterns such as backreferences or lookarounds, which RE2 does not support.
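A minimal sketch of that compatibility check, assuming the ruleset has already been flattened into an array of pattern strings. The helper name `findUnsupportedPatterns` and the sample patterns are hypothetical, not part of this repo or of clearURLs. It defaults to the built-in `RegExp` so it runs without extra dependencies; passing the constructor exported by the `re2` npm package should surface any pattern RE2 refuses to compile.

```javascript
// Hypothetical helper: try to compile every pattern with the given engine
// constructor and collect the ones that throw.
function findUnsupportedPatterns(patterns, Engine = RegExp) {
  const failures = [];
  for (const source of patterns) {
    try {
      // Only compilation is checked here, not matching behavior.
      new Engine(source, 'i');
    } catch (err) {
      failures.push({ source, message: err.message });
    }
  }
  return failures;
}

// Made-up sample patterns; the last one is deliberately invalid.
const samplePatterns = ['[?&]utm_[a-z]+=[^&]*', '[?&]fbclid=[^&]*', '('];
const failures = findUnsupportedPatterns(samplePatterns);
console.log(`${failures.length} pattern(s) failed to compile`);

// Assumption: with the `re2` package installed, the same check would be
//   const RE2 = require('re2');
//   findUnsupportedPatterns(samplePatterns, RE2);
```

An empty result only shows that every pattern compiles under the chosen engine; it says nothing about speed.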

However: 1. we cannot be sure future clearURLs rulesets won't include a pathological regex pattern (one prone to catastrophic backtracking), and 2. RE2 may be faster even with the existing ruleset.

But again, we need to benchmark the regexes in order to be confident that switching over to RE2 won't tank our current perf. We need to establish a baseline...
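One rough way to get that baseline is to time the same patterns and URLs under each engine. The sketch below is only an illustration: `benchmarkEngine`, the sample patterns, and the sample URLs are all made up, and a real run would want the actual clearURLs ruleset, a warm-up pass, and several repetitions. It takes the regex constructor as a parameter, so the built-in `RegExp` gives the baseline and the `re2` package's constructor could be passed in for comparison.

```javascript
const { performance } = require('node:perf_hooks');

// Hypothetical helper: compile every pattern once with `Engine`, then time
// how long it takes to test every URL against every compiled pattern.
function benchmarkEngine(Engine, patterns, urls, iterations = 1000) {
  // No `g` flag, so `test()` carries no `lastIndex` state between calls.
  const compiled = patterns.map((source) => new Engine(source, 'i'));
  let matches = 0;
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    for (const url of urls) {
      for (const regex of compiled) {
        if (regex.test(url)) matches++;
      }
    }
  }
  const elapsedMs = performance.now() - start;
  return { elapsedMs, matches };
}

// Made-up inputs for illustration only.
const patterns = ['[?&]utm_[a-z]+=[^&]*', '[?&]fbclid=[^&]*'];
const urls = ['https://example.com/?utm_source=x', 'https://example.com/page'];

const baseline = benchmarkEngine(RegExp, patterns, urls);
console.log(`RegExp: ${baseline.elapsedMs.toFixed(2)} ms, ${baseline.matches} matches`);

// Assumption: with the `re2` package installed, the comparison run would be
//   const RE2 = require('re2');
//   benchmarkEngine(RE2, patterns, urls);
```

Comparing the `matches` counts between engines doubles as a sanity check that both agree on the same inputs before the timings are trusted.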