spamscanner / url-regex-safe

Regular expression matching for URLs. Maintained, safe, and browser-friendly version of url-regex. Resolves CVE-2020-7661 for Node.js servers.
https://forwardemail.net/docs/url-regex-javascript-node-js
MIT License

[fix] Invalid protocols are matched #31

Open ghost opened 5 months ago

ghost commented 5 months ago

Describe the bug

Node.js version: v18.18.2

OS version: macOS 14.2.1

Description: If a valid protocol has extra characters preceding it, the extra characters are included in the match.

Actual behavior

urlRegexSafe({ strict: true }).exec("gaewggwhttp://localhost:3000/derp")

Produces a match that contains the entire string, including the leading "gaewggw".

This also happens when strict is set to false.

Expected behavior

The match that is produced does not include the leading "gaewggw".

Code to reproduce

const urlRegexSafe = require('url-regex-safe')
const match = urlRegexSafe({ strict: true }).exec("gaewggwhttp://localhost:3000/derp")
console.log(match)


ghost commented 5 months ago

Okay, I see that the protocol matcher optionally matches any string of a-z letters followed by a colon and then //. I had assumed from the docs that a valid protocol meant a known one (which I realize isn't well defined in itself). Maybe the answer is simply to clarify in the docs what counts as a valid protocol. An option to match only a set of well-known or user-provided protocols could also be a solution.
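For anyone hitting the same thing, here is a rough sketch of the allow-list idea as a post-processing step. It does not use the library's real pattern: looseUrl is a simplified stand-in that only mimics the "optional a-z letters, then a colon, then //" behavior described above, and allowedProtocols, trimToAllowedProtocol and findUrls are hypothetical names made up for this example.

```javascript
// Simplified stand-in for the protocol behavior described above:
// an optional run of a-z letters plus ":", followed by "//".
// This is NOT the library's actual regex; it only illustrates the issue.
const looseUrl = /(?:[a-z]+:)?\/\/[^\s"']+/gi;

// Hypothetical allow-list of protocols the caller considers valid.
const allowedProtocols = ['http', 'https', 'ftp'];

// Trim a match so it starts at an allowed protocol, or return null
// when the scheme does not end with any allowed protocol.
function trimToAllowedProtocol(match, protocols = allowedProtocols) {
  const parsed = /^([a-z]+):\/\//i.exec(match);
  if (!parsed) return match; // protocol-relative ("//host"), keep as is
  const scheme = parsed[1].toLowerCase();
  // Check longer protocols first, so "sftp" would win over "ftp" if both were listed.
  const hit = [...protocols]
    .sort((a, b) => b.length - a.length)
    .find((p) => scheme.endsWith(p));
  return hit ? match.slice(scheme.length - hit.length) : null;
}

// Collect all loose matches in the text, then trim or drop each one.
function findUrls(text) {
  return (text.match(looseUrl) || [])
    .map((m) => trimToAllowedProtocol(m))
    .filter((m) => m !== null);
}

console.log(findUrls('gaewggwhttp://localhost:3000/derp'));
```

Trimming after the fact leaves the matching itself untouched, so it could sit on top of whatever the library returns. The trade-off is that a scheme that merely ends in an allowed protocol gets cut down to it, which may or may not be what you want.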

iim-norse commented 5 months ago

I have almost the same problem with uppercase letters: the regex doesn't handle Site.Com, and Site.com is matched as ite.com.