spamscanner / url-regex-safe

Regular expression matching for URL's. Maintained, safe, and browser-friendly version of url-regex. Resolves CVE-2020-7661 for Node.js servers.
https://forwardemail.net/docs/url-regex-javascript-node-js
MIT License
79 stars 14 forks source link

Matches email addresses (within a larger string) with default options #5

Closed surlyhacker closed 3 years ago

surlyhacker commented 3 years ago

If you take this string: "Please send an email to test@test.com, providing ONLY your first name", this module will return "test.com" as a match.

Is this intended behavior and I am just misunderstanding how it's supposed to work?

I see that the "auth" option (if set to true) may result in undesired matches to email addresses, but it defaults to false. I don't see a test case in the tests for this scenario. (see comment below)

What's interesting is that you commented on this exact problem in a relevant url-regex issue:

We're using this library to pull urls out of a user input string. We're currently running into a problem where it returns email addresses.

'This is an email@email.com'.match(urlRegex({ exact: false, strict: false }));
// returns ['email@email.com']

I think it should return no matches in this case.

...

This issue is fixed in my maintained and modern version of this package at https://github.com/niftylettuce/url-regex-safe. You should be able to switch from url-regex to url-regex-safe now. See the updated list of options as I added some new ones, and changed a few defaults to more sensible ones (since not everyone is parsing Markdown for instance).

Here are the relevant options that might impact this test, although they are all just their default values so wouldn't need to be explicitly set in a test: exact: false strict: false auth: false

surlyhacker commented 3 years ago

Actually there is a test similar to this that seems counterintuitive given my limited understanding of the interplay between the options and the aforementioned comment on that issue with url-regex:

test('parses similar to Gmail by default', (t) => {
  t.deepEqual(
    "foo@bar.com [foo]@bar.com foo bar @foob.com 'text@example.com, some text'".match(
      urlRegex()
    ),
    ['bar.com', 'bar.com', 'foob.com', 'example.com']
  );
});

Is that test case supposed to be finding: 1) URLs with an Authority component (hereafter referred to as "Authority URLs") that are missing a scheme and have no password? (e.g. normally http://foo@bar.com) or 2) Email address URLs that are missing a scheme? (e.g. normally mailto:foo@bar.com)

I would suggest more detailed documentation to help eliminate ambiguity if this is in fact not a bug. Although if it is not a bug, that brings up the question of whether this module can currently be configured to behave as desired in this issue report, or if that would be an enhancement. Personally I see the value in use cases that separate email address from other types of URLs, and would want an option to explicitly include/exclude them.

Moreover (and perhaps most importantly), in a string like "Please use this Authority URL with username: user@foo.com to access the site", I personally would not even consider that to be a valid Authority URL in any context whatsoever. A string with that syntax is an email address. In the modern world in the limited scenarios where Authority URLs are still used, they should always be be prefixed with a scheme and if they are not, they should not be considered Authority URLs regardless of what the spec (RFCs) say (in a context where email addresses may be present).

Edit: Also I don't see why "bar.com" is the matched substring for "foo@bar.com" either. If "foo@bar.com" is considered a URL, shouldn't the entire thing be matched and not just the part after the "@"? Another way of saying this is that if you think you have found a valid URL within a given string, but that URL is preceded by an "@" sign, then you have either not found a valid URL or you have found only a portion of a URL. There are very few non-whitespace characters that I would say are valid to be directly preceding a URL. Perhaps bracket/quote characters might be acceptable e.g. ( [ { ` ' " <

I guess it (characters that can precede a URL) depends on whether the string being parsed is generally human-readable stuff or if it's code or machine-readable.

niftylettuce commented 3 years ago

I think you're looking for email-regex-safe instead. Check out https://spamscanner.net and its source/tests for more insight into how I workaround parsing things out.

niftylettuce commented 3 years ago

You could use email-regex-safe to parse out the emails with blank spaces, e.g. .replace(EMAIL_REGEX, ' ') and then follow it up with your result.match(URL_REGEX)