Closed surlyhacker closed 3 years ago
Actually there is a test similar to this that seems counterintuitive given my limited understanding of the interplay between the options and the aforementioned comment on that issue with url-regex:
test('parses similar to Gmail by default', (t) => {
t.deepEqual(
"foo@bar.com [foo]@bar.com foo bar @foob.com 'text@example.com, some text'".match(
urlRegex()
),
['bar.com', 'bar.com', 'foob.com', 'example.com']
);
});
Is that test case supposed to be finding: 1) URLs with an Authority component (hereafter referred to as "Authority URLs") that are missing a scheme and have no password? (e.g. normally http://foo@bar.com) or 2) Email address URLs that are missing a scheme? (e.g. normally mailto:foo@bar.com)
I would suggest more detailed documentation to help eliminate ambiguity if this is in fact not a bug. Although if it is not a bug, that brings up the question of whether this module can currently be configured to behave as desired in this issue report, or if that would be an enhancement. Personally I see the value in use cases that separate email address from other types of URLs, and would want an option to explicitly include/exclude them.
Moreover (and perhaps most importantly), in a string like "Please use this Authority URL with username: user@foo.com to access the site", I personally would not even consider that to be a valid Authority URL in any context whatsoever. A string with that syntax is an email address. In the modern world in the limited scenarios where Authority URLs are still used, they should always be be prefixed with a scheme and if they are not, they should not be considered Authority URLs regardless of what the spec (RFCs) say (in a context where email addresses may be present).
Edit: Also I don't see why "bar.com" is the matched substring for "foo@bar.com" either. If "foo@bar.com" is considered a URL, shouldn't the entire thing be matched and not just the part after the "@"? Another way of saying this is that if you think you have found a valid URL within a given string, but that URL is preceded by an "@" sign, then you have either not found a valid URL or you have found only a portion of a URL. There are very few non-whitespace characters that I would say are valid to be directly preceding a URL. Perhaps bracket/quote characters might be acceptable e.g. ( [ { ` ' " <
I guess it (characters that can precede a URL) depends on whether the string being parsed is generally human-readable stuff or if it's code or machine-readable.
I think you're looking for email-regex-safe
instead. Check out https://spamscanner.net and its source/tests for more insight into how I workaround parsing things out.
You could use email-regex-safe
to parse out the emails with blank spaces, e.g. .replace(EMAIL_REGEX, ' ')
and then follow it up with your result.match(URL_REGEX)
If you take this string: "Please send an email to test@test.com, providing ONLY your first name", this module will return "test.com" as a match.
Is this intended behavior and I am just misunderstanding how it's supposed to work?
I see that the "auth" option (if set to true) may result in undesired matches to email addresses, but it defaults to false.
I don't see a test case in the tests for this scenario.(see comment below)What's interesting is that you commented on this exact problem in a relevant url-regex issue:
...
Here are the relevant options that might impact this test, although they are all just their default values so wouldn't need to be explicitly set in a test: exact: false strict: false auth: false