URL extraction does not capture entire path and query if "www" is omitted.

danielittlewood0 commented 3 years ago

Describe the bug In URL extraction, I believe URLs missing "www." and with at least one path segment do not correctly recognise the full path and query string. As far as I can tell only the first path segment is matched, and none of the query parameters.

To Reproduce

JS examples

The following examples were checked in the chrome devtools console, running a Rails stack. It uses the twitter-text-js-rails wrapper gem at v1.9.1, so that claims to be using v1.9.1 of twitter-text. I don't have a convenient environment that can upgrade the JS version.

Does not behave as expected:

twttr.txt.extractUrls("https://t.co/a?amp=1")
> ["https://t.co/a"]
twttr.txt.extractUrls("https://t.co/a/b?amp=1")
> ["https://t.co/a"]
twttr.txt.extractUrls("https://t.co/a/b")
> ["https://t.co/a"]

Behaves as expected:

twttr.txt.extractUrls("https://www.t.co?amp=1")
> ["https://www.t.co?amp=1"]
twttr.txt.extractUrls("https://www.t.co/a/b?amp=1")
> ["https://www.t.co/a/b?amp=1"]
twttr.txt.extractUrls("https://www.t.co/a/b")
> ["https://www.t.co/a/b"]
twttr.txt.extractUrls("https://t.co?amp=1")
> ["https://t.co?amp=1"]

Ruby examples

I also reproduced this on the 3.1.0 version of the twitter-text gem, which I think is the latest version.

Does not behave as expected:

irb(main):002:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a?amp=1")
=> ["https://t.co/a"]
irb(main):004:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b?amp=1")
=> ["https://t.co/a"]
irb(main):005:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b")
=> ["https://t.co/a"]

Expected behavior I think the string argument is in all cases a complete URL, and I would have expected the return value to be the whole string (in an array).

Environment Apart from the details above, I'm running Ubuntu.

danielittlewood0 commented 3 years ago

I took a closer look into the source code for extractUrls. I didn't realise that t.co URLs were a special case! In particular, this comment seems to imply that the behaviour I'm describing is done intentionally.

      // In the case of t.co URLs, don't allow additional path characters.
      if (url.match(validTcoUrl)) {
      ...

Indeed,

twttr.txt.extractUrls("https://t.co/abc?amp=1")
["https://t.co/abc"]
twttr.txt.extractUrls("https://g.co/abc?amp=1")
["https://g.co/abc?amp=1"]

The issue arose from users copying links from twitter and pasting them onto our platform. The link has an additional ?amp=1, so the effect is that somebody pastes https://t.co/LY1EMFy7TW?amp=1 to us, and the resulting HTML looks like <a>https://t.co/LY1EMFy7TW</a>?amp=1.

danielittlewood0 commented 3 years ago

This is not a very nice solution, but I found that if I set

(ruby)

Twitter::TwitterText::Regex::REGEXEN[:valid_tco_url] = /$^/

(js)

twttr.txt.regexen.validTcoUrl = /$^/;

then I get the behaviour I expect.

twitter / twitter-text

URL extraction does not capture entire path and query if "www" is omitted. #346