twitter / twitter-text

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.
https://developer.twitter.com/en/docs/counting-characters
Apache License 2.0
3.07k stars 510 forks source link

URL extraction does not capture entire path and query if "www" is omitted. #346

Open danielittlewood0 opened 3 years ago

danielittlewood0 commented 3 years ago

Describe the bug In URL extraction, I believe URLs missing "www." and with at least one path segment do not correctly recognise the full path and query string. As far as I can tell only the first path segment is matched, and none of the query parameters.

To Reproduce

JS examples

The following examples were checked in the chrome devtools console, running a Rails stack. It uses the twitter-text-js-rails wrapper gem at v1.9.1, so that claims to be using v1.9.1 of twitter-text. I don't have a convenient environment that can upgrade the JS version.

I also reproduced this on the 3.1.0 version of the twitter-text gem, which I think is the latest version.

irb(main):002:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a?amp=1")
=> ["https://t.co/a"]
irb(main):004:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b?amp=1")
=> ["https://t.co/a"]
irb(main):005:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b")
=> ["https://t.co/a"]

Expected behavior I think the string argument is in all cases a complete URL, and I would have expected the return value to be the whole string (in an array).

Environment Apart from the details above, I'm running Ubuntu.

danielittlewood0 commented 3 years ago

I took a closer look into the source code for extractUrls. I didn't realise that t.co URLs were a special case! In particular, this comment seems to imply that the behaviour I'm describing is done intentionally.

      // In the case of t.co URLs, don't allow additional path characters.
      if (url.match(validTcoUrl)) {
      ...

Indeed,

twttr.txt.extractUrls("https://t.co/abc?amp=1")
["https://t.co/abc"]
twttr.txt.extractUrls("https://g.co/abc?amp=1")
["https://g.co/abc?amp=1"]

The issue arose from users copying links from twitter and pasting them onto our platform. The link has an additional ?amp=1, so the effect is that somebody pastes https://t.co/LY1EMFy7TW?amp=1 to us, and the resulting HTML looks like <a>https://t.co/LY1EMFy7TW</a>?amp=1.

danielittlewood0 commented 3 years ago

This is not a very nice solution, but I found that if I set

(ruby)

Twitter::TwitterText::Regex::REGEXEN[:valid_tco_url] = /$^/

(js)

twttr.txt.regexen.validTcoUrl = /$^/;

then I get the behaviour I expect.