Open danielittlewood0 opened 3 years ago
I took a closer look into the source code for extractUrls. I didn't realise that t.co URLs were a special case! In particular, this comment seems to imply that the behaviour I'm describing is done intentionally.
// In the case of t.co URLs, don't allow additional path characters.
if (url.match(validTcoUrl)) {
...
Indeed,
twttr.txt.extractUrls("https://t.co/abc?amp=1")
["https://t.co/abc"]
twttr.txt.extractUrls("https://g.co/abc?amp=1")
["https://g.co/abc?amp=1"]
The issue arose from users copying links from twitter and pasting them onto our platform. The link has an additional ?amp=1
, so the effect is that somebody pastes https://t.co/LY1EMFy7TW?amp=1
to us, and the resulting HTML looks like <a>https://t.co/LY1EMFy7TW</a>?amp=1
.
This is not a very nice solution, but I found that if I set
(ruby)
Twitter::TwitterText::Regex::REGEXEN[:valid_tco_url] = /$^/
(js)
twttr.txt.regexen.validTcoUrl = /$^/;
then I get the behaviour I expect.
Describe the bug In URL extraction, I believe URLs missing "www." and with at least one path segment do not correctly recognise the full path and query string. As far as I can tell only the first path segment is matched, and none of the query parameters.
To Reproduce
JS examples
The following examples were checked in the chrome devtools console, running a Rails stack. It uses the twitter-text-js-rails wrapper gem at v1.9.1, so that claims to be using v1.9.1 of twitter-text. I don't have a convenient environment that can upgrade the JS version.
Does not behave as expected:
Behaves as expected:
Ruby examples
I also reproduced this on the 3.1.0 version of the twitter-text gem, which I think is the latest version.
Expected behavior I think the string argument is in all cases a complete URL, and I would have expected the return value to be the whole string (in an array).
Environment Apart from the details above, I'm running Ubuntu.