microsoft / terminal

The new Windows Terminal and the original Windows console host, all in the same place!
MIT License
94.73k stars 8.2k forks source link

Evaluate different regular expressions we can use for URL detection #8321

Open zakius opened 3 years ago

zakius commented 3 years ago

Environment

Windows build number: 10.0.20180.1000
Windows Terminal version (if applicable): 1.5.3142.0

Any other software? experienced in weechat running in screen on remote server connected by mosh running in WSL2 Ubuntu, though it shouldn't matter

Steps to reproduce

  1. Encounter https://www.x-kom.pl/g-5/c/158-zasilacze-do-komputera.html?f[1775][52858]=1 string
  2. Hold ctrl key while pointing it

Another example would be https://en.wikipedia.org/wiki/Tibia_(video_game)

Expected behavior

Whole link should be detected, underlined and followed

Actual behavior

https://www.x-kom.pl/g-5/c/158-zasilacze-do-komputera.html?f is detected, underlined and followed

for the second example https://en.wikipedia.org/wiki/Tibia gets detected

DHowett commented 3 years ago

Ah, this is likely because of the regular expression we've used to detect URLs.

zakius commented 3 years ago

I already mentioned in feature request that it's really complex (if not impossible) issue to do this right every time, but I hope it can be made better.

Do you mind if I add other links that are not detected properly? for now I'll post it here in comment but will move to the issue later if that's fine

https://en.wikipedia.org/wiki/Tibia_(video_game) is recognized as https://en.wikipedia.org/wiki/Tibia and this one is especially tricky as most existing link detectors recognize it as https://en.wikipedia.org/wiki/Tibia_(video_game

More detailed discussion about parentheses, braces and so on happened under the feature request if I remember correctly

DHowett commented 3 years ago

You can definitely put more mis-detected URLs here. Good finds. :smile:

DHowett commented 3 years ago

I'm going to rename this to "evaluate different regexes we can use for URL detection" :smile:

zakius commented 3 years ago

Changing the expression to (\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|$!:,.;()\[\]]*) would match these but I'm fairly sure there will be cases when it's too greedy

Overall I don't think there's a 100% bulletproof solution, there's always a chance of a typo after all, but I see at least a room for improvement in case of balanced parentheses in cases like (here's the link you asked for: https://example.com) the closing parenthesis shouldn't be matched more often than it should and that may be the point to stop trying to make it perfect. There's also the temptation to match closing parenthesis only if it's balanced withing the link itself, though there is https://en.wikipedia.org/wiki/Emoticon_;), but once again: this can't be made perfect, so this approach may be more fitting and likely more performant, and we avoid the issue of parentheses being recognized as unbalanced when the opening one is out of buffer (think ([very long text here] here's the link you asked for: https://example.com))

zadjii-msft commented 3 years ago

Another case from #8907:

https://⌘v.ircd.xyz/o.gif

I'm not sure if our pattern is matching unicode characters super well (or if it even can)

kishore-kal commented 3 years ago

Noticed https://duckduckgo.com/?q=[github]&t=ffab&ia=web is encoded to https://duckduckgo.com/?q=%5Bgithub%5D&t=ffab&ia=web on paste but https://⌘v.ircd.xyz/o.gif is not encoded as https://xn--v-cqo.ircd.xyz/o.gif as it does on a browser.

zakius commented 3 years ago

coming back with another case:

https://pl.wikipedia.org/wiki/Izolator_Boguchwała apparently can appear in text and will be handled properly by browsers but is parsed as https://pl.wikipedia.org/wiki/Izolator_Boguchwa by WT

(\b(https?|ftp|file)://[-\p{L}\p{M}*+0-9+&@#/%?=~_|$!:,.;()\[\]]*) would work in Golang, but once again, it should probably be modified for C++ and evaluated against existing test cases

orcmid commented 3 years ago

The TLDR:

There can be problems with this feature involving who is doing what and when. The underlining is essentially a hint from Terminal. The hinting can't be that an Internet Resource Identifier (URI, including URL) is intended, only that there is a resemblance that is carefully touched. It is important not to have this feature be over-presumptive or to do too much work in that respect absent any act on the part of an user.

SOME CONTEXT (longer)

There are a couple of things to be dealt with here. First, URI's are defined entirely using ISO-646 (ASCII) 7-bit codes. These days, that means Unicode page U+0000 to U+007F, not all of which are usable. The generic term is URI, which subsumes URL, and I will use that now,

The idea was that URIs could be communicated carefully in email and text files. A text video case also applies. The specific rules currently in effect (with some supplements) are found in RFC3986.

There are interfaces, these days, that hide the %-encoding by which other codes (such as UTF8 octets greater than 0x7f) and also the ones that are used in URI structure can be passed through such that the URI parsing remains correct. It is not exactly appropriate for Terminal to deal with that as some sort of automatic presumption. It needs to show the text that it is presented. Whether it does %-encoding and other conversions when the CTRL-Click is done is probably appropriate. But not before. This depends on how "press Control-Click to follow ... " is implemented.

There are scheme issues also. http: and https: are well-known URI schemes. When file: is used as a scheme we don't exactly have the http: scheme rules anymore. Ditto mailto: and others, urn:, etc.

These are all reasons for Terminal to do underlining carefully and essentially as a hint. There is not context to determine whether a specific usable URI is intended or not For example, It is not going to figure out, when a compiler error stream is sent through it, how those filename and line-number diagnostics be links into the app that produced the stream, so an IDE can provide views of the files the compiler execution processed (not to mention with regard to pre-processing as well).

In any case, there are more layers here than tend to be observable and it is easy to confuse what some application accepts as a "URI/URL" and what it actually does in conforming to the applicable URI specification. So what has been provided as an user-ciovenience becomes an abstraction pitfall when talking about requirements on a service such as Terminal.

There is some useful information about the technical issues starting in section 2.3 at RFC3305. (Notice how this is clickable or not depending on how this comment is being viewed. RFC7595 indicates some things about generic cases and even registration of additional scheme.

There are, as I said, ways to ensure that not too much of the text is taken to be the URI, and that is by using surround space (spaces not allowed in the URI directly) or using something like parentheses, ( ... ) to bracket the URI, the latter being how that is exactly handled in Markdown, including the raw Markdown of this comment.

Yay295 commented 3 years ago

When file: is used as a scheme we don't exactly have the http: scheme rules anymore.

Especially on Windows, ironically. Because Windows uses \ as its path separator, it's not currently recognized fully. For example with file:///C:\Users\Yay295 only file:///C is underlined. On the other hand if I change the slashes, the entirety of file:///C:/Users/Yay295 is recognized.

Spaces also break the URI detection. This would would be harder to deal with, but I don't know if it would be possible for file: URI's at least to check if the path exists first and if it doesn't, see if matching more text finds a path? I imagine that could be difficult with network paths though, and it's definitely more than just using a regex.

mailinglists35 commented 1 year ago

can an url contain whitespace? if not, then everything from http to first whitespace should be treated as url. or, until a more sophisticated solution is tuned to catch every case, there should be an option in settings: "[x] treat everything from http to first whitespace as url, I accept this is a risk and not all strings might be a correct url."

zakius commented 1 year ago

that may be a reasonable idea actually, fast and basically covers everything that can be covered with more complex options, even if it can sometimes get a bit too eager that should be fine

timblaktu commented 1 year ago

URL detection regex should consider the protocol prefix (http://, https://) optional, and should default to prepending https:// when protocol is missing, or perhaps allowing default protocol to be configured. A lot of terraform/terragrunt code uses custom module source URL syntax that strips off the proto, e.g.:

source = "github.com/lgallard//terraform-aws-secrets-manager?ref=0.6.1"

..and it'd be nice to be able to click this to read the docs on the module.

As described here the above module source URL syntax is a "convenient alias" for the general Git repository address scheme, which, if used, would yield a clickable URL in MS Terminal:

source = "git::https://github.com/lgallard//terraform-aws-secrets-manager?ref=0.6.1"

I just wanted to mention the original problem URL syntax since the users of terraform code often don't own the code, and aren't in control of the syntax they use.

oakkitten commented 10 months ago

Would be great if the regex was configurable.

In an URL detection regex that I've been using on IRC for years, I do:

This works well enough when the URL is inside quotes, parentheses, and also handles the Wikipedia URLs that has quotes within them

floppytree commented 6 months ago

I'm trying to get clickable links to work with the file: scheme and umlauts (non-ASCII) characters in the file name. Unfortunately, as explained above, it doesn't work as hoped for. Two different solutions would solve my problem:

  1. Make the Regex configurable or at least offer regex'es that are not as limited as the current one. Currently, the URL file://C:/ümlaut.txt is only used up to the last /, which is not what I need.
  2. Alternatively, support percent-encoded UTF-8 file names. WT currently opens exactly those file names as are on the screen. So if I encode the URL as file://C:/tmp/%C3%BCmlaut.txt, then the regex is happy, but WT tries to open C:\tmp\%C3%BCmlaut.txt (with the percentages intact) and that obviously doesn't work.
pa-0 commented 2 months ago

Not sure if this warrants its own issue, but it appears to be directly related to this issue:

image

Note in the image that the hyperlink is surrounded by quotation marks on the host screen, but clicking the link results in the browser opening a url formatted like so <url>" incorrectly parsing one of the quotation marks surrounding the URL as a part of it.