Bug: Issues with No Bare URLs rule

vburzynski commented 5 months ago

[x] I have verified that I am on the latest version of the Linter

Describe the Bug

The "No bare URLs" rule incorrectly parses some URLs.

Some examples below are borrowed from https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid/
Other examples are borrowed from https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
The "No bare URIs" version of this rule also misses URI schemes that do not require an authority component, such as mailto:joe@example.org

How to Reproduce

Steps to reproduce the behavior:

Example input that reproduces the issue:

https://web.archive.org/web/20240402173118/https://www.apple.com/
https://en.wikipedia.org/wiki/Möbius_strip
https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en
https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top
http://[2001:db8:85a3::8a2e:370:7334]/foo/bar
http://127.0.0.1/index.html
ldap://[2001:db8::7]/c=GB?objectClass?one
news:comp.infosystems.www.servers.unix
mailto:John.Doe@example.com

Result...

<https://web.archive.org/web/20240402173118/https>:<//www.apple.com/>
<https://en.wikipedia.org/wiki/M>öbius_strip
<https://zh.wikipedia.org/wiki/Wikipedia>:关于中文维基百科/en
<https://john.doe>@<www.example.com:123/forum/questions/?tag=networking&order=newest#top>
http://[2001:db8:85a3::8a2e:370:7334]/foo/bar
ldap://[2001:db8::7]/c=GB?objectClass?one
news:comp.infosystems.<www.servers.unix>
mailto:John.Doe@example.com

I run into issues with links to archive.org a lot; this is the main reason for opening this ticket.

<https://web.archive.org/web/20240402173118/https>:<//www.apple.com/>

Wikipedia examples get characters cut off

<https://en.wikipedia.org/wiki/M>öbius_strip
<https://zh.wikipedia.org/wiki/Wikipedia>:关于中文维基百科/en

URL with a user encoded into it is detected as two links

<https://john.doe>@<www.example.com:123/forum/questions/?tag=networking&order=newest#top>

IPv6 example is missed

http://[2001:db8:85a3::8a2e:370:7334]/foo/bar

Parts of a URI can be incorrectly detected as a URL

news:comp.infosystems.<www.servers.unix>

Expected Behavior


Expected output if applicable:

<https://web.archive.org/web/20240402173118/https://www.apple.com/>
<https://en.wikipedia.org/wiki/Möbius_strip>
<https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en>
<http://[2001:db8:85a3::8a2e:370:7334]/foo/bar>
<https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top>
<ldap://[2001:db8::7]/c=GB?objectClass?one>
<news:comp.infosystems.www.servers.unix>
<mailto:John.Doe@example.com>

Device

[x] Desktop
[ ] Mobile

pjkaufman commented 5 months ago

Hey @vburzynski. I am not very familiar with many of the formats provided: I know a little about mailto (though it may not pass regular URI parsing), and I know almost nothing about the others. Handling them may require improving the regex logic or the parser.

pjkaufman commented 5 months ago

The URI and URL syntax is very large, so I would caution against expecting the rule to work with every kind of URL. Most libraries do not support all URL types and URI schemes, and almost anything passes as a URI or URL, which makes them a real pain to deal with.

That being said, I am open to suggestions on how to improve the identification of URIs and URLs in a file. If there is a parser that can be reused, or a spec that makes clear how to manually parse a URI/URL, that would be very helpful for addressing several of these scenarios.

pjkaufman commented 5 months ago

These are dev notes on my understanding of why things are happening the way they are (feel free to give feedback and ideas on how to handle these):

I run into issues with links to archive.org a lot; this is the main reason for opening this ticket.
<https://web.archive.org/web/20240402173118/https>:<//www.apple.com/>

This is caused by one URL being nested inside another. The likely fix is to switch to a parsing library that already handles this scenario. Getting the regex to handle it is unlikely, since that could cause significant performance issues.

Wikipedia examples get characters cut off

<https://en.wikipedia.org/wiki/M>öbius_strip
<https://zh.wikipedia.org/wiki/Wikipedia>:关于中文维基百科/en

The problem here seems to be the use of invalid URL characters. Browsers have come to accept and interpret them, but according to the provided link on invalid URL characters, these would be invalid (except the colon, which only causes problems when it appears in the URL somewhere other than as part of the scheme).

Two potential fixes for this:

Add valid letter unicode to the regex for valid characters
Add colon to be valid anywhere in the URL (?)
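
A minimal sketch of both fixes, assuming a JavaScript regex with the `u` flag so that `\p{L}` (letters), `\p{M}` (combining marks), and `\p{N}` (numbers) are available. `URL_CHARS`, `bareUrl`, and `wrapBareUrls` are hypothetical names, and this is not the Linter's actual pattern. Because the colon is allowed after the scheme, the nested archive.org URL is also kept whole.

```typescript
// Hypothetical character class: Unicode letters, marks, and numbers plus the
// RFC 3986 unreserved, reserved, and percent characters (colon included).
const URL_CHARS = "[\\p{L}\\p{M}\\p{N}\\-._~:/?#\\[\\]@!$&'()*+,;=%]";

// The "u" flag is required for \p{...}; "g" lets replace() visit every match.
const bareUrl = new RegExp("https?://" + URL_CHARS + "+", "gu");

// Wrap every http(s) URL found in the text in angle brackets.
function wrapBareUrls(text: string): string {
  return text.replace(bareUrl, (match) => "<" + match + ">");
}
```

The sketch does not skip URLs that are already wrapped, and it does not strip trailing sentence punctuation, so a real rule would need to handle both.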

URL with a user encoded into it is detected as two links
<https://john.doe>@<www.example.com:123/forum/questions/?tag=networking&order=newest#top>

I would need a rundown on exactly what is valid around this since I know almost nothing about this except that it is a method to get people onto malicious websites.
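
As a short rundown, RFC 3986 defines the authority as `[ userinfo "@" ] host [ ":" port ]`. It runs from the `//` after the scheme up to the next `/`, `?`, `#`, or the end of the URI. The sketch below uses the component-splitting regex from Appendix B of RFC 3986. `splitAuthority` and the `Authority` interface are hypothetical names, and the userinfo and port handling is a simplification, not a full validator.

```typescript
// Component-splitting regex from RFC 3986, Appendix B.
// Group 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

interface Authority {
  userinfo: string | null;
  host: string;
  port: string | null;
}

function splitAuthority(uri: string): Authority | null {
  const parts = URI_PARTS.exec(uri);
  if (parts === null || parts[4] === undefined) {
    return null; // no "//authority" part, as in mailto: or news: URIs
  }
  const authority = parts[4];
  // Everything before the last "@" inside the authority is the userinfo.
  const at = authority.lastIndexOf("@");
  const userinfo = at >= 0 ? authority.slice(0, at) : null;
  const hostPort = authority.slice(at + 1);
  // A port is a trailing ":digits"; a bracketed IPv6 host ends in "]" instead.
  const portMatch = /:(\d*)$/.exec(hostPort);
  if (portMatch === null) {
    return { userinfo, host: hostPort, port: null };
  }
  return { userinfo, host: hostPort.slice(0, portMatch.index), port: portMatch[1] };
}
```

For the example above, this yields the userinfo `john.doe`, the host `www.example.com`, and the port `123`, so the whole string is one URL rather than two.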

IPv6 example is missed
http://[2001:db8:85a3::8a2e:370:7334]/foo/bar

I know virtually nothing about IPv6, so I would need a spec for this to make sure it gets handled properly.
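
The relevant spec is RFC 3986, section 3.2.2: a host is a registered name, an IPv4 address, or an IP literal, and an IP literal is an IPv6 (or IPvFuture) address enclosed in square brackets. A deliberately loose sketch follows. It only checks that the brackets contain hex digits, colons, and dots, not that the address is well formed. `HOST`, `urlWithHost`, and `findUrls` are hypothetical names.

```typescript
// Loose host pattern: a bracketed IP literal, or a dotted name / IPv4 address.
const HOST = "(?:\\[[0-9A-Fa-f:.]+\\]|[A-Za-z0-9.-]+)";

// scheme "://" host, optional ":port", optional path up to whitespace or < >.
const urlWithHost = new RegExp(
  "[a-z][a-z0-9+.-]*://" + HOST + "(?::\\d+)?(?:/[^\\s<>]*)?",
  "gi"
);

// Return every URL-like match in the text, or an empty array when none match.
function findUrls(text: string): string[] {
  return text.match(urlWithHost) ?? [];
}
```

With this host pattern, both `http://[2001:db8:85a3::8a2e:370:7334]/foo/bar` and `ldap://[2001:db8::7]/c=GB?objectClass?one` are matched in full.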

parts of a URI can be incorrectly detected as a URL
news:comp.infosystems.<www.servers.unix>

I have no clue where this URI is coming from, but the issue is likely that part of it is a valid URL, so the URL regex matches that part as a URL, which prevents any URI detection.

Note: it would be nice to say that a potential URL match that comes right after a period cannot be a URL, but that only holds for English and other languages that put whitespace between words. CJK and similar languages use little or no whitespace, so I cannot assume anything about what precedes a match.

The only fix I can think of for this is a full-blown URL parser, but I did not find one before, which is why I settled on regex.
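
Short of a full parser, one possible mitigation is to scan for scheme-prefixed URIs and bare `www.` URLs in a single pass, so that the leftmost match wins and `news:` consumes the whole token before `www.` can match inside it. This is only a sketch: `SCHEMES` is a hypothetical allow-list, and `BODY`, `uriOrUrl`, and `wrapUris` are made-up names, not the Linter's code.

```typescript
// Hypothetical allow-list of schemes; a real rule would need a longer list.
const SCHEMES = ["https", "http", "ldap", "mailto", "news"];

// A match runs until whitespace or an angle bracket.
const BODY = "[^\\s<>]+";

// One alternation, one pass: the regex engine reports the leftmost match,
// so a scheme-prefixed URI is consumed whole before "www." is considered.
const uriOrUrl = new RegExp(
  "(?:" + SCHEMES.join("|") + "):" + BODY + "|www\\." + BODY,
  "g"
);

function wrapUris(text: string): string {
  return text.replace(uriOrUrl, (match) => "<" + match + ">");
}
```

The trade-off is the one noted above: because a match only stops at whitespace or an angle bracket, text that follows a URI without a space, as is common in CJK writing, would be swallowed into the match.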

Martin-Milbradt commented 4 months ago

URL with a user encoded into it is detected as two links

https://john.doe@<www.example.com:123/forum/questions/?tag=networking&order=newest#top>

I would need a rundown on exactly what is valid around this since I know almost nothing about this except that it is a method to get people onto malicious websites.

Mainstream example: medium.com uses @ in URLs (Linter also breaks this): https://medium.com/@robertwiblin

pjkaufman commented 4 months ago

URL with a user encoded into it is detected as two links

https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top

I would need a rundown on exactly what is valid around this since I know almost nothing about this except that it is a method to get people onto malicious websites.

Mainstream example: medium.com uses @ in URLs (Linter also breaks this): https://medium.com/@robertwiblin

Hey @Martin-Milbradt. Thanks for pointing out this other scenario that has a problem. I think it is a little out of context with what is quoted, since the issue referenced is different: one refers to a user login in the URL, and the other is just an @ in the path of a URL. They will likely have to be addressed separately.
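
The difference between the two cases can be checked with the WHATWG `URL` class, which is a global in browsers and Node.js. This is a sketch under that assumption, and `describeAt` is a hypothetical helper rather than anything in the Linter.

```typescript
// Report where the "@" in a URL sits: in the userinfo, in the path, or nowhere.
function describeAt(raw: string): string {
  const url = new URL(raw); // throws a TypeError when raw cannot be parsed
  if (url.username !== "") {
    return "userinfo: " + url.username;
  }
  if (url.pathname.indexOf("@") >= 0) {
    return "path: " + url.pathname;
  }
  return "none";
}
```

For the login example the `@` ends up in `username` (`john.doe`), while for the medium.com example `username` is empty and the `@` stays in `pathname` (`/@robertwiblin`).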

platers / obsidian-linter