mvdan / xurls

Extract urls from text
BSD 3-Clause "New" or "Revised" License

xurls does not recognize valid IRIs #58

Closed: gibson042 closed this issue 2 years ago

gibson042 commented 2 years ago

Originally reported at https://github.com/keybase/client/issues/22453 as a failure of the Keybase client to linkify https://en.wikipedia.org/wiki/Dunning–Kruger_effect.

The issue seems to stem from pathCont being too narrowly defined; it does not include the full range specified in RFC 3987:

   ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"

   …

   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD
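
For illustration, the ucschar production above can be transcribed into a Go range table. This is only a sketch: the names `ucschar` and `isUcschar` are hypothetical and do not exist in xurls, and it covers ucschar only, not the rest of ipchar.

```go
package main

import (
	"fmt"
	"unicode"
)

// ucschar transcribes the RFC 3987 ucschar ranges quoted above.
// Hypothetical name; this table is not part of xurls.
var ucschar = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x00A0, Hi: 0xD7FF, Stride: 1},
		{Lo: 0xF900, Hi: 0xFDCF, Stride: 1},
		{Lo: 0xFDF0, Hi: 0xFFEF, Stride: 1},
	},
	R32: []unicode.Range32{
		{Lo: 0x10000, Hi: 0x1FFFD, Stride: 1},
		{Lo: 0x20000, Hi: 0x2FFFD, Stride: 1},
		{Lo: 0x30000, Hi: 0x3FFFD, Stride: 1},
		{Lo: 0x40000, Hi: 0x4FFFD, Stride: 1},
		{Lo: 0x50000, Hi: 0x5FFFD, Stride: 1},
		{Lo: 0x60000, Hi: 0x6FFFD, Stride: 1},
		{Lo: 0x70000, Hi: 0x7FFFD, Stride: 1},
		{Lo: 0x80000, Hi: 0x8FFFD, Stride: 1},
		{Lo: 0x90000, Hi: 0x9FFFD, Stride: 1},
		{Lo: 0xA0000, Hi: 0xAFFFD, Stride: 1},
		{Lo: 0xB0000, Hi: 0xBFFFD, Stride: 1},
		{Lo: 0xC0000, Hi: 0xCFFFD, Stride: 1},
		{Lo: 0xD0000, Hi: 0xDFFFD, Stride: 1},
		{Lo: 0xE1000, Hi: 0xEFFFD, Stride: 1},
	},
}

// isUcschar reports whether r falls in one of the ucschar ranges.
func isUcschar(r rune) bool {
	return unicode.Is(ucschar, r)
}

func main() {
	fmt.Println(isUcschar('\u2013')) // EN DASH: true
	fmt.Println(isUcschar('a'))      // ASCII letter, covered by ALPHA instead: false
}
```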

"https://en.wikipedia.org/wiki/Dunning–Kruger_effect" contains U+2013 EN DASH, which is in the %xA0-D7FF range of ucschar but has a General_Category of Dash_Punctuation (Pd), a category that is erroneously not included in xurls.go's midChar/endChar/etc.

mvdan commented 2 years ago

Thanks for reporting, and for the detailed investigation! Would you like to send a PR?