Open waldyrious opened 10 years ago
Also relevant: https://mathiasbynens.be/demo/url-regex and the only code that passed all the tests: https://gist.github.com/dperini/729294
Here's dperini's regex (as of today): /^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/i
Simplified version (no IPs, no username:password, only http/https):
/^(?:(?:https?):\/\/)?(?:(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/gim
The above regex (visualization) catches all domain suffixes included in the public suffix list as of today.
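A minimal usage sketch for the simplified pattern, assuming it is used to validate one string at a time. The `g` and `m` flags are dropped here on purpose: a global regex keeps `lastIndex` between `.test()` calls, so reusing the same object can give alternating results. The helper name `looksLikeUrl` is hypothetical.

```javascript
// Simplified pattern from above, with only the `i` flag kept.
// With `g`, RegExp.prototype.test() would be stateful via `lastIndex`.
const simpleUrl = /^(?:(?:https?):\/\/)?(?:(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/i;

// Hypothetical helper: true when the whole string matches the pattern.
function looksLikeUrl(candidate) {
  return simpleUrl.test(candidate);
}

console.log(looksLikeUrl("https://example.com"));   // true
console.log(looksLikeUrl("example.com/path?x=1"));  // true (scheme is optional)
console.log(looksLikeUrl("http://localhost"));      // false (no TLD)
console.log(looksLikeUrl("http://192.168.0.1"));    // false (IPs are not handled)
```

Note that the scheme group is optional in this version, so bare domains such as `example.com` are accepted as well.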
This (pseudo-code) regex captures all TLDs in the IANA list:
^[^.]+\.([a-z
\u00C0-\u02AF
\u1E00-\u1EFF
\u0400-\u04FF
\u0370-\u03FF
\u0530-\u058F
\u0900-\u139F
\u3040-\u30FF
\u4E00-\u9FFF
\uAC00-\uD7AF
]{2,}|[\u0600-\u06FF\u05D0-\u05EF]{2,})$ // the second alternative covers RTL scripts (Hebrew/Arabic),
// which are flanked by (invisible) direction-switching characters.
This was built using [this list of Unicode ranges per script](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane) as reference.
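One way the pseudo-code could be turned into a runnable JavaScript regex is to keep the ranges in arrays and assemble the pattern with `new RegExp`. This is only a sketch: the names `ltrRanges`, `rtlRanges`, `tldPattern` and `hasKnownScriptTld` are made up for illustration, and it assumes the input is already lowercased and stripped of any direction-switching characters.

```javascript
// Left-to-right script ranges, copied from the pseudo-code regex above.
const ltrRanges = [
  "a-z",
  "\\u00C0-\\u02AF",
  "\\u1E00-\\u1EFF",
  "\\u0400-\\u04FF",
  "\\u0370-\\u03FF",
  "\\u0530-\\u058F",
  "\\u0900-\\u139F",
  "\\u3040-\\u30FF",
  "\\u4E00-\\u9FFF",
  "\\uAC00-\\uD7AF",
];
// Right-to-left script ranges, kept as a separate alternative.
const rtlRanges = ["\\u0600-\\u06FF", "\\u05D0-\\u05EF"];

// One dot-free label, a dot, then a TLD of at least two characters
// drawn entirely from the LTR ranges or entirely from the RTL ranges.
const tldPattern = new RegExp(
  "^[^.]+\\.([" + ltrRanges.join("") + "]{2,}|[" + rtlRanges.join("") + "]{2,})$"
);

// Hypothetical helper. Assumption: input is lowercased and carries no
// invisible direction marks around an RTL TLD.
function hasKnownScriptTld(domain) {
  return tldPattern.test(domain);
}

console.log(hasKnownScriptTld("example.com"));            // true
console.log(hasKnownScriptTld("example.\u0440\u0444"));   // true (Cyrillic)
console.log(hasKnownScriptTld("example.123"));            // false (digits only)
```

As written, the pattern allows exactly one label before the TLD, so `www.example.com` does not match.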
The public suffix list could be used either to replace the domain identification regex or to test the current implementation against possible edge cases. See https://www.publicsuffix.org/list/
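A rough sketch of the testing idea, assuming the list is available as plain text in the public suffix list format (one rule per line, `//` comments, `*.` wildcards, `!` exceptions). The sample entries, the `example.` placeholder label, and the function names `parseSuffixes` and `findUncovered` are hypothetical; `xn--p1ai` is a Punycode-style entry added here only to show what a miss looks like.

```javascript
// Hypothetical excerpt in the list's format, not the real list.
const sampleList = [
  "// comment lines start with two slashes",
  "com",
  "co.uk",
  "*.ck",
  "!www.ck",
  "",
  "xn--p1ai",
].join("\n");

// Extract plain suffixes: skip blanks and comments, then strip the
// "!" exception marker and the "*." wildcard prefix.
function parseSuffixes(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim().split(/\s+/)[0])
    .filter((rule) => rule && !rule.startsWith("//"))
    .map((rule) => rule.replace(/^!/, "").replace(/^\*\./, ""));
}

// Return the suffixes the regex rejects when prefixed with a placeholder
// label. The regex must not use the `g` flag, or .test() becomes stateful.
function findUncovered(suffixes, domainRegex) {
  return suffixes.filter((suffix) => !domainRegex.test("example." + suffix));
}

// Simplified pattern from this thread, without the `g` and `m` flags.
const domainRegex = /^(?:(?:https?):\/\/)?(?:(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/i;

console.log(findUncovered(parseSuffixes(sampleList), domainRegex));
// [ 'xn--p1ai' ]
```

Running the same check over the full list would give a concrete set of edge cases for whichever domain regex is under test.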