Investigate how publicsuffix.org can be used for the URL regex

waldyrious / hash-my-pass

A bookmarklet to generate unique passwords per website, based on a single master password.

http://waldyrious.github.io/hash-my-pass/bookmarklet.min.html


Investigate how publicsuffix.org can be used for the URL regex #14

Open waldyrious opened 10 years ago

waldyrious commented 10 years ago

The list could be used either to replace the domain identification regex or to test the current implementation against possible edge cases. See https://www.publicsuffix.org/list/
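
As a rough illustration of the "replace the regex" option, here is a minimal sketch of a suffix-list lookup. The `SAMPLE_SUFFIXES` set and the `registrableDomain` helper are hypothetical names invented for this example. The set is a tiny hand-picked sample, not the real list, and the sketch ignores the wildcard (`*.`) and exception (`!`) rules that the actual list contains.

```javascript
// Hypothetical, hand-picked sample of public suffix rules. The real list
// at publicsuffix.org is far longer and also has wildcard and exception
// rules, which this sketch does not handle.
const SAMPLE_SUFFIXES = new Set(["com", "org", "uk", "co.uk", "io", "github.io"]);

// Returns the registrable domain (longest matching suffix plus one label),
// or null when the hostname is itself a suffix or matches no rule.
function registrableDomain(hostname, suffixes = SAMPLE_SUFFIXES) {
  const labels = hostname.toLowerCase().split(".");
  // Dropping i leading labels yields candidates from longest to shortest,
  // so the first hit is the longest matching suffix.
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (suffixes.has(candidate)) {
      return i === 0 ? null : labels.slice(i - 1).join(".");
    }
  }
  return null;
}

console.log(registrableDomain("www.example.co.uk"));    // example.co.uk
console.log(registrableDomain("waldyrious.github.io")); // waldyrious.github.io
```

With this approach, the password would be derived from the registrable domain rather than from whatever a regex happens to capture, so `www.example.co.uk` and `login.example.co.uk` would map to the same site.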

waldyrious commented 9 years ago

Also relevant: https://mathiasbynens.be/demo/url-regex and the only code that passed all the tests: https://gist.github.com/dperini/729294

waldyrious commented 9 years ago

Here's dperini's regex (as of today): /^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/i

waldyrious commented 9 years ago

Simplified version (no IPs, no username:password, only http/https, and the scheme is optional):

/^(?:(?:https?):\/\/)?(?:(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/gim
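
A quick way to try the simplified pattern is sketched below. The `SIMPLE_URL` constant and the `looksLikeWebUrl` helper are hypothetical names for this example. The sketch keeps only the `i` flag, on the assumption that single strings are being validated: with `g`, repeated `.test()` calls on the same regex object resume from `lastIndex` and can give inconsistent results, and `m` only matters for multi-line input.

```javascript
// The simplified pattern from above, with only the `i` flag.
const SIMPLE_URL = /^(?:(?:https?):\/\/)?(?:(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?$/i;

// True when the trimmed input is a hostname with an alphabetic TLD,
// optionally preceded by http:// or https://.
function looksLikeWebUrl(input) {
  return SIMPLE_URL.test(input.trim());
}

const samples = [
  "https://example.com",     // accepted
  "example.co.uk/path?q=1",  // accepted: the scheme is optional
  "http://localhost",        // rejected: no dot-separated TLD
  "https://192.168.0.1",     // rejected: the last label must be letters
  "ftp://example.com",       // rejected: only http and https are allowed
];

for (const sample of samples) {
  console.log(sample, looksLikeWebUrl(sample));
}
```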

waldyrious commented 9 years ago

The above regex (visualization) catches all domain suffixes included in the public suffix list as of today.

waldyrious commented 9 years ago

See https://en.wikipedia.org/wiki/User:Waldir/TLDs

waldyrious commented 8 years ago

This (pseudo-code) regex captures all TLDs in the IANA list:

^[^.]+\.([a-z
\u00C0-\u02AF
\u1E00-\u1EFF
\u0400-\u04FF
\u0370-\u03FF
\u0530-\u058F
\u0900-\u139F
\u3040-\u30FF
\u4E00-\u9FFF
\uAC00-\uD7AF
]{2,}|‏[\u0600-\u06FF\u05D0-\u05EF]{2,}‎)$  // this covers RTL scripts (hebrew/arabic)
                                          // and is flanked by (invisible) direction-switching characters.

This was built using [this list of Unicode ranges per script](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane) as reference.
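
For reference, the pseudo-code pattern could be assembled into a runnable expression roughly as follows. This is only a sketch: `LTR_RANGES`, `RTL_RANGES`, `TLD_PATTERN` and `extractTld` are hypothetical names, and it assumes the direction-switching characters were there only for display, so they are left out of the match.

```javascript
// Unicode ranges copied from the pseudo-code pattern (left-to-right scripts).
const LTR_RANGES = [
  "a-z",
  "\\u00C0-\\u02AF",
  "\\u1E00-\\u1EFF",
  "\\u0400-\\u04FF",
  "\\u0370-\\u03FF",
  "\\u0530-\\u058F",
  "\\u0900-\\u139F",
  "\\u3040-\\u30FF",
  "\\u4E00-\\u9FFF",
  "\\uAC00-\\uD7AF",
].join("");

// Arabic and Hebrew ranges from the right-to-left alternative.
const RTL_RANGES = "\\u0600-\\u06FF\\u05D0-\\u05EF";

// One label, a dot, then a TLD of at least two characters from a single
// alternative. Assumption: the invisible direction marks are not matched.
const TLD_PATTERN = new RegExp(
  `^[^.]+\\.([${LTR_RANGES}]{2,}|[${RTL_RANGES}]{2,})$`
);

// Returns the captured TLD, or null when the input does not match.
function extractTld(domain) {
  const match = TLD_PATTERN.exec(domain.toLowerCase());
  return match ? match[1] : null;
}

console.log(extractTld("example.com"));     // com
console.log(extractTld("www.example.com")); // null
```

Note that, because of the leading `[^.]+`, the pattern only accepts a single label followed by the TLD, so a hostname with subdomains would need to be reduced to its last two labels first.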