GildedHonour opened this issue 7 years ago (status: Open)
In the test:

```ruby
invalid_urls = [
  'google.com',
  'http://google',
  'http://google.looooongltd'
]
```
However, nowadays there are quite long TLDs such as "network", "solutions", "download", and "accountant". The last one is 10 characters long.
A TLD may be up to 63 characters long: https://en.wikipedia.org/wiki/Domain_Name_System#cite_ref-rfc1034_1-2
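A length check along these lines would accept all of the new long TLDs while still enforcing the 63-character label limit. This is a sketch only; `TLD_REGEXP` is a hypothetical name, and the 2-character minimum is an assumption, not something the gem currently does:

```ruby
# Sketch only: a TLD label-length check, assuming RFC 1034's 63-octet
# label limit and a 2-character minimum.
TLD_REGEXP = /\.[a-z]{2,63}\z/i

# All of the long TLDs mentioned above pass:
%w[network solutions download accountant].all? do |tld|
  "example.#{tld}".match?(TLD_REGEXP)
end
# => true

# A 64-character label exceeds the limit:
("example." + "a" * 64).match?(TLD_REGEXP) # => false
```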
https://github.com/riboseinc/url_validator/blob/master/lib/validators/url_validator.rb#L11
"ssh" should be added.
@GildedHonour agree with both changes.
I believe this regular expression was derived from RFC 3986 Appendix B which defines URIs:
```
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
```
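For reference, the Appendix B regex works as-is in Ruby; per the RFC, groups 2, 4, 5, 7, and 9 are the scheme, authority, path, query, and fragment. The sample string below is the one the RFC itself uses:

```ruby
# RFC 3986 Appendix B, transcribed verbatim.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
URI_REFERENCE = %r{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}

m = URI_REFERENCE.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")
m[2] # => "http"
m[4] # => "www.ics.uci.edu"
m[5] # => "/pub/ietf/uri/"
m[9] # => "Related"
```

Note that this regex matches essentially any string (every group is optional), so on its own it splits a URI into components rather than validating one.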
We could also switch to ABNF parsing which is more correct using RFC 3986 Appendix A.
Perhaps you can help with correcting this?
@ronaldtse yup.
@ronaldtse should the gem support IRIs (https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier, https://tools.ietf.org/html/rfc3987)?
Good one @GildedHonour . We should definitely support IDNs via punycode conversion. Supporting the IRI ABNF syntax (which is an expansion of character space, with a restriction on private unicode only possible in the query element) is also straightforward.
However, we probably don't need to support the "Bidirectional IRIs" for RTL languages, at least for now?
There are some Ruby-based ABNF parsers available so maybe we can leverage them too.
I've found a regex for validating IRIs; it's quite long. It might already support "Bidirectional IRIs" and Unicode, I'll find out.
I've checked out some Ruby ABNF gems; they're pretty outdated, with the latest commits being 2 and 5 years old. We might not need them and could use a regexp instead. Yes, it's pretty complex and hard to maintain, but we'll rarely need to change it or understand what it does. Although it'll be harder to let a user specify a list of schemes they want to validate, or individual parts of an IRI, and incorporate that into the regexp.
IRI is supported by Addressable gem, see #64. Actually, any reason to match strings against regular expression at all? Isn't enough that given string can be parsed without error by URI parser, either from standard library or Addressable?
@skalee it's totally fine to delegate to a URI parser, but note that the standard library, at least, only supports a subset of valid URIs.
Various types of URIs that are valid according to the RFC (e.g. those without a scheme, if I remember correctly) may not be considered valid by the standard library parser.
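To illustrate the stdlib's behaviour (exact results may vary by Ruby version): a scheme-less string actually parses without error, but yields a generic URI with no scheme or host, so "parses without error" alone is not a validity check:

```ruby
require "uri"

# A scheme-less string parses fine, but everything lands in the path:
u = URI.parse("google.com")
u.scheme # => nil
u.host   # => nil
u.path   # => "google.com"

# Whereas a string with an illegal character does raise:
begin
  URI.parse("http://exa mple.com")
rescue URI::InvalidURIError
  # rejected by the stdlib parser
end
```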
A valid parser should really rely on the ABNF syntax of the URI RFC, not a regular expression.
Hmm, currently we do both things: validate against regexp and attempt to parse with stdlib's URI parser. And given string must pass both in order to be considered valid.
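That two-step approach could be sketched as follows. The regexp here is a simplified stand-in, not the gem's actual one, and `valid_url?` is a hypothetical helper:

```ruby
require "uri"

# Simplified stand-in for the gem's regexp (assumption, not the real one):
# an http(s) scheme, a host ending in a 2-63 character TLD, optional path.
URL_REGEXP = %r{\Ahttps?://[^\s/?#]+\.[A-Za-z]{2,63}(/\S*)?\z}

# Hypothetical helper: the string must pass BOTH the regexp and the parser.
def valid_url?(str)
  return false unless str.match?(URL_REGEXP)
  uri = URI.parse(str)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

valid_url?("http://google.accountant") # => true
valid_url?("http://google")            # => false
valid_url?("google.com")               # => false
```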
@skalee got it. Whatever approach helps us parse valid URIs and IRIs correctly is the good approach. The current approach may contain problems. We should incorporate a list of valid URIs into the test suite to ensure that :-)
At the moment we validate a URL using this regexp:
https://github.com/riboseinc/url_validator/blob/master/lib/validators/url_validator.rb#L116
However, it's supposed to be more complex:
https://mathiasbynens.be/demo/url-regex https://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python
As with validating an email address, there's no perfect regexp that covers all the cases. I figure the same goes for validating a URL.
Some research is needed: which regexp should be used? Would it always validate all possible URLs correctly?