The regex is supposed to be more complex than we currently have

GildedHonour commented 7 years ago

At the moment we validate a url using this regexp

 %r{^#{
          protocol
        }[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$}iux

https://github.com/riboseinc/url_validator/blob/master/lib/validators/url_validator.rb#L116

However, it's supposed to be more complex:

https://mathiasbynens.be/demo/url-regex https://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python

Like for validating an email, there's no perfect regexp which covers all the cases. The same I figure goes for validating a url.

A research is needed: what regexp should be used? Would it validate all possible urls correctly always?

GildedHonour commented 7 years ago

In the test:

    invalid_urls = [
      'google.com',
      'http://google',
      'http://google.looooongltd'
    ]

However, nowadays there exist quite long tld's such as "network", "solutions", "download", "accountant". The last one contains 10 characters in it.

For tld 63 characters is allowed - https://en.wikipedia.org/wiki/Domain_Name_System#cite_ref-rfc1034_1-2

GildedHonour commented 7 years ago

https://github.com/riboseinc/url_validator/blob/master/lib/validators/url_validator.rb#L11

"ssh" should be added.

ronaldtse commented 7 years ago

@GildedHonour agree with both changes.

I believe this regular expression was derived from RFC 3986 Appendix B which defines URIs:

     ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

We could also switch to ABNF parsing which is more correct using RFC 3986 Appendix A.

Perhaps you can help with correcting this?

GildedHonour commented 7 years ago

@ronaldtse yup.

GildedHonour commented 7 years ago

@ronaldtse should the gem support IRI https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier? https://tools.ietf.org/html/rfc3987

ronaldtse commented 7 years ago

Good one @GildedHonour . We should definitely support IDNs via punycode conversion. Supporting the IRI ABNF syntax (which is an expansion of character space, with a restriction on private unicode only possible in the query element) is also straightforward.

However, we probably don't need to support the "Bidirectional IRIs" for RTL languages, at least for now?

ronaldtse commented 7 years ago

There are some Ruby-based ABNF parsers available so maybe we can leverage them too.

GildedHonour commented 7 years ago

I've found a regex for validating IRI, it's quite long. It might already support "Bidirectional IRIs" and unicode, I'll find out.

I've checkoud out some ruby ABNF gems, they're pretty outdated, 2 and 5 years ago are the latest commits. We might not need to use them and use a regexp instead. Yes, it's pretty complex and hard to maintain, but we'll rarely need to change it and understand what it does. Athough it'll be harder to allow a user specify a list of schemas they want to validation or parts of an IRI and incorporate that into the regexp.

skalee commented 6 years ago

IRI is supported by Addressable gem, see #64. Actually, any reason to match strings against regular expression at all? Isn't enough that given string can be parsed without error by URI parser, either from standard library or Addressable?

ronaldtse commented 6 years ago

@skalee it's totally fine to delegate to a URI parser, but note that, at least the standard library only supports a subset of valid URIs.

Various types of valid URIs according to the RFC (e.g. when there is no scheme, if I remember correctly) many not considered valid in the standard library parser.

ronaldtse commented 6 years ago

A valid parser should really rely on the ABNF syntax of the URI RFC, not a regular expression.

skalee commented 6 years ago

Hmm, currently we do both things: validate against regexp and attempt to parse with stdlib's URI parser. And given string must pass both in order to be considered valid.

ronaldtse commented 6 years ago

@skalee got it. Whatever approach that helps us parse valid URIs and IRIs correctly, is the good approach. The current approach may contain problems. We should use incorporate a list of valid URIs in the test suite to ensure that :-)

riboseinc / uri_format_validator

The regex is supposed to be more complex than we currently have #22