Support options for Reachability

ribose-jeffreylau commented 7 years ago

To add an option to test for the given URL to see if it’s reachable, or if the domain is resolvable.

  # domain name resolvable
  validates :my_url, url: {resolvable: true}

  # host is reachable
  validates :my_url, url: {reachable: true}

  # full URL document returns 2xx HTTP status code
  validates :my_url, url: {retrievable: true}

[x] Implement URL resolvability checks
[ ] Implement host reachability checks
[x] Implement content retrievability checks
[ ] Cover everything with specs
[ ] Write some documentation
[ ] Prepare usage examples

Also see #58 and #59.

GildedHonour commented 7 years ago

@ribose-jeffreylau @wacko to verify if a url is reachable, ping a domain and what-not - all this will require sending http/icmp request which can take time. Thus it should be sent asyncronously. We could send a request syncronously and receive a reply quickly, but it's not reliable, sometimes it could fail, take a long time... How would it work together with the normal validation which works syncronously?

ronaldtse commented 7 years ago

@GildedHonour you are right. resolvable is just a DNS request which should be relatively fast, but ICMP/HTTP should be asynchronous. We should probably drop them as validation requirements.

On the other hand, I found this gem https://github.com/weppos/publicsuffix-ruby that uses Mozilla's PublicSuffix for validating top-level domains. We could leverage off it too.

GildedHonour commented 7 years ago

Yeah, but a network request isn't reliable by definition. I'll do a research, maybe there's a way. PublicSuffix may be what we need, if it contains all possible ltd's we'll use it, I'll do a research too.

ronaldtse commented 7 years ago

Yes you're right. Maybe this should all be done asynchronously. However async validations will require some kind of background processing framework... Let us know what your research tells us 👍

GildedHonour commented 7 years ago

1) https://publicsuffix.org doesn't get updated immediately, thus when a new tld comes out, it doesn't get added to publicsuffix right away. It'll be safer to use this https://data.iana.org/TLD/tlds-alpha-by-domain.txt However, why would we need to validate only the "ltd" part for existence? We already have a regex/grammar in the gem for validating a whole URI/IRI which includes the ltd part.

2) Yes, a background validation will require Sidekiq or the like, which is an additional dependency and heavy. And we'll have to create a new column, probably, where the status of a validation of each "url" will be stored. Too much overhead, what do you think?

3) Thus validating it synchronously might be better. Yes, it might fail, take some time which isn't good for user experience but on some webservices there's a validation of this kind as well.

4)

# domain name resolvable
  validates :my_url, url: {resolvable: true}

  # host is reachable
  validates :my_url, url: {reachable: true}

  # full URL document returns 2xx HTTP status code
  validates :my_url, url: {retrievable: true}

They look the same for me. What's the difference between the 1st and 2nd? And when would a user want to use the 3rd instead of the 1st or 2nd?

ribose-jeffreylau commented 7 years ago

@GildedHonour

Perhaps it'll be useful when we want to allow / disallow certain TLDs.
asynchronous validation: I think it's too much too. We haven't figured out what to do with the validation-pending objects.
synchronous validation: Agreed.
This retrievable ensures that at the time of validation, the URL is a "working" one, i.e. not a broken link and whatnot.

GildedHonour commented 7 years ago

@ribose-jeffreylau 1-3 Ok.

Yeah, pinging a domain for existance and being up makes sense. And the 1st and 2nd ones, aren't they the same? When would they be useful?

ronaldtse commented 7 years ago

Resolvable means that the domain name resolves to an IP address; reachable means the resolved IP address can actually responds, for example with a TCP handshake (or an ACK). Retrievable goes further that it ensures it is a correct HTTP response.

GildedHonour commented 7 years ago

I'll implement all 3. However, I'd leave only the 2nd one -- a host exists/url is reachable and responds, not necessarily with 2xx.

ronaldtse commented 7 years ago

Sure. Thanks @GildedHonour !

GildedHonour commented 7 years ago

Added this:

  validates :my_url, url: {resolvability: level}

Where level can be :resorvable, :reachable or :retrievable

ronaldtse commented 7 years ago

@GildedHonour that's great! Could you start a PR to merge that into master? I noticed it's pushed to develop now. It would be better if you separate it into a topic branch so the changes don't get lost in develop.

GildedHonour commented 7 years ago

@ronaldtse yup, will do.

skalee commented 6 years ago

@ronaldtse What's the status of this issue? I see some related logic, documentation is missing for sure… I guess some final touches are required and it's done, isn't it? https://github.com/riboseinc/uri_format_validator/blob/5357f6aa52192cb0f7f74e62e1836e0d6acd465f/lib/uri_format_validator/validators/uri_format_validator.rb#L73-L82

skalee commented 6 years ago

Also tests are missing.

skalee commented 6 years ago

This feature is certainly not done yet, and it's certainly buggy. I'm adding a TODO list to the topmost comment.

skalee commented 6 years ago

@ronaldtse Got a question: how the resolvable option is going to work when the hostname is an IP address? Like in http://127.0.0.1:80/something?

ribose-jeffreylau commented 6 years ago

@skalee I think IP addresses should be immediately resolvable, hence it should immediately pass the resolvability validation.

skalee commented 6 years ago

And what about URIs and URLs with unspecified host, like file:///dev/null or urn:ISSN:0167-6423? Should they pass host reachability check or not?

ronaldtse commented 6 years ago

The intention of "host reachability" is that the path is actually reachable. So the ones with no host are assumed to pass "host reachability" since the path is already reachable without a host.

skalee commented 6 years ago

Hmm, I got your point, but after some thinking I leaning towards opinion that doing it this way will be bit unclear. And since we're talking about host here, it should be merged with host constraints somehow. Following roughly describes what I'm thinking about:

host: ["example.com", "example.test"]
host: :localhost
host: :reachable
host: [:resolvable, nil]

I'll hold on with merging the domain resolvability checks in.

riboseinc / uri_format_validator

Support options for Reachability #12