json-schema-org / JSON-Schema-Test-Suite

A language agnostic test suite for the JSON Schema specifications
MIT License
625 stars 209 forks source link

test: hostname format check fails on empty string #677

Open jvtm opened 1 year ago

jvtm commented 1 year ago

Assert that hostname format validation fails gracefully on empty strings.

This is especially for Python jsonschema library that raises an unexpected ValueError exception on hostname check (python-jsonschema/jsonschema#1121).

Adds similar test for:

Julian commented 1 year ago

So, reading the spec(s), I'm not sure this is correct actually!

In particular, excluding draft 2020-12 which I'll get back to, earlier drafts all say the equivalent of:

A string instance is valid against this attribute if it is a valid representation for an Internet host name, as defined by RFC 1034, section 3.1 [RFC1034].

(see e.g. here though as I say you can flip through all the drafts and they say more or less the same thing).

Looking at that section, it says:

Each node has a label, which is zero to 63 octets in length. Brother nodes may not have the same label, although the same label can be used for nodes which are not brothers. One label is reserved, and that is the null (i.e., zero length) label used for the root.

which seems to say that the empty string is indeed allowed, and in the context of DNS, refers to the root.

Draft 2020-12 on the other hand was changed to reference RFC 1123 and not RFC 1034 (some context is here):

hostname: As defined by RFC 1123, section 2.1 [RFC1123] [...]

where RFC 1123 in that section seems to simply pass things along to RFC 952:

The syntax of a legal Internet host name was specified in RFC-952 [DNS:4]. One aspect of host name syntax is hereby changed: the restriction on the first character is relaxed to allow either a letter or a digit. Host software MUST support this more liberal syntax.

where the grammar in the latter seems to say:

<hname> ::= <name>*["."<name>]
<name>  ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]

which also seems to allow the empty string.

Have a look and if you have some existing domain knowledge let me know, but yeah my first read looks like these are actually valid under the quoted specs.

gregsdennis commented 1 year ago

Does this mean we should keep the tests, but ensure that validation passes instead?

Julian commented 1 year ago

To my read, yes.

jvtm commented 1 year ago

Uhh, what a can of worms.. ;)

Some docs here and there limit DNS labels to be 1-63 octets (and some 0-63), and full entry max 253. But no mentions of allowing / denying no labels at all. And then the usual "use a single dot for the root".

I wonder how various real life tools work.

Julian commented 1 year ago

Yes well no good deed goes unpunished certainly. The authoritative thing here is the spec(s) / RFCs, not what tools do for better or worse, though certainly looking through them can be insightful anyhow -- I certainly saw one allowing an empty string.

jvtm commented 1 year ago

I'll reverse the check and reword the commit and PR later today.

Is it worth adding any other checks?

Julian commented 1 year ago

None that I can immediately think of, let's see if anyone else chimes in between now and then, and regardless thanks for raising the PR!

gregsdennis commented 1 year ago

I'll give this a run through my implementation tomorrow.

jvtm commented 1 year ago

Looks like in Python jsonschema idn-hostname for both zero length string and a single dot fail.

The check uses idna library https://github.com/kjd/idna/blob/8c703c782bf8144091c63470fcdbf29ac47eeb15/idna/core.py#L340-L370

Julian commented 1 year ago

(That's not surprising to me -- the question is usually twofold --

In that case, I'm also somewhat trusting the JSON Schema spec which says "all hostnames are valid idn-hostnames" though that too may be wrong! Tricky tricky...)

karenetheridge commented 1 year ago

FWIW, my implementation does not accept empty string:

$; json-schema-eval --validate_formats
enter data instance, followed by ^D:
""
enter schema, followed by ^D:
{"type":"string","format":"hostname"}
^D
{
  "errors": [
    {
      "error": "not a hostname",
      "instanceLocation": "",
      "keywordLocation": "/format"
    }
  ],
  "valid": false
}

..which is just a wrapper around https://metacpan.org/pod/Data::Validate::Domain#is_domain($domain,-\%options) which says it implements the RFCs.

gregsdennis commented 1 year ago

Weirdly, my implementation says hostname with empty string is valid, but idn-hostname with empty string is invalid.

Ah! For idn-hostname, I'm using System.Uri.CheckHostName() provided by .Net. But for hostname, I'm using the regex ^[a-zA-Z][-.a-zA-Z0-9]{0,22}[a-zA-Z0-9]$ which is very likely incorrect; I'm not sure where I got it (probably SO).

jvtm commented 1 year ago

I don't have time to work on this at the moment. I can see that there's some interest, and in theory the tests should be strict, but I'll leave that for others to decide. Feel free to take over and add the required tests.

For what it's worth, I added additional pattern to the actual use-case (that does not allow empty string, or actually any top level domain either).

Julian commented 1 year ago

As far as I can tell, for anyone picking this back up, there isn't any additional test to add I don't think beyond the ones already in this PR, so the work is simply to flip the "false"'s to "true" as the expected result, or to otherwise show where in the RFCs I quoted is a restriction on empty strings.

Julian commented 1 year ago

(And thanks @jvtm for getting this started!)

shane-kearns commented 8 months ago

In the context of DNS, the root is an empty fragment to allow for absolute paths. For example, if your DNS search path is example.com github.com could either be resolved to the A record in the github.com zone. Or it could be resolved to whatever is at github.com.example.com if that failed. Whereas github.com. is an absolute domain that must not use search paths.

The above behaviour is a gross security issue, so the default configuration for most implementations of DNS resolvers is to use the search path for path length of 1. (e.g. host would resolve to host.example.com but github.com would not due to an implementation defined constraint)

My interpretation of this:

Each node has a label, which is zero to 63 octets in length. Brother nodes may not have the same label, although the same label can be used for nodes which are not brothers. One label is reserved, and that is the null (i.e., zero length) label used for the root.

"a..example.com" -> not a valid hostname (empty label is reserved for the root) ".example.com" -> not a valid hostname (empty label is reserved for the root) "example.com." -> valid (empty label at the end indicates explicit root) "" -> not a valid hostname, but is a valid domain name (you can't resolve it usefully, but joining a host to a domain like python's ".".join(['github', 'com', '']) is a useful and valid thing to do.

About the change from RFC1034 to RFC1123 DNS is very explicitly a binary protocol with no restrictions other than length on what each segment of a name can be. You can see this clarified in RFC2181. Whereas RFC1123 defines the sort of alphanumeric host name character set we are used to seeing (and what is for example a valid hostname in a URL)

Considering this grammar in RFC952:

  <hname> ::= <name>*["."<name>]
::= [*[]]

A name consists of a let (relaxed by RFC to let-or-digit), optionally followed by any number of let-or-digit-or-hyphen terminated by let-or-digit.

These character sets are not defined in the grammar, but referring to the assumptions in section 1:

<let-or-digit> = alphanumeric, i.e. `[A-Za-z0-9]`
<let-or-digit-or-hyphen> = alphanumeric + hyphen, i.e. `[A-Za-z0-9\-]`

so a name is represented by the regular expression [A-Za-z0-9]([A-Za-z0-9\-]*[A-Za-z0-9])? "" -> invalid "a" -> valid "2" -> valid (RFC1123 updates) "-aleph" -> invalid (hyphen only allowed in the middle of a name not start or end) aleph-beth -> valid

and a hname (hostname) is 1 or more names joined with a . Interestingly this grammar does not allow "github.com." since names are not permitted to be empty.

A further complication are DNS service records, these explicitly begin with an underscore to distinguish them from hostnames. e.g. _ldap._tcp.example.com would be where to find the SRV record for DNS service discovery of the LDAP service for example.com, if one exists. I think it is reasonable to say that DNS service records are not hostnames and should not validate as "format": "hostname" in json-schema.

Julian commented 8 months ago

I had some comment written out about how I had read your comment 2 or 3 times and couldn't fully follow it because it had outside info that wasn't seemingly relevant to answering "what is the spec's behavior regardless of what you want to use it for and whether it's appropriate for that use case" -- but then the fourth time reading it I noticed a quite simple error (on my part) which I think explains the behavior, namely:

<hname> ::= <name>*["."<name>]

I think I carelessly assumed and/or forgot that that BNF syntax in RFCs puts the * repetition before and not after, so that means as you said, a name with one or more separated names, and very much not "zero or more names, followed by ..."

So I'm at least personally convinced now that they indeed are not allowed in the grammar -- I'm not sure if anyone else actually was on either side here honestly, as usual I find the "my implementation does X" not very helpful compared to "I understand the behavior to be X because that's what it says" -- if anyone does have some understanding and wants to speak up on either side feel free -- otherwise if it was just me as a holdout, I'm happy to see the merge conflicts fixed and this merged with them being invalid under these lines in the spec.

Julian commented 8 months ago

Sorry, I realize now I wrote that whole comment out the second time without saying "thank you! clearly your analysis was helpful for getting to the bottom of this now it seems"!