Open jvtm opened 1 year ago
So, reading the spec(s), I'm not sure this is correct actually!
In particular, excluding draft 2020-12 which I'll get back to, earlier drafts all say the equivalent of:
A string instance is valid against this attribute if it is a valid representation for an Internet host name, as defined by RFC 1034, section 3.1 [RFC1034].
(see e.g. here though as I say you can flip through all the drafts and they say more or less the same thing).
Looking at that section, it says:
Each node has a label, which is zero to 63 octets in length. Brother nodes may not have the same label, although the same label can be used for nodes which are not brothers. One label is reserved, and that is the null (i.e., zero length) label used for the root.
which seems to say that the empty string is indeed allowed, and in the context of DNS, refers to the root.
Draft 2020-12 on the other hand was changed to reference RFC 1123 and not RFC 1034 (some context is here):
hostname: As defined by RFC 1123, section 2.1 [RFC1123] [...]
where RFC 1123 in that section seems to simply pass things along to RFC 952:
The syntax of a legal Internet host name was specified in RFC-952 [DNS:4]. One aspect of host name syntax is hereby changed: the restriction on the first character is relaxed to allow either a letter or a digit. Host software MUST support this more liberal syntax.
where the grammar in the latter seems to say:
<hname> ::= <name>*["."<name>]
<name> ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]
which also seems to allow the empty string.
Have a look and if you have some existing domain knowledge let me know, but yeah my first read looks like these are actually valid under the quoted specs.
Does this mean we should keep the tests, but ensure that validation passes instead?
To my read, yes.
Uhh, what a can of worms.. ;)
Some docs here and there limit DNS labels to be 1-63 octets (and some 0-63), and full entry max 253. But no mentions of allowing / denying no labels at all. And then the usual "use a single dot for the root".
I wonder how various real life tools work.
Yes well no good deed goes unpunished certainly. The authoritative thing here is the spec(s) / RFCs, not what tools do for better or worse, though certainly looking through them can be insightful anyhow -- I certainly saw one allowing an empty string.
I'll reverse the check and reword the commit and PR later today.
Is it worth adding any other checks?
None that I can immediately think of, let's see if anyone else chimes in between now and then, and regardless thanks for raising the PR!
I'll give this a run through my implementation tomorrow.
Looks like in Python jsonschema
idn-hostname
for both zero length string and a single dot fail.
The check uses idna
library https://github.com/kjd/idna/blob/8c703c782bf8144091c63470fcdbf29ac47eeb15/idna/core.py#L340-L370
(That's not surprising to me -- the question is usually twofold --
In that case, I'm also somewhat trusting the JSON Schema spec which says "all hostnames are valid idn-hostnames" though that too may be wrong! Tricky tricky...)
FWIW, my implementation does not accept empty string:
$; json-schema-eval --validate_formats
enter data instance, followed by ^D:
""
enter schema, followed by ^D:
{"type":"string","format":"hostname"}
^D
{
"errors": [
{
"error": "not a hostname",
"instanceLocation": "",
"keywordLocation": "/format"
}
],
"valid": false
}
..which is just a wrapper around https://metacpan.org/pod/Data::Validate::Domain#is_domain($domain,-\%options) which says it implements the RFCs.
Weirdly, my implementation says hostname
with empty string is valid, but idn-hostname
with empty string is invalid.
Ah! For idn-hostname
, I'm using System.Uri.CheckHostName()
provided by .Net. But for hostname
, I'm using the regex ^[a-zA-Z][-.a-zA-Z0-9]{0,22}[a-zA-Z0-9]$
which is very likely incorrect; I'm not sure where I got it (probably SO).
I don't have time to work on this at the moment. I can see that there's some interest, and in theory the tests should be strict, but I'll leave that for others to decide. Feel free to take over and add the required tests.
For what it's worth, I added additional pattern
to the actual use-case (that does not allow empty string, or actually any top level domain either).
As far as I can tell, for anyone picking this back up, there isn't any additional test to add I don't think beyond the ones already in this PR, so the work is simply to flip the "false"'s to "true" as the expected result, or to otherwise show where in the RFCs I quoted is a restriction on empty strings.
(And thanks @jvtm for getting this started!)
In the context of DNS, the root is an empty fragment to allow for absolute paths.
For example, if your DNS search path is example.com
github.com
could either be resolved to the A record in the github.com
zone.
Or it could be resolved to whatever is at github.com.example.com
if that failed.
Whereas github.com.
is an absolute domain that must not use search paths.
The above behaviour is a gross security issue, so the default configuration for most implementations of DNS resolvers is to use the search path for path length of 1. (e.g. host
would resolve to host.example.com
but github.com
would not due to an implementation defined constraint)
My interpretation of this:
Each node has a label, which is zero to 63 octets in length. Brother nodes may not have the same label, although the same label can be used for nodes which are not brothers. One label is reserved, and that is the null (i.e., zero length) label used for the root.
"a..example.com"
-> not a valid hostname (empty label is reserved for the root)
".example.com"
-> not a valid hostname (empty label is reserved for the root)
"example.com."
-> valid (empty label at the end indicates explicit root)
""
-> not a valid hostname, but is a valid domain name (you can't resolve it usefully, but joining a host to a domain like python's ".".join(['github', 'com', ''])
is a useful and valid thing to do.
About the change from RFC1034 to RFC1123 DNS is very explicitly a binary protocol with no restrictions other than length on what each segment of a name can be. You can see this clarified in RFC2181. Whereas RFC1123 defines the sort of alphanumeric host name character set we are used to seeing (and what is for example a valid hostname in a URL)
Considering this grammar in RFC952:
<hname> ::= <name>*["."<name>]
::= [*[ ] ]
A name consists of a let
(relaxed by RFC to let-or-digit
), optionally followed by any number of let-or-digit-or-hyphen
terminated by let-or-digit
.
These character sets are not defined in the grammar, but referring to the assumptions in section 1:
<let-or-digit> = alphanumeric, i.e. `[A-Za-z0-9]`
<let-or-digit-or-hyphen> = alphanumeric + hyphen, i.e. `[A-Za-z0-9\-]`
so a name is represented by the regular expression [A-Za-z0-9]([A-Za-z0-9\-]*[A-Za-z0-9])?
""
-> invalid
"a"
-> valid
"2"
-> valid (RFC1123 updates)
"-aleph"
-> invalid (hyphen only allowed in the middle of a name not start or end)
aleph-beth
-> valid
and a hname (hostname) is 1 or more names joined with a .
Interestingly this grammar does not allow "github.com."
since names are not permitted to be empty.
A further complication are DNS service records, these explicitly begin with an underscore to distinguish them from hostnames.
e.g. _ldap._tcp.example.com
would be where to find the SRV record for DNS service discovery of the LDAP service for example.com, if one exists.
I think it is reasonable to say that DNS service records are not hostnames and should not validate as "format": "hostname"
in json-schema.
I had some comment written out about how I had read your comment 2 or 3 times and couldn't fully follow it because it had outside info that wasn't seemingly relevant to answering "what is the spec's behavior regardless of what you want to use it for and whether it's appropriate for that use case" -- but then the fourth time reading it I noticed a quite simple error (on my part) which I think explains the behavior, namely:
<hname> ::= <name>*["."<name>]
I think I carelessly assumed and/or forgot that that BNF syntax in RFCs puts the *
repetition before and not after, so that means as you said, a name with one or more separated names, and very much not "zero or more names, followed by ..."
So I'm at least personally convinced now that they indeed are not allowed in the grammar -- I'm not sure if anyone else actually was on either side here honestly, as usual I find the "my implementation does X" not very helpful compared to "I understand the behavior to be X because that's what it says" -- if anyone does have some understanding and wants to speak up on either side feel free -- otherwise if it was just me as a holdout, I'm happy to see the merge conflicts fixed and this merged with them being invalid under these lines in the spec.
Sorry, I realize now I wrote that whole comment out the second time without saying "thank you! clearly your analysis was helpful for getting to the bottom of this now it seems"!
Assert that hostname format validation fails gracefully on empty strings.
This is especially for Python
jsonschema
library that raises an unexpected ValueError exception onhostname
check (python-jsonschema/jsonschema#1121).Adds similar test for: