Reducing the kdl/json gap

tabatkins commented 3 years ago

kdl is a great upgrade over xml in many ways, which is wonderful. But I don't think it's controversial to say that one of the reasons json won so drastically over xml is that it makes Lists and Dicts, the primary CS datastructures, extremely terse and unambiguous to write; xml achieves neither of those properties. kdl, unfortunately, inherits some of this weakness from xml, which I ran into while developing JiK. There are two primary issues that I think would be relatively easy and high-value to fix, without significantly breaking the current standard.

First, KDL draws a bright line between values, which must be primitives, and children, which must be nodes. These can't be intermixed; you have to write all of a node's values first, followed by all of its children. This prevents you from easily encoding simple data structures like a nested list, where each list item can be a primitive or another list. It's also the sole place where kdl is actually weaker than xml, I think - xml can mix text and child nodes together in its child list, while kdl can't.

In JiK I worked around this by blessing the - node: it's treated specially as a "primitive wrapper", containing only a single value, which is the primitive it represents. This lets you mix primitives and nodes freely. (And XiK does something identical, just restricting the wrapped primitive to be a string.)

I think we could bless this in the syntax, to allow primitives to be written in the child list. Proposal:

The single dash - is forbidden as a node name.
Instead, in a child list you can precede a primitive with a single dash. This unambiguously distinguishes it from nodes (otherwise a string would look like a node with no values or children).
The - syntax represents the primitive itself; - "foo" is not "a node named - containing a single value 'foo'", it's "the primitive string 'foo'", full stop.
The - syntax only contains that single primitive. Trying to give it more values, named values, or child nodes is just a syntax error. It still has to be ended by a newline or a ;, like any other node, tho.
(Optional: - can take multiple unnamed values, so you can easily write several primitives in a row without requiring them to be separated like nodes; - 1 2 3; rather than - 1; - 2; - 3;. Unsure if this is a good idea.)
The - might be slightly confusing when paired with numbers: the visual distinction between - 1 and -1 is pretty low. Another possibility is we lean on the non-initial chars list, perhaps choosing > as the indicator, so >1 or >"foo" is the way to write primitives in a child list.
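To make the first proposal concrete, here is a rough sketch of the data model it implies: a child list becomes a sequence of entries that are each either a bare primitive or a full node. All type and function names (Value, Child, Node, count_primitives, nested_list_example) are hypothetical and not taken from any KDL implementation.

```rust
// Hypothetical model: a child-list entry is either a primitive or a node.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
}

#[derive(Debug, Clone, PartialEq)]
enum Child {
    Primitive(Value), // written `- 1` (or `>1`) in the proposed syntax
    Node(Node),
}

#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    children: Vec<Child>,
}

// Count the primitives reachable from a node, at any nesting depth.
fn count_primitives(node: &Node) -> usize {
    node.children
        .iter()
        .map(|child| match child {
            Child::Primitive(_) => 1,
            Child::Node(inner) => count_primitives(inner),
        })
        .sum()
}

// Builds the nested list [1, [2, "three"]], i.e.
//   list { - 1; list { - 2; - "three" } }
fn nested_list_example() -> Node {
    let inner = Node {
        name: "list".to_string(),
        children: vec![
            Child::Primitive(Value::Int(2)),
            Child::Primitive(Value::Str("three".to_string())),
        ],
    };
    Node {
        name: "list".to_string(),
        children: vec![Child::Primitive(Value::Int(1)), Child::Node(inner)],
    }
}
```

With this shape, a list item that is a primitive and a list item that is another list sit side by side in the same Vec, which is exactly what the values-then-children split prevents today.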

Second, KDL inherits XML's "attributes are named, children are positional" distinction. It loosens the restriction on attributes, allowing them to go unnamed/positional, but still requires child nodes to be purely positional. The only way to "name" a child node is by using the node name itself, which presents an uncomfortable tension between using the node name for organization purposes in the parent and using it to identify the child data.

In JiK I solved this by mandating that the child nodes of an object node must use their first value to express their key as a string, so {"foo": [1,2]} becomes object { array "foo" 1 2; }. This is unambiguous in JiK itself, because only the object node needs this and it needs it for every child (so you know for a fact that, say, that array child doesn't actually represent ["foo", 1, 2]), but it doesn't work in general, and also it's an ugly hack that's hard to read.

I think this should be blessed as well, but with a better syntax than what I used in JiK. Here's a possible suggestion:

Since = is already disallowed from node names, just use that - you can write foo=node 1 2 3 in the child list, and it represents a named child node. This even maintains the "kdl kinda looks like console commands/bash" metaphor, since this kinda looks like bash scripting assigning things to variables.
Whitespace should be flexible around the =, so foo = node 1 2 3 is also possible, etc.
Primitives once again pose an issue, since foo = true is ambiguous between the primitive true and an empty node named true (which is allowed in any other circumstance). Sadly, this might mean we need to mandate the use of the - syntax here as well to encode primitives, so foo = - true, or foo=- true in the most compact form. Not the prettiest.
The alternate suggestion of > for primitives might work a little better here - foo=>true (primitive) vs foo=true (empty node named true).
Alternately, just disallow primitives here. Named values aren't meant to be ordered anyway, so mandating that named primitives have to come first, in the value list, might be fine. It makes some patterns a little more awkward, but that might be okay for the tradeoffs involved.

With these suggestions, JiK would almost completely disappear. A JSON document like {"foo": [1, 2, {"bar": 3}], "baz":4} would become idiomatic KDL:

object baz=4 {
    foo=array 1 2 {
        object bar=3
    }
}

or possibly

object {
    foo=array {
        >1
        >2
        object bar=3
    }
    baz=>4
}

zkat commented 3 years ago

You know.

I wonder what would happen if we unified the properties/values/children "namespaces" and made properties/named children equivalent, and values/>-children equivalent...

So these two would become the same thing:

foo 1 2 3
bar a=1 b=2 c=3

and

foo {
  - 1
  - 2
  - 3
}
bar {
  a 1
  b 2
  c 3
}

That is, make it so node-level values/attributes are literally just syntax sugar for named and anonymous nodes?
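A minimal sketch of what "just syntax sugar" could mean in practice, assuming a simplified node model with integer-only values (the Node type and the normalize and leaf helpers are hypothetical): leaf children named - are lifted into the value list, other leaf children are lifted into the property list, and two spellings are equivalent when they normalize to the same node.

```rust
// Hypothetical, simplified node: integer values only, ordered properties.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    values: Vec<i64>,
    props: Vec<(String, i64)>,
    children: Vec<Node>,
}

impl Node {
    fn new(name: &str) -> Node {
        Node { name: name.to_string(), values: vec![], props: vec![], children: vec![] }
    }

    // True for children like `- 1` or `a 1`: exactly one value, nothing else.
    fn is_leaf(&self) -> bool {
        self.values.len() == 1 && self.props.is_empty() && self.children.is_empty()
    }
}

// Convenience constructor for a single-value child such as `a 1`.
fn leaf(name: &str, value: i64) -> Node {
    let mut node = Node::new(name);
    node.values.push(value);
    node
}

// Lift leaf children into the parent: `-` leaves become positional values,
// named leaves become properties, everything else stays a (normalized) child.
fn normalize(node: &Node) -> Node {
    let mut out = Node::new(&node.name);
    out.values = node.values.clone();
    out.props = node.props.clone();
    for child in &node.children {
        if child.is_leaf() && child.name == "-" {
            out.values.push(child.values[0]);
        } else if child.is_leaf() {
            out.props.push((child.name.clone(), child.values[0]));
        } else {
            out.children.push(normalize(child));
        }
    }
    out
}
```

Under this sketch, foo 1 2 3 and foo { - 1; - 2; - 3 } normalize to the same node, as do bar a=1 b=2 c=3 and bar { a 1; b 2; c 3 }. It also shows the cost: a child like a 1 can no longer be told apart from a property a=1.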

zkat commented 3 years ago

Continuing this train of thought, your JSON example now becomes:

foo {
  - 1
  - 2
  - {
    bar 3
  }
}
baz 4

Which looks a little weird, but can reliably be converted to dict/list data models?
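One way the conversion to a dict/list model might go, as a sketch only: a child list whose entries are all named - becomes a list, and any other child list becomes a map keyed by node name. The Json and Node types and the helper names are assumptions made for illustration, and the node model is cut down to at most one integer value per node.

```rust
// Hypothetical target model; ordered pairs keep the sketch dependency-free.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Num(i64),
    List(Vec<Json>),
    Map(Vec<(String, Json)>),
}

// Hypothetical, simplified node: at most one primitive, plus children.
#[derive(Debug, Clone)]
struct Node {
    name: String,
    value: Option<i64>,
    children: Vec<Node>,
}

fn leaf(name: &str, value: i64) -> Node {
    Node { name: name.to_string(), value: Some(value), children: vec![] }
}

fn branch(name: &str, children: Vec<Node>) -> Node {
    Node { name: name.to_string(), value: None, children }
}

// A node with a value is that value; otherwise its children decide.
fn node_to_json(node: &Node) -> Json {
    match node.value {
        Some(n) => Json::Num(n),
        None => children_to_json(&node.children),
    }
}

// All-`-` child lists become lists; anything else becomes a map.
// Note: an empty child list is treated as an empty list here.
fn children_to_json(children: &[Node]) -> Json {
    if children.iter().all(|c| c.name == "-") {
        Json::List(children.iter().map(node_to_json).collect())
    } else {
        Json::Map(
            children
                .iter()
                .map(|c| (c.name.clone(), node_to_json(c)))
                .collect(),
        )
    }
}
```

The sketch leaves two questions open that a real design would have to answer: what an empty child list means, and what to do with a child list that mixes - entries with named ones (here it silently becomes a map with a "-" key).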

Lucretiel commented 3 years ago

I wonder what would happen if we unified the properties/values/children "namespaces" and made properties/named children equivalent, and values/>-children equivalent...

FWIW, I was originally planning on using this interpretation in kaydle, where a struct-from-node would accept both named children and properties as fields. I did end up deciding that probably the child/property distinction is meaningful to a struct author, which means that a node that has both properties and children will have the children collected into the last field. That is:

/*
Deserializes from:

node x=10 y=20

and also:

node {
    x 10
    y 20
}
*/
struct Simple {
    x: i32,
    y: i32,
}

But:

/*
Deserializes from:

node name="Hello" description="desc" {
    x 10
    y 20
}

Or:

node {
    name "Hello"
    description "desc"
    simple x=10 y=20
}
*/
struct Complex {
    name: String,
    description: String,
    simple: Simple,
}

Lucretiel commented 3 years ago

This prevents you from easily encoding simple data structures like a nested list

I was assuming that we'd use the standard XML pattern, where a uniform list is just a list of uniform nodes. That is, [[1, 2, 3], [4, 5, 6], [7, 8, 9]] would be encoded as:

item 1 2 3
item 4 5 6
item 7 8 9
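Reading that pattern back is straightforward, as this sketch suggests (the Node type and decode_rows are hypothetical, and values are restricted to integers for brevity):

```rust
// Hypothetical, simplified node: a name and positional integer values.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    values: Vec<i64>,
}

// Collect every `item` node's values as one row of the nested list.
fn decode_rows(nodes: &[Node]) -> Vec<Vec<i64>> {
    nodes
        .iter()
        .filter(|n| n.name == "item")
        .map(|n| n.values.clone())
        .collect()
}

// Convenience constructor for `item a b c`.
fn item(values: &[i64]) -> Node {
    Node { name: "item".to_string(), values: values.to_vec() }
}
```

This works for uniform lists of primitives. A row that mixes primitives with further nested lists has no place to put the inner list, since values can only be primitives.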

tabatkins commented 3 years ago

That is, make it so node-level values/attributes are literally just syntax sugar for named and anonymous nodes?

I did end up deciding that probably the child/property distinction is meaningful to a struct author,

Yeah, this is the conflict I wrestled with too. I recall commenting early on Twitter, when kdl was just an idea, that I disliked the attribute/child split of XML, because it wasn't clear what should be an attribute vs a child. I've come to think that this was in fact just a complaint about XML's "string is the only data type" issue, and that I do in fact appreciate having a child list separate from the node's "own" values, since kdl is rich enough with data types.

But I'm still conflicted! JiK allows you to use values and children interchangeably, and that seems useful, but then XiK depends on the two being separate (or at least, named primitives, corresponding to XML attrs, being distinguishable from child nodes), tho it also allows a final string value to represent a child node, so there's ambiguity there too.

And finally, maintaining that distinction means that you can't use child nodes to represent a node's own data, so you're still fundamentally limited in what you're capable of representing.

So eh, six of one and half-dozen of the other, but I suspect that overall I lean slightly towards "nodes just have a bunch of children, some of which are anonymous and contain primitives".

That said! We still need to distinguish syntactically between a named child and a node's name! That is, this:

bar a=1 b=2 c=3;
bar {
  a 1
  b 2
  c 3
}

doesn't work; the key and the node name are conflicting. That's acceptable in some contexts, but not others (most?). We still need an explicit indicator that you're providing a key for a child, rather than just providing a positional child node. That's why I suggested the = syntax:

bar {
 a=1
 b=2
 c=3
}

But, as I noted, this still needs a step more; even tho true/false/null are now disallowed as identifiers, there's still ambiguity with nodes using quoted names:

bar {
 a="foo" 
 // is this the string "foo", or an empty child node named "foo"?
}

And unfortunately, we can't just declare that only primitives can be named in the child list, as that still promotes the "values and children are distinct" chasm, and still wouldn't allow JSON to be cleanly encoded in KDL.

zkat commented 3 years ago

Honestly I keep reading all this and thinking about lists and maps and thinking... you know what? Maybe this is just not gonna be what KDL is for.

I started working on KDL because I wanted nice config files, and I think KDL is already excellent at it, and the node-based workflow is ideal for that kind of thing, because of its clarity and flexibility and compactness without noise.

Maybe the answer here is "stop trying to force KDL to be the Every Language", and let it happily settle into the niche it was meant for in the first place? Like, it's great at the thing XML is good at (except markup), and I think that's fine? What do you say?

My other thought on the matter is maybe we can have some kind of declaration that a certain file is JiK or XiK and have parsers that support those modes actually verify this, but at that point, KDL starts just feeling like... a different language? idk idk.

tabatkins commented 3 years ago

I definitely don't want to harm the original use-case here; it's a great one! The JSON impedance mismatch is slightly annoying but not killer if we decide not to address it, but I still feel that there's a niggling problem being left even if we put JSON to the side: nodes can have both positional and named primitives, but only positional children, and that feels like an odd distinction to me. It seems like all the reasoning for having named values applies equally to having named child nodes, or am I wrong?

zkat commented 3 years ago

you're not harming anything! And I might be throwing in the towel a bit too soon just in the interest of getting 1.0 out, but... if we can find a solution I'd like to? I just feel like I keep trying to fit a square peg in a round hole.

As far as positional "children": SDLang actually has this! And I removed it! Because I thought it obfuscated the fact that children are exactly one type: nodes. And that's exactly what SDLang does! So we can make it seem like there's value children, but that's just gonna obfuscate the fact that these are just nodes under the hood and that feels weird to me.

tabatkins commented 3 years ago

Sorry, we're mixing concepts here - afaict, you're talking about SDLang's anonymous nodes, which let you put primitives in the child list, yeah? I'm fine with avoiding that; we might still want to bless - for this purpose as a reserved node name, but it's a separate issue. (and even if it's not blessed, jik/xik both work just fine with using - as an ordinary node that they give special meaning to, so no big)

I was instead talking about named child nodes; just nodes, but with a name/key rather than just a position, like foo="bar" vs just "bar" in the value list. You've mentioned in earlier comments (and so did @Lucretiel) just treating the node's name as its key, but I don't think that's usable in general; it conflates the node's name as "what kind of data is this" (its normal meaning) and "what role does this play in the parent node" (what a key should do). (Basically, it seems identical to saying that we don't need named values anymore, since we have type tags that can serve the same purpose; you could write node (foo)"bar" instead of node foo="bar". I think it's pretty easy to see why that's bad.)

If we do toss out the "put primitives in child list" idea, then just allowing foo=node 1 2 3 in the child list is fine syntactically and wouldn't be, afaict, confusing in the data model.

zkat commented 3 years ago

This sounds like it would add another component to the data model:

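use std::collections::HashMap;

// `Value` stands for a KDL primitive (string, number, boolean, null).
struct Node {
    name: String,
    props: HashMap<String, Value>,
    values: Vec<Value>,
    children: Vec<Node>,
    named_children: HashMap<String, Node>, // the new component
}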

and I'm honestly trying to process what this would mean. @Lucretiel do you have any thoughts about what this kind of change might do when interacting with your data model for kaydle?

larsgw commented 3 years ago

it conflates the node's name as "what kind of data is this" (its normal meaning) and "what role does this play in the parent node" (what a key should do)

I think there's a lot of precedent for node names to both mean what kind of data it is, and what role it plays in the parent node, though not necessarily both at the same time. Of course, KDL could do "better" in that aspect. However, the distinction is not even all that clear in my opinion. Are a title and description the same type of data? In KDL Schema they both have a single text value and an optional property that gives the language, so that sounds pretty similar. But you could also apply validation to the title to limit its length and require title case, which would make the text a different "kind" of text I feel. Either way, the relation to the parent is strongly, maybe even unambiguously implied.

I did already encounter use cases for this in KDL Schema, where in the info node author and contributor both have the same values/properties/children (both nodes represent a person) but different relations to the parent. In some cases, however, a name might be a bit redundant, either because a relationship to the parent could only have one type of node, or because a type of node unambiguously implies a certain relationship to the parent. For reference, info could look like this:

    info {
        title = title "KDL Schema" lang="en"
        description = description "KDL Schema KDL schema in KDL" lang="en"
        author = person-list {
            person "Kat Marchán" {
                self = link "https://github.com/zkat"
            }
        }
        contributor = person-list {
            person "Lars Willighagen" {
                self = link "https://github.com/larsgw"
            }
        }
        documentation = link "https://github.com/zkat/kdl"
        license = license "Creative Commons Attribution-ShareAlike 4.0 International License" spdx="CC-BY-SA-4.0" {
            documentation = link "https://creativecommons.org/licenses/by-sa/4.0/" lang="en"
        }
        published = date "2021-08-31"
        modified = date "2021-09-01"
    }

But without syntax modifications also like this, which also separates the two concepts:

    info {
        title "KDL Schema" lang="en"
        description "KDL Schema KDL schema in KDL" lang="en"
        person "Kat Marchán" rel="author" {
            link "https://github.com/zkat" rel="self"
        }
        person "Lars Willighagen" rel="contributor" {
            link "https://github.com/larsgw" rel="self"
        }
        link "https://github.com/zkat/kdl" rel="documentation"
        license "Creative Commons Attribution-ShareAlike 4.0 International License" spdx="CC-BY-SA-4.0" {
            link "https://creativecommons.org/licenses/by-sa/4.0/" lang="en"
        }
        date "2021-08-31" rel="published"
        date "2021-09-01" rel="modified"
    }
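For comparison, here is roughly how a consumer might read the rel= version. This is a sketch with a hypothetical, flattened Node type (one string value and an optional rel property), not the API of any KDL library:

```rust
// Hypothetical, flattened node: a name, one string value, optional `rel`.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    value: String,
    rel: Option<String>, // the `rel="..."` property, if present
}

fn node(name: &str, value: &str, rel: Option<&str>) -> Node {
    Node {
        name: name.to_string(),
        value: value.to_string(),
        rel: rel.map(str::to_string),
    }
}

// All children whose `rel` property equals the requested relation.
fn by_rel<'a>(children: &'a [Node], rel: &str) -> Vec<&'a Node> {
    children
        .iter()
        .filter(|n| n.rel.as_deref() == Some(rel))
        .collect()
}
```

The relation to the parent lives in ordinary node data here, so rel is only a convention: nothing in the format itself marks it as special.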

zkat commented 3 years ago

...I definitely don't like that first example as much as the second.

larsgw commented 3 years ago

I hope I didn't misrepresent anyone's argument with that example but it seemed the logical extension of the "right thing" to do with that syntax.

tabatkins commented 3 years ago

This sounds like it would add another component to the data model:

It would, yeah.

I think there's a lot of precedent for node names to both mean what kind of data it is, and what role it plays in the parent node, though not necessarily both at the same time.

That's such a great example. Shows off good things (published and modified both just being date nodes), bad things (title = title..., ugh), and mixed things (explicitly say that author and contributor are multi-valued, vs that being implicit in the data model, but also more syntax to set up the list vs just providing each as they go).

And if I'm looking at this thru the lens of config files, rather than data structures... I like the second one better, too. Hm. HMMMM.

Okay, spitballing. What if the node name could have an optional tag just like primitives, which you could use for whatever, but which idiomatically is used to communicate key alongside type when necessary?

info {
    title "KDL Schema" lang="en"
    description "KDL Schema KDL schema in KDL" lang="en"
    (author)person "Kat Marchán" {
        (self)link "https://github.com/zkat"
    }
    (contributor)person "Lars Willighagen" {
        (self)link "https://github.com/larsgw"
    }
    (documentation)link "https://github.com/zkat/kdl"
    license "Creative Commons Attribution-ShareAlike 4.0 International License" spdx="CC-BY-SA-4.0" {
        link "https://creativecommons.org/licenses/by-sa/4.0/" lang="en"
    }
    (published)date "2021-08-31"
    (modified)date "2021-09-01"
}

I don't know about you, but that looks kinda really good to me? It also unifies the functionality across all types of objects in KDL; beyond the basic semantics communicated by its ordinary syntax (being a number, string, node, etc), any value can have specialized semantics given by a tag: this is a contributor person, this is a date string, etc. And it avoids adding another component to the data model.
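In data-model terms, a sketch of this might be a single optional tag field on the node rather than a separate map of named children. The Node type and the of_type and tagged helpers below are hypothetical, for illustration only:

```rust
// Hypothetical node model with an optional tag, mirroring tags on values.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    tag: Option<String>, // `author` in `(author)person "..."`
    name: String,
    values: Vec<String>,
    children: Vec<Node>,
}

fn node(tag: Option<&str>, name: &str, value: &str) -> Node {
    Node {
        tag: tag.map(str::to_string),
        name: name.to_string(),
        values: vec![value.to_string()],
        children: vec![],
    }
}

// All children of a given node type, whatever their tag.
fn of_type<'a>(parent: &'a Node, name: &str) -> Vec<&'a Node> {
    parent.children.iter().filter(|c| c.name == name).collect()
}

// All children carrying a given tag, whatever their node type.
fn tagged<'a>(parent: &'a Node, tag: &str) -> Vec<&'a Node> {
    parent
        .children
        .iter()
        .filter(|c| c.tag.as_deref() == Some(tag))
        .collect()
}
```

Children stay in one ordered list, so the same tag can appear more than once, and a consumer can query by type (every person) or by tag (the author) independently.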

My JiK example would then become:

/*
{
    "foo": [1,2,{"bar":3}],
    "baz": 4
}
*/
object {
    (foo)array {
        - 1
        - 2
        object bar=3
    }
    (baz)- 4
}

which feels a lot more acceptable imo. That's almost good, which is kinda amazing considering the impedance mismatch we're working with.

larsgw commented 3 years ago

And then multiple person nodes with the tag author are allowed. Sounds good honestly.

tabatkins commented 3 years ago

Yup, JiK would just have a constraint that the tags be unique in a child list, but that doesn't need to (and shouldn't) carry over to KDL in general.
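That JiK-side constraint could be checked with something like the following sketch. The function name and the choice to represent a child list as a slice of optional tags are assumptions made for illustration; untagged children are simply skipped.

```rust
use std::collections::HashSet;

// Returns the first tag that appears twice in a child list, if any.
// Each entry is one child's tag; `None` means the child is untagged.
fn first_duplicate_tag<'a>(tags: &[Option<&'a str>]) -> Option<&'a str> {
    let mut seen = HashSet::new();
    for tag in tags.iter().flatten() {
        // `insert` returns false when the tag was already present.
        if !seen.insert(*tag) {
            return Some(*tag);
        }
    }
    None
}
```

A JiK reader would reject an object whose children produce Some(..) here, while a general KDL consumer would never run the check.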

larsgw commented 3 years ago

The only thing I worry about is that it may seem that the meaning of the tags gets switched around: in properties the identifier is the relation to the node and the tag is the data type, in nodes the node name is the data type and the tag is the relation to the parent node.

tabatkins commented 3 years ago

While it can serve a purpose adjacent to that of the key on a named value, it's not exactly a key and is indeed parallel to the usage of tags on primitives - the node name is the type of node just as the syntax is the type of primitive ("" for strings, digits for numbers), and the tag is a custom elaboration of that type. The relation to the parent is still implicit, just as much as a title ... node implying it's the title for the parent node.

Like, (published)date "1970-01-01" could just as easily be written as publish-date "1970-01-01", it's just that the former allows you to easily talk about the date node and its structure in a generic way in your docs if it's used in several places (rather than having to list exactly which nodes have a "date-like structure"), or allow something to take several types of date-like things without having to provide publish-* node variants for each; (published)timestamp ..., for example.

tabatkins commented 3 years ago

@zkat Thoughts? I can put up a PR if you'd like.

zkat commented 3 years ago

Hmmm.

I've been going back and forth about this, which is why I haven't responded.

I don't know whether I want this yet, but part of my concern is the added implementation complexity, as well as the confusion about whether you should use tags or type annotations for what role. Like, what's the actual guidance for (published)timestamp ... vs (timestamp)published...? I think for this kind of use-case, I like the way the rel= version looks better, but also I know it's less standard. And I do like that this might make JiK nicer.

"the type annotation on the node" does not make me immediately think "ah yes, this is the key I should use", and I think it's confusing when it uses the same syntax as something meant to annotate value types? Does that make sense? I'm just on the fence still tbh.

tabatkins commented 3 years ago

The idea is that it should be similar to the thing annotating value types; if you have nodes named for a relatively generic structure (person, date, etc), the tag specializes them, just as (date)"..." specializes a generic string. We might just be thinking of these with different models, tho.

Confusion over which info goes where is legit, tho I think it already exists. For values, for example, title="foo" vs (title)"foo" vs a title "foo" child node is already something one would have to grapple with. "It's already confusing, so it's okay to add a fourth source of confusion" isn't a great argument tho, I'm aware. ^_^ But it is true that there's already significant flexibility in the syntax, by design, which requires authors of KDL usage to decide between several possible ways to encode a given piece of data, so this isn't a new problem.

Putting JiK to the side, I just think this conflation between "node name as role in parent" and "node name as type for contents" is going to bite people. KV stores are very common in configs and elsewhere, and it's something that XML does very badly (imo). As you can see from the example, it means that people have to smuggle the information in somehow, either with custom node names that serve both roles at once, or with a named value on the node like rel="" which will only be, at best, something driven by best practice, and which hides the parent-relationship in the node's data where it's harder to see.

I think that with good examples, and hopefully some early usage, we can drive the preferred division of responsibility - node names describe the type of data, node tags specialize that into the relationship to the parent (when the node name itself isn't sufficient).

(I will also say that "no way to give named child nodes like you can give named values" has been my one persistent niggling issue since the beginning of KDL, and with this or something similar I think I'd personally finally be 100% happy with the lang. But that's just me; you're the boss here.)

Like, what's the actual guidance for (published)timestamp ... vs (timestamp)published...?

The node name should always dictate what the node is, and what it contains; it's the one piece of data that's always present, while the tag may or may not be. published 1631042979 might make sense on its own, depending on the usage in question, but the (timestamp) annotation won't add anything; in particular, you wouldn't have both published 1631042979 and published "2021-09-07" and expect a (timestamp) or (date) tag to clarify which is which.

zkat commented 3 years ago

hmm. I'm not opposed to taking a PR for this. Thinking about it, it's ultimately up to those using KDL for their own configuration formats to decide whether this kind of feature makes sense for them, just like with type annotations.

As such, let's just merge it in, tag KDL 1.0, and see how it goes. We can always remove it if it turns out to be Very Bad, right? :)

tabatkins commented 3 years ago

That's the spirit. ^_^

I'll do my best to provide some good technical guidance about it, PR incoming either this afternoon or tomorrow.

larsgw commented 3 years ago

Are you working on updates to the schema schema too? Otherwise I can take that up.

tabatkins commented 3 years ago

Feel free; I haven't read the schema in detail, so if you're already up-to-date on it you can probably do it faster/more accurately than I can.

larsgw commented 3 years ago

Thinking about it, it's ultimately up to those using KDL for their own configuration formats to decide whether this kind of feature makes sense for them, just like with type annotations.

@zkat do you want the info node in the schema spec to use the tags already or no?

zkat commented 3 years ago

I'd rather not use it for Schema, no. It's not a feature I see myself using very much, but one that I hope adds some flexibility for use-cases that need it. I like the rel thing to be Good Enough for us.

kdl-org / kdl

Reducing the kdl/json gap #105