Open brunoborges opened 3 years ago
Looks good so far, although I feel that the stated namespace restriction shuts down a promising opportunity before it can be explored. Although I'm inclined to avoid so-called "microformats" as being overly broad in scope, I'll set that objection aside for the time being.
It may be possible to treat a subtable like it's its own TOML hash table, assess a separately specified schema against it, and integrate that assessment into the assessment of the original parent document.
This could be done easily within a TOML file simply by giving a table [subtable] its own toml-schema subtable, like this, which would apply only to that table and its nested descendants. The key names would be localized during subtable schema assessment, but that's the only complication I see.
[subtable.toml-schema]
version = 2
location = "<url>"
What would your objections be to allowing such a sub-schema application?
The default key is a little confusing. How would it be used in practice? By that I mean, when the schema is checked and a key with a default isn't present, then what specific process assigns the default to the key in the resulting configuration? The parser? An active schema validator? The application?
Certainly not a standalone schema validator, because no configuration would be constructed when it runs.
But is default, then, just a fancy comment for what the application should do with a missing key? # assign this if missing?
when the schema is checked and a key with a default isn't present, then what specific process assigns the default to the key in the resulting configuration?
@eksortso my thinking is that the TOML parser should notice the existence of the schema reference, and then validate the document against the schema, and for any missing key, it will check for a default in the schema and grab the value from there, and construct the resulting TOML object with that value.
@eksortso regarding namespaces, what I found to be challenging is the recursiveness of the schema. I'd be happy to support it if someone can bring a solution to the problem.
Forgive the intrusion, but ...
@brunoborges Well, in a sense, there's already limited support for multiple schemas: [toml-schema] has its own schema, and it doesn't need to be repeated in a TOML document for it to be checked. But that does hint at an approach that could be applied throughout a document.
Any part of a TOML configuration will have, at most, one schema to rule over it. The schema over that part would be defined by the toml-schema assigned to it or to its nearest parent. A local schema would completely shadow a more global schema, so there would be no threat of recursion. Any toml-schema table or subtable intrinsically has toml-schema as its schema.
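The shadowing rule above can be sketched in a few lines of Python. This is only an illustration over an already-parsed document (plain nested dicts); the function name `governing_schema` and the sample file names are hypothetical and not part of any proposal.

```python
def governing_schema(doc, path, root_schema=None):
    """Return the toml-schema table that governs the table at `path`.

    Walks from the document root down to `path`; the deepest
    `toml-schema` subtable found on the way shadows shallower ones.
    """
    schema = root_schema
    node = doc
    # `None` stands for the root table itself; then follow each key in turn.
    for key in [None, *path]:
        if key is not None:
            node = node.get(key)
            if not isinstance(node, dict):
                break  # path leaves the table tree; keep what we have
        if isinstance(node.get("toml-schema"), dict):
            schema = node["toml-schema"]
    return schema


# Hypothetical parsed document with a root schema and one local schema.
doc = {
    "toml-schema": {"version": 1, "location": "root.tosd"},
    "title": {"name": "Customers Orders Configuration"},
    "subtable": {
        "toml-schema": {"version": 2, "location": "sub.tosd"},
        "child": {"value": 1},
    },
}

print(governing_schema(doc, ["title"])["location"])
print(governing_schema(doc, ["subtable", "child"])["location"])
```

Because each table resolves to exactly one schema, the lookup is a single downward walk and never recurses between schemas.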
The only complication in a plan like this is how we could assign a schema to every table in a table sequence. An array has no way to assign a table to a key with no table of its own. Perhaps the first table in the sequence can have a subtable called sequence-toml-schema, or some better name, to assign a schema to all table elements that don't have their own toml-schema. And such a sequence-toml-schema would use the intrinsic toml-schema.
Perhaps the first table in the sequence can have a subtable called sequence-toml-schema, or some better name, to assign a schema to all table elements that don't have their own toml-schema.
This is what I found to be difficult. The moment extra tables must be added later on in the document to support more metadata, the TOML document starts to lose its appeal.
One idea that did come to mind was a namespace prefix, just like XML/XSD does.
[toml-schema]
version=1
location="url..."
[toml-schema.cust]
version=1
location="url for customer namespace schema"
[title] # this is top-level schema
name="Customers Orders Configuration"
[customers] # part of top-level schema
[cust:customers.orderSettings] # this one is a customer element
maxitems=3
region="North America"
shipping="UPS"
[customers.orderSettings.header] # this is still part of the customer namespace as it is a child of a table linked to the 'cust' namespace.
comment="some comment"
But then, how to reference the customer schema from the top-level, general schema?
Is this not just a slightly-more-fleshed-out duplicate of #629, #76, or #116? On #629 specifically @pradyunsg makes the point that
That's non trivial and TOML won't be gaining such complexity
I realize this proposal is 'better' than #629 in that the schema is itself a separate TOML document, and I appreciate the amount of thought that has gone into it, but TOML is supposed to be simple and human-oriented; I don't buy any claims that a schema would be solving a real problem. If the TOML is so complex that it requires the parser to perform context-aware validation against a nontrivial schema, maybe TOML was the wrong tool for the job to begin with.
Also consider that by adding a 'url' you're implying the TOML parser needs to either have network awareness built-in (at the very least the ability to do a basic HTTP fetch), or require applications to implement that themselves via callbacks. Neither is a great option, and both are likely to be impossible in many contexts. One of TOML's selling points is its minimalism, both in syntax, and thus the subsequent implementation. The requirements for implementing URL fetching will be a complexity/bloat bridge-too-far for many implementations.
@marzer Here are a few points:
The location parameter doesn't have to indicate a remote URL. In fact, this should read URI and can be a local file. Parsers may provide an extra feature to map a URL back to a local file, for offline processing/validation, just as XML/XSD parsers have supported this scenario for more than a decade.
So in short: the proposal is to find common ground, without adding complexity to the TOML specification itself, but to ensure the specification recognizes the existence of the TOML Schema, and allows for a standard way for defining a pointer to a schema file. That's all.
@marzer I agree with your essential point:
One of TOML's selling points is its minimalism, both in syntax, and thus the subsequent implementation.
But at this point, toml-schema is a separate project, and the impression I get from everyone so far is that it'll always be separate from core TOML, even if it's heavily adopted. It imposes nothing on the core standard, and the schemas themselves are fully compliant TOML documents.
I will disagree with you, vehemently, on the matter of complexity. Configurations always start small. But if a configuration is intended to scale up, there may come a time when a little help to keep things in line would be appreciated, especially when that help is a pure add-on with no additional load borne by the standard.
Just adding a little perspective here:
The syntax isn't the issue: The syntax for JSON or YAML or INI files aren't particularly complex. Heck, the syntax for XML isn't all that complex in most cases.
The issue is knowing what keys are available and the expected/valid values for each key.
Take, for example, Windows Terminal's settings.json file. It's the lifeblood of the Terminal in which one can configure the Terminal's many features. Settings are categorized into four areas: General Settings, Profile Settings, Color Schemes, and Actions.
Without a schema, remembering the names and values for each of the settings is a PITA and having to constantly refer to the docs is not productive.
WITH a schema, editors like VSCode make writing settings a breeze:
While I think that a TOML schema mechanism is a good idea, I agree with others here that it must be optional: TOML parsers may consider schemas, but they are not required to do so. A logical and in my viewpoint very important conclusion from this is that the absence or presence of a schema must not change the data structure resulting from parsing a valid (and schema-valid) document.
Therefore, a ''default'' key as described above cannot be part of the TOML schema spec, since otherwise a schema-aware parser would parse documents into different data structures (with defaults added) than a schema-ignorant parser. Let's not go down that road, since it would fragment the TOML community.
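To make the divergence concrete, here is a small Python sketch. It assumes a hypothetical table of schema defaults and a hypothetical helper `apply_defaults`; neither is defined by the proposal. The point is only that the same document yields two different data structures depending on whether defaults are applied.

```python
import copy


def apply_defaults(data, defaults):
    """Return a copy of `data` with missing keys filled in from `defaults`."""
    merged = copy.deepcopy(data)
    for key, value in defaults.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are tables: fill in defaults one level down.
            merged[key] = apply_defaults(merged[key], value)
        elif key not in merged:
            merged[key] = copy.deepcopy(value)
    return merged


# What any compliant parser would return for the document.
parsed = {"server": {"host": "example.com"}}

# Hypothetical defaults taken from a schema.
schema_defaults = {"server": {"port": 8080}}

schema_aware = apply_defaults(parsed, schema_defaults)

print(parsed)
print(schema_aware)
print(parsed == schema_aware)
```

An application is still free to call something like `apply_defaults` itself after parsing; the objection is to the parser doing it implicitly.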
I don't think anyone is mandating that every TOML doc must have a schema. But we are advocating that TOML should offer/support schemas when presented.
@ChristianSi one more time for the sake of the debate: XML and XSD are two separate specifications. One (XML) recognizes the existence of the other (XSD), but (XML) does not require it (XSD).
Not all XML documents must have a schema.
Therefore, the proposal is to discuss, along with the key TOML contributors and the TOML community in general, whether there is room for a TOML Schema specification, how it should work best, and how the TOML specification should recognize its existence in a way that is standardized (e.g. [toml-schema]), but completely optional.
@eksortso right now, the grammar I drafted does not suggest a fully compliant TOML document, but something similar. If you look closely at the ABNF, it suggests a few keywords for built-in types that are not quoted as strings.
Example:
[document.property]
type = array
arraytype = string
What do you think?
@brunoborges
So in short: the proposal is to find common ground, without adding complexity to the TOML specification itself, but to ensure the specification recognizes the existence of the TOML Schema, and allows for a standard way for defining a pointer to a schema file. That's all.
Well as long as it remains fully optional, such that a parser can completely ignore a schema URI and still remain compliant, I guess I have no complaint. To that end, I second @ChristianSi 's point:
''default'' key as described above cannot be part of the TOML schema spec, since otherwise a schema-aware parser would parse documents into different data structures (with defaults added) than a schema-ignorant parser. Lets not go down that road, since it would fragment the TOML community.
@brunoborges Making TOML schemas themselves fully compliant TOML document sounds like a very good idea. "Eat your own dogfood" and don't proliferate file formats and parser requirements needlessly. Just adding a few quotes here and there seems like a worthwhile price.
I have two questions about the proposed syntax:
1) If the schema refs are to be part of the TOML document structure with a 'magic' table named toml-schema, does it mean that table name is now reserved by the spec, and tables with that name should be validated accordingly by schema-aware parsers?
2) Should schema-aware parsers emit the toml-schema table in the parsed data tree, to keep with older parsers that would treat it as just ordinary, non-magic data?
Note that neither question needs answering at all if the schema is not a part of the TOML document, and instead uses magic comments or similar. Something like:
##! toml-schema = { version = 1, location="url..." }
Which also has the upside of appearing visually distinct from regular TOML, though adds complexity to the language since that requires changes to the ABNF.
@brunoborges I apologize, because I've been basing my assessment on the project README only, and that's not in sync with the project's ABNF.
The README does imply that the schema must be a separate document, because the only thing that the TOML document needs to have is a [toml-schema] table with an external reference in location. Theoretically, and in regard to @marzer's points, that schema could be embedded in the TOML, but only if it's fully TOML-compliant.
Now regarding those non-TOML-compliant value keywords. As long as they remain in TOSD docs, then schemas could have special unquoted value keywords in the TOSD format. That could still have a knock-on effect:
type values.
I'd love to add enumerated values and option types to TOML, but I wouldn't do anything to encourage that, at least not just yet.
@marzer Your comment suggestion reminds me of #522, which was specifically about TOML version pragmas.
Could we use a similar pattern for referring to TOML schemas? Something like the following appearing at the top of the document?
# TOML Schema: v2 https://config.example.com/schema.tosd
@brunoborges Is the version value necessary? Couldn't a separate URI point to the appropriate version of the schema doc?
toml-schema.location = "https://config.example.com/schema_v2.tosd"
Is the version value necessary? Couldn't a separate URI point to the appropriate version of the schema doc?
@eksortso the idea of having the version is to double check the intent. If schema.tosd is now v2.1 internally (although still on the same URL), but the TOML document still refers to the same URL, the parser should double check the version intent and throw an error if it is trying to validate a TOML document with schema v2 against a v2.1 schema file.
So, while it is not necessary, it would add some protection.
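A minimal sketch of that check, assuming the schema file records its own version in a top-level `version` key (an assumption made for illustration; the thread does not say where the schema stores it). The names `check_schema_version` and `SchemaVersionMismatch` are hypothetical.

```python
class SchemaVersionMismatch(ValueError):
    """Raised when a document's declared schema version differs from the schema's own."""


def check_schema_version(reference, schema_doc):
    """Compare the version in a [toml-schema] reference with the loaded schema.

    `reference` is the parsed [toml-schema] table of the document;
    `schema_doc` is the parsed schema file found at `reference["location"]`.
    """
    declared = str(reference["version"])
    actual = str(schema_doc.get("version"))
    if declared != actual:
        raise SchemaVersionMismatch(
            f"document expects schema version {declared}, "
            f"but {reference.get('location', '<unknown>')} is version {actual}"
        )
    return actual


reference = {"version": 2, "location": "https://config.example.com/schema.tosd"}

# Same URL, same version: accepted.
check_schema_version(reference, {"version": 2})

# Same URL, but the schema moved on to 2.1: rejected.
try:
    check_schema_version(reference, {"version": "2.1"})
except SchemaVersionMismatch as exc:
    print(exc)
```

The comparison here is strict equality, which matches the v2 versus v2.1 example; a real tool might prefer a compatibility rule instead.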
I'd love to add enumerated values and option types to TOML, but I wouldn't do anything to encourage that, at least not just yet.
Yeah, I am not a fan of the unquoted enumerated values either. I just really thought they'd make things easier for extensions/plugins and therefore developer experience in general, but I think you are right to say that it is not impossible to add the quotes.
That said, I'll document that a TOML Schema must be a TOML-compliant document. This does raise the question: Is there still a need for a TOML Schema ABNF grammar? I tend to believe that yes, it is still needed, to ensure the structure is correct.
@eksortso Any thoughts?
@marzer here are my thoughts on your two questions:
- If the schema refs are to be part of the TOML document structure with a 'magic' table named toml-schema, does it mean that table name is now reserved by the spec, and tables with that name should be validated accordingly by schema-aware parsers?
- Should schema-aware parsers emit the toml-schema table in the parsed data tree, to keep with older parsers that would treat it as just ordinary, non-magic data?
It is unfortunate that the TOML specification does not set a meta-table format for information regarding the document type (e.g. the version).
HTML for example has a standard way to do so:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
If TOML specification allowed for such standardized construct, then the schema reference could be part of it, along with the TOML specification version that could inform parsers of other metadata.
But, assuming that such construct will never be part of the specification, then my thinking is that we have a few options to consider:
[toml-schema]
Schema-aware parsers must evaluate this, and validate the document against the referenced schema. This table must not be part of the document tree, unless the parser is instructed to do so (opt-in).
Non schema-aware parsers must ignore this table and not append it to the document tree, unless the parser is instructed to do so (opt-in).
[toml-schema]
Schema-aware parsers must evaluate this, and validate the document against the referenced schema. This table must not be part of the document tree, unless the parser is instructed to do so (opt-in).
Non schema-aware parsers by default will treat this table as a regular table and append it to the document tree, unless the parser is instructed to ignore it (opt-out).
The proposal of adding a new construct in the TOML specification seems to be the right solution, as long as this is part of the specification and the grammar.
I really like the following, because it is TOML-compliant in both ways: a comment for non-schema-aware parsers, and a [toml-schema] table for schema-aware parsers. This design meets the same intent as DOCTYPE in HTML.
##! toml-schema = { version = 1, location="url..." }
I would vote for this proposal, without any doubt 👍
@brunoborges
Non schema-aware parsers must ignore this table and not append it to the document tree, unless the parser is instructed to do so (opt-in).
I suppose I should clarify what I took "not schema-aware" to mean here: an old parser that knows nothing about this new feature. If it knows about schemas but chooses to ignore them, then it is schema-aware but also non-enforcing.
Moot point, though; I agree with your points above that having it pragma-style in comments is likely the right direction.
Therefore, a ''default'' key as described above cannot be part of the TOML schema spec, since otherwise a schema-aware parser would parse documents into different data structures (with defaults added) than a schema-ignorant parser. Let's not go down that road, since it would fragment the TOML community.
@ChristianSi I think you are making a really good point here regarding default.
Ideally, a TOML file should output the same data regardless of what parser was used, as long as the parser is compliant with the version of the TOML specification. And if a schema-aware parser generates a data object that is different because it followed the schema and grabbed a few default values, then ultimately the file is different.
In essence what you are saying is that TOML Schema must not influence/modify the data of a TOML file. A TOML Schema can only dictate the data structure and data types; never data input.
I'm down with that.
Is there still a need for a TOML Schema ABNF grammar? I tend to believe that yes it is still needed, to ensure of the structure.
@eksortso Any thoughts?
@brunoborges Well, I'm a big fan of dogfooding, so my advice would be to write the schema standard as TOML using itself to check it. This will hold a lot more weight once TOML v1.0.0 is finally released. I'm not saying this just to be flippant; after all, ABNF was defined using itself for its first specification.
That said, if you want to keep the ABNF around, would it be possible to use the case-sensitive string syntax introduced in RFC 7405?
Hi all,
I incorporated some of the feedback here, and for now, also decided to not focus on the ABNF grammar, and instead on a set of rules. I believe ABNF may be useful later to generate a parser that validates the overall structure of the TOML Schema document.
It is also starting to seem possible to draft a recursive TOML Schema file to validate the TOML Schema itself.
I'd appreciate those interested in this proposal if you could review the new README documentation.
Thank you
@brunoborges I've left some feedback/nit-picks on your Discussions page: https://github.com/brunoborges/toml-schema/discussions/4
(Is that where you want that sort of thing? Or here?)
Would be awesome to get this across. Was just looking for it.
I am inclined to believe that it would be much wiser to just use JSON Schema to validate the JSON object resulting from loading a TOML document.
Instead of reinventing the wheel, try to reuse it. There is a huge number of validators and they could easily be retrofitted to also work with .toml files. Spend time and effort on things specific to it, especially as a schema validation language is by itself a very complex problem. TOML will never be able to catch up with the hundreds of others that are working to improve the JSON Schema validation ecosystem.
I am aware of at least one VS Code extension that already implements support for using JSON schemas to validate TOML files. See https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml#completion-and-validation-with-json-schema
New schemas can be added to the https://www.schemastore.org/json/ database, which editors can use to pick schemas automatically without having to implement manual associations. Sadly, the extension above does not seem to implement support for schemastore yet, but I see no reason why they would not want that. Schemas can be either included in the database or just linked to their location.
See the comment by @bitcrazed .
I am inclined to believe that it would be much wiser to just use JSON Schema to validate the JSON object resulting from loading a TOML document.
TOML handles some types (date, time, datetime) and values (+- inf, nan) that JSON (and therefore JSONschema) doesn't, so JSONschema would need to be extended a bit.
I've used the .toml format as an end-consumer for several years in Rust and a handful of other projects. It's clean and offers notable advantages over other common configuration file formats. I'd even say it seems like an optimal choice for user-maintained configurations.
Now that I am looking to suggest TOML as a configuration format for a project I am currently working on, I have been quite disappointed to find there is no established schema, especially given the long history of it having been proposed.
There seem to have been several excellent suggestions, most notably the one proposed here by @brunoborges. But the most recent comments about JSON Schema are disheartening when they clearly are missing the point. In general, seeing things like this is a red flag to me that the core format doesn't have enough support and requires external intervention to provide a proper feature set.
Yes, JSON Schema can work, but
At this point, I have a few non-ideal choices:
It would be great to see this conversation make some progress.
That being said, is there anything the existing proposals are missing, besides being accepted by the maintainers of the core product? I'd even be willing to help kickstart some tools around toml-schema (like a VS Code extension) just to get the ball rolling - provide some POCs that could be iterated on.
What would it take to get some traction here?
That being said, is there anything the existing proposals are missing, besides being accepted by the maintainers of the core product?
Nothing really; I think everyone already agrees that TOML itself doesn't really need any additional features for toml-schema to work, and that it should always live as a separate specification.
The only point is that there's no clear way to point at a TOML schema from the TOML document itself. The toml-schema document has a special [toml-schema] table for that now. This works, but applications will need to handle it (i.e. not throw an error on "unknown table").
IMHO the comment-based syntax suggested earlier makes more sense:
##! toml-schema = { version = 1, location="url..." }
Although personally I'd probably opt for a simpler space-separated format so you don't need to parse TOML inside a comment:
#schema 2.4 http://example.com/schema.tosd
This way you can easily add a TOML schema to any document without requiring any support from the application.
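Reading such a space-separated pragma takes very little code. The sketch below is one possible reading of the idea, not a specified format: it assumes the pragma must appear in the leading comment block of the file, and the function name `find_schema_pragma` is hypothetical.

```python
import re

# "#schema <version> <location>", e.g. "#schema 2.4 http://example.com/schema.tosd"
PRAGMA = re.compile(r"^#schema\s+(?P<version>\S+)\s+(?P<location>\S+)\s*$")


def find_schema_pragma(text):
    """Return (version, location) from a leading '#schema' comment, or None."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are allowed before the pragma
        if not stripped.startswith("#"):
            break  # assumption: the pragma only counts before the first real line
        match = PRAGMA.match(stripped)
        if match:
            return match["version"], match["location"]
    return None


sample = """\
# My application config
#schema 2.4 http://example.com/schema.tosd
title = "example"
"""

print(find_schema_pragma(sample))
```

Since the pragma is an ordinary comment, a TOML parser that knows nothing about schemas reads the same document without any change in the resulting data.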
The main missing part is just that no one has written any tooling for this; there is no toml-validate file.tosd file.toml tool, as far as I can tell anyway. Nothing in the TOML specification itself is really preventing anyone from writing that though.
So the questions here are basically:
1. Do we want to "officially" support it, for example by mentioning/linking it in the specification and/or moving the repo to https://github.com/toml-lang/toml-schema (that last one has some practical difficulties though; see #895)
2. Do we want to "officially" support some way to embed the metadata, either in the form of a "special table", "special syntax", or an entirely new pragma syntax.
My personal answer for 1 would be "yes, when there's at least one tool that implements it", and for 2 it would be "no, the toml-schema specification can use a comment-based pragma for this".
I found taplo a few days ago, it uses comment based schema references, but json schema to provide validation/autocompletion for toml files.
While json-schema clearly is an imperfect solution to specifying a toml document, I think it might make sense to take existing tooling into account.
Let's say we have a simple TOML document:
title = 'How to polish coins'
published = 2023-09-23
[author]
name = "Donald Duck"
email = "donald@duckburg.quack"
With a simple schema:
[elements]
title = {type = 'string'}
published = {type = 'local-date'}
author = {type = 'table'}
author.name = {type = 'string'}
author.email = {type = 'string'}
The difficulty here is that the schema will be parsed to this:
{
  "elements": {
    "title": {"type": "string"},
    "published": {"type": "local-date"},
    "author": {
      "type": "table",
      "email": {"type": "string"},
      "name": {"type": "string"}
    }
  }
}
And this data structure is a pain to work with, because all values are tables, but some tables represent a schema definition, and some represent further nested keys.
But more importantly, what about:
title = 'How to polish coins'
published = 2023-09-23
[author]
type = "duck"
name = "Donald Duck"
email = "donald@duckburg.quack"
And "type" is already defined so we can't add a key for this:
[elements]
title = {type = 'string'}
published = {type = 'local-date'}
author = {type = 'table'}
author.name = {type = 'string'}
author.type = {type = 'string'} # ALREADY!
author.email = {type = 'string'}
I guess we could make tables implicit, that "solves" it:
[elements]
title = {type = 'string'}
published = {type = 'local-date'}
#author = {type = 'table'} type=table is always implied
author.name = {type = 'string'}
author.type = {type = 'string'}
author.email = {type = 'string'}
But it seems confusing and error-prone. And it creates a new problem as we end up with a data structure like:
{
  "elements": {
    "title": {"type": "string"},
    "published": {"type": "local-date"},
    "author": {
      "email": {"type": "string"},
      "name": {"type": "string"},
      "type": {"type": "string"}
    }
  }
}
So we loop over elements, and the only way to see that "author.type" refers to a type is by checking the value of that being a dict with a type (and only a type).
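That check can be sketched as follows. The helper names are hypothetical, and the extra requirement that the `type` value be a string is my own assumption to separate a definition from a table that merely contains a key called `type`.

```python
def is_type_definition(node):
    """True if `node` looks like a schema definition: a dict with only a string `type`."""
    return (
        isinstance(node, dict)
        and set(node) == {"type"}
        and isinstance(node["type"], str)
    )


def walk(elements, prefix=()):
    """Yield (dotted key, type name) for every definition in a nested schema."""
    for key, node in elements.items():
        path = (*prefix, key)
        if is_type_definition(node):
            yield ".".join(path), node["type"]
        elif isinstance(node, dict):
            # Not a definition, so treat it as an implicit table and descend.
            yield from walk(node, path)


# The parsed schema from above, with implicit tables.
elements = {
    "title": {"type": "string"},
    "published": {"type": "local-date"},
    "author": {
        "email": {"type": "string"},
        "name": {"type": "string"},
        "type": {"type": "string"},
    },
}

for dotted, type_name in walk(elements):
    print(dotted, type_name)
```

The heuristic works for this input, but it stops working as soon as a definition carries any second attribute next to `type`, which is part of why it feels fragile.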
It's all super non-obvious and error-prone to write tooling for this. This is why my earlier comment said: you need to write implementations, because that's when these kinds of problems surface.
So instead of using subtables, maybe just always use string keys:
[elements]
'title' = {type = 'string'}
'published' = {type = 'local-date'}
'author' = {type = 'table'}
'author.name' = {type = 'string'}
'author.type' = {type = 'string'}
'author.email' = {type = 'string'}
This always parses to a flat data structure:
{
  "elements": {
    "title": {"type": "string"},
    "published": {"type": "local-date"},
    "author": {"type": "table"},
    "author.email": {"type": "string"},
    "author.name": {"type": "string"},
    "author.type": {"type": "string"}
  }
}
And is just much easier to work with.
On the other hand, also a bit of a pain to write, and I can see people forgetting those quotes. Although people aren't going to be writing TOML schemas that often, so maybe that's okay?
Either way, I think the current proposal is not going to work out well. I'll continue playing around to see what works, but definitely more work is needed here.
Hopefully some year this comes to fruition. This is my +1 because of an early comment that it's useful for IDE support. I think it needs to accompany the TOML spec, and I'm not sure that a file should declare it, in the same way that a JSON file doesn't have a declaration for its JSON schema, instead relying on a well-known name. The reason it needs to accompany the official spec, though, is that I suspect parsers need to know that they need to implement it. Given a schema document, along with a config file, the parser should be able to provide a list of errors.
I wrote a lot of code for this last year, and have a somewhat working (but unfinished and rather ugly) implementation that does validation and some other stuff. I haven't worked on it in a long time though.
My goal was to validate at least most of Cargo.toml and pyproject.toml. These are probably the most widespread TOML files, so that seems like a good place to start.
To do this and actually make it useful I found you need to re-implement significant parts of JSON schema. While the syntax is perhaps a bit nicer, I don't think it's really worth the effort: you can just use JSON schema for TOML – this is what Taplo does for example, and it seems to work well enough. Perhaps there's a few things that can be improved in JSON schema to better support TOML, or how to use JSON-schema with TOML can be documented better, but this seems like the most useful path forward.
My implementation is bad and unfinished enough that I'd rather not put it on GitHub. I actually don't quite remember what bits are and aren't done and what does and doesn't fully work. If someone really wants to work on it I can send it to them, I guess.
The proposed spec here is nowhere near sufficient, nor is #116. Initially I thought "well, this is easy – we'll just do something like JSON schema but without all that complexity", and slowly discovered a lot of that complexity is needed. It's required complexity. This was not obvious from the outset, and I strongly recommend that anyone working on this start with the implementation rather than the specification. This is why my code is so ugly: I had to switch gears several times during development.
I think we should just close this issue. Anyone wanting to work on it can do so – nothing in the TOML specification is preventing that. This is primarily a matter of tooling, not a matter of specification. Later, when someone has some working tooling, we can perhaps consider adding a new separate specification for it.
I started a discussion in https://github.com/toml-lang/toml/discussions/1038 as even though I think JSON Schema is the right tool to use to validate the structure and contents of TOML files, I also believe the core TOML project still has a role to play in describing how to do that validation well.
Hi all,
I've been designing, along with @aalmiray, a grammar for a TOML Schema document. The proposal can be found in this repository: toml-schema.
The main difference between a TOML document and a TOML Schema document is the existence of key-value pairs with built-in values (keywords). This is one of the points I'd like to get feedback from the TOML community.
It is not the goal of this proposal to support namespaces. A TOML document cannot embed multiple schemas under nested namespaces. This would make this really, really hard to implement and support, with little to no benefit.
Feel free to comment here and/or create issues on the toml-schema project.
Thanks, bb.