toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

Will custom type syntax be good for TOML health? #603

Closed LongTengDao closed 1 year ago

LongTengDao commented 5 years ago
key = (compute) ' 5 * 60 * 60 '

key = (toSecond) ' 5h '

key = (toTable) [
  ['name', 'age', 'sex'],# head
  ['Jack', '10', 'male'],# item 1
  ['Max', '20', 'male'],# item 2
]

key = (toDOM) '''
  <div>
    <span></span>
  </div>
'''

[table] ('and other custom transform type')

I don't mean that custom type syntax should replace the standard types. I am just wondering whether exploring de facto standards this way might facilitate the development of standard types, with fewer hard-to-settle discussions, and keep these requirements from turning into dialects that conflict with the spec in the future?

eksortso commented 5 years ago

It's an intriguing idea (definitely post-1.0). May I make a suggestion, though?

Custom types would need to be expected by parsers. Perhaps it would be better to put the custom type after the key's name, to associate that type with the key?

[server1]
timeout (seconds) = 300
header (toDOM) = '''
  <div>
    <span></span>
  </div>
'''
guys (csv) = [
  ['name', 'age', 'sex'],# head
  ['Jack', '10', 'male'],# item 1
  ['Max', '20', 'male'],# item 2
]

The parser, given the smarts to handle them, would produce converted output and handle specified constraints. For instance, header with the (toDOM) type would validate the embedded string against a particular HTML standard, and guys with (csv) would hold a string in CSV format equivalent to the 2D array.
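As a rough illustration of the (csv) idea, here is a minimal Python sketch of what a tag-aware parser might do after reading the 2D array. The function name `to_csv_string` is hypothetical; nothing like it is part of TOML or of any existing parser.

```python
import csv
import io


def to_csv_string(rows):
    """Serialize a 2D array of strings into a single CSV-formatted string."""
    buffer = io.StringIO()
    # Use "\n" line endings so the result is easy to compare and print.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


guys = [
    ["name", "age", "sex"],  # head
    ["Jack", "10", "male"],  # item 1
    ["Max", "20", "male"],   # item 2
]
print(to_csv_string(guys))
```

The application would then see `guys` as one string rather than as an array of arrays.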

One example that intrigues me is the use of units, given this. (See #514.) timeout (seconds) = 3m, for instance, could assign a value of 180 to timeout, which expects seconds but can convert minutes appropriately.

LongTengDao commented 5 years ago

@eksortso Oh, I think that's better and thorough!


Hmm... How would you deal with the types of inline array items? Array items all share one type, I know, but the inner tables/arrays in an inline array could differ...

eksortso commented 5 years ago

@LongTengDao Well, we could have both types of syntax. Consider two different things: keys with custom type, and values with custom type.

Here's an example that might please a few people, and for good reason. The guys example above required that all elements be strings. That's a little painful to accept, because ages are numeric. But we can't have numbers in arrays with strings in TOML v0.5.0. But suppose our parser allows an (m) array to be heterogeneous ("m" for "mixed type"). We could use something like guy1 = (m)["Jack", 10, "male"], and then use the following for a table that will be converted to a CSV string:

guys (csv) = [
  ['name', 'age', 'sex'],
  (m)['Jack', 10, 'male'],
  (m)['Max', 20, 'male'],
]

Key types and value types could have different meanings for the same type name, although in practice those meanings would be related. For instance, without using units, we could write (with modified syntax; we could allow value types to go before or after the value, but not both):

timeout (s) = 3 (m)

The (s) here means that timeout expects seconds in context, and the (m) here means the value of 3 has a dimension measured in minutes. The parser with appropriate extension logic will see both types, convert 3 minutes to 180 seconds, and assign 180 to timeout.
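The conversion step described here could be sketched in Python as follows. This is only an illustration: the unit table and the `convert` function are assumptions, and the sketch spells minutes as `min` to keep it distinct from the "mixed type" (m) tag.

```python
# Hypothetical unit table: how many seconds one unit of each tag is worth.
SECONDS_PER_UNIT = {"s": 1, "min": 60, "h": 3600}


def convert(value, value_tag, key_tag):
    """Convert `value`, measured in `value_tag` units, into `key_tag` units."""
    for tag in (value_tag, key_tag):
        if tag not in SECONDS_PER_UNIT:
            raise ValueError(f"unknown unit tag: {tag!r}")
    seconds = value * SECONDS_PER_UNIT[value_tag]
    # True division, so the result is a float even when it is a whole number.
    return seconds / SECONDS_PER_UNIT[key_tag]


# timeout (s) = 3 (min)
print(convert(3, "min", "s"))  # 180.0
```

The key tag says what the application expects, the value tag says what the author wrote, and the parser extension bridges the two.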

All of these uses of custom type would be application specific, but their widespread adoption would suggest updates to the TOML standard in the future.

The type expression in parentheses would not conflict with either the key names or the values. Both would be expressed using traditional TOML syntax, unless that type significantly modifies allowed syntax.

This is far from complete, but it's a start. We just need to remember that for configurations, good documentation and proper templates that include custom key types would need to be written for those special types to carry minimal, obvious meaning to naïve readers.

Update: I realized too late that the SI abbreviation for minutes is "min", not "m". Can't help it, though. My point was that key tags may mean something different from value tags with the same names.

LongTengDao commented 5 years ago

@eksortso

Well, we could have both types of syntax. Consider two different things: keys with custom type, and values with custom type.

This is what I thought, though with some hesitation, but...

we could allow value types to go before or after the value, but not both

This is really a good inspiration! It makes things look O&M!


Then what do you think about syntax in ()?

We can simply specify (bare-key-rule), but I think we need more, at least ("single line string just like key"), and (any expression) may be better, to allow custom constructors called with arguments:

guys = (''' csv ( 'name', 'age', 'sex' ) ''') [
  ['Jack', '10', 'male'],
  ['Jack', '10', 'male'],
  ['Jack', '10', 'male'],
]

guys = ( [ 'name', 'age', 'sex' ] ) [
  ['Jack', '10', 'male'],
  ['Jack', '10', 'male'],
  ['Jack', '10', 'male'],
]

How do you feel?

I'm suddenly a little afraid of things going towards:

guys (md) = '''
| name | age | sex |
| ---- | --- | --- |
| 'Jack' | 100 | 'male' |
| 'Jim' | 200 | 'female' |
'''

table (yaml) = '''
a:
  - 1
  - 2
b: null
'''
pradyunsg commented 5 years ago

This is definitely a post-1.0 discussion. It's definitely intriguing.

eksortso commented 5 years ago

@LongTengDao What syntax would () use? The simplest syntax would allow for a single bare-key-style identifier and that's all. Parentheses surrounding a bare word with no whitespace inside.

;; Part of a naively revised ABNF might look like this.
key = key-name [ ws type ]
type = "(" type-name ")"
type-name = 1*( ALPHA / DIGIT / %x2D / %x5F ) ; A-Z / a-z / 0-9 / - / _

key-name = simple-key / dotted-key

This syntax would allow tables, elements of table arrays, and keys within inline tables to have custom types.

[dictionary (ordered)]
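For readers who prefer code to ABNF, the naive grammar above can be approximated with a regular expression. The Python sketch below is a simplification under stated assumptions: it handles only a single bare key (no dotted or quoted keys), and `split_key` is a hypothetical helper name.

```python
import re

# name, optional whitespace, optional "(type-name)", mirroring
# key = key-name [ ws type ] for simple bare keys only.
KEY_WITH_TYPE = re.compile(
    r"(?P<name>[A-Za-z0-9_-]+)"
    r"(?:[ \t]*\((?P<type>[A-Za-z0-9_-]+)\))?"
)


def split_key(text):
    """Return (key name, type name or None) for a bare key with an optional type."""
    match = KEY_WITH_TYPE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a bare key with an optional type: {text!r}")
    return match.group("name"), match.group("type")


print(split_key("timeout (seconds)"))     # ('timeout', 'seconds')
print(split_key("dictionary (ordered)"))  # ('dictionary', 'ordered')
print(split_key("timeout"))               # ('timeout', None)
```

Because the type name follows the bare-key character set, a phrase with whitespace inside the parentheses is rejected by this sketch.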

To be generous, we could permit multiple type phrases with literal-string-like syntax (except no parentheses and no commas), separated with commas and whitespace.

sign-text (red bg, white interior) = "WRONG WAY"

Value type syntax can be done in a similar fashion.

We probably should just stick to keylike strings for custom type names. Also, I don't think parameters are strictly necessary. If they're needed, then tables can be used to provide them. Here's an example of how we can put a CSV table value (i.e. a string) in the key guys, using a specially constructed table and the (csv) type:

[guys (csv)]
header = ['name', 'age', 'sex']
rows = [
  ['Manny', '100', 'male'],
  ['Moe', '100', 'male'],
  ['Jack', '100', 'male'],
]

Stuff like the (yaml) example could always happen, and I look forward to the future's "Obfuscated TOML" contests. But a sensible template maintainer would not use such a key type. Much less likely would we find parser developers willing to implement YAML in TOML! So we shouldn't worry too much about these sorts of things. Any popular custom type extensions would be promoted with good documentation, useful examples, and plenty of obviousness and minimalism.

To this end, perhaps we can use pragmas, or some variant thereof (see #522) to specify which types are to be accepted by the parser. In combination, these things make specifying the behavior of the types more objective. For instance, imagine over time that units of measure are gradually accepted into the standard. Say that there's an external standard called "units-of-measure" that most parsers will acknowledge. We could see documents written like this:

# TOML v1.n
# units-of-measure v0.1
timeout (s) = 3 (min)

Then later, in a nifty parallel universe after unit dimensions are adopted and variant unit type names are included in the spec:

# TOML v2.n
timeout (seconds) = 3 min

Thoughts?

Edit: Fixed the tag for minutes.

LongTengDao commented 5 years ago
sign-text (red bg, white interior) = "WRONG WAY"

@eksortso You always come up with good ideas and beautiful examples! Wow.

[guys (csv)]
header = ['name', 'age', 'sex']
rows = [
  ['Manny', '100', 'male'],
  ['Moe', '100', 'male'],
  ['Jack', '100', 'male'],
]
ChristianSi commented 5 years ago

In my humble opinion, this is far too complex for TOML, post-1.0 or not. Remember what the 'M' in the name stands for?

LongTengDao commented 5 years ago

In my humble opinion, this is far too complex for TOML, post-1.0 or not. Remember what the 'M' in the name stands for?

@ChristianSi This may depend on whether the complexity is caused by this syntax, or is inherent in actual use. If the latter, then the main purpose of this syntax is precisely to avoid TOML becoming more complex. :)

pradyunsg commented 5 years ago

In my humble opinion, this is far too complex for TOML, post-1.0 or not. Remember what the 'M' in the name stands for?

To be clear, I see this as similar, if not equivalent, to YAML's tags, so I am fairly wary of this. I don't want to block any discussion on this, but I do think it'll be a not-so-easy task to convince me on this, FWIW.

eksortso commented 5 years ago

@pradyunsg, But these sorts of tags (I prefer that name, personally) are not defined the way that YAML's tags are. In all cases covered so far, custom types' usages are all parser-dependent. I'd be fine if that's all they ever are. They serve one specific purpose, defined by the app with the parser, and that's it.

But I hope that during our discourse, you can see value in some of these use cases. I'm pretty pleased with the unit-of-measure tags, and those weren't using types other than TOML's integers. Is it not simpler to say timeout (seconds) = 3 (minutes) to set timeout = 180? Is it not self-explanatory (at least, for those who can read English)? Does it not save time? Does it prevent abuse?

But I would never shove (seconds) and (minutes) into the TOML standard. Just a custom type syntax. Or tag syntax. Let others define what (seconds) and (minutes) mean. All that the TOML project would have to do with popular sets of tags is to refer to them in the wiki, or if they're really popular, register them so they can be used in tag pragmas that willing parsers can rely upon. All the heavy lifting would be done away from the core syntax.

LongTengDao commented 5 years ago

@eksortso Let's discuss some edge cases?

A. How do we tag the array itself, not the item table?

[[ array-of-table (on array of the key) ]] (on first table value)

[[ array-of-table ]] (on second table value)

# just like:

key (on key) = (on value, which same here) 'value'

[ table (on table) ]

[ table ] (on table too, same as above)

# Or opposite for aligning reason:

[[ array-of-table (on 1st table) ]] (on array)

[[ array-of-table (on 2nd table) ]]

[ table (and just allow this) ]

[ table ] (not allow this)

# Or, just give the parser:

[[ array-of-table (tag-a which parser known for table, tag-b which parser known for array) ]]

How do you like this?

I also want to know why you prefer [ key.key (tag) ] over [ key.key ] (tag)?

I think that whatever the final choice is, our basis should be consistent first: being intuitive is the premise, and then being reasonable and unified?

B. How do we tag a table that does not appear directly?

a.b (plan a)
a.b (plan b).x = 1
a.b.y.z = 2

Or just forbid it, and require a.b (plan c) = { x = 1, y.z = 2 } instead?

C. How do we tag on the root table?

(initial lone tag before all expression) # Assume there is a [] root table open statement before the tag?

D. What's the order of tag processing?

key (2nd) = 'value' (1st)
array (5th) = (4th) [
    (3rd) 'item' (2nd),
] (1st)
array = [
    (3rd) [
        'item' (2nd),
    ] (1st),
]

Just in the reverse order they occur like above? Or always from inner to outer, like below?

array (4th) = [
    (2nd) 'item' (1st),
] (3rd)
array = [
    (4th) [
        (2nd) 'item' (1st),
    ] (3rd),
]

Or, from inner to outer, but give the same level tags to the parser at the same time which refer to different meanings:

key (key tag) = 'value' (value tag)

array (key tag) = (invalid -- or value tag for long value) [
    (index tag) 'value' (value tag)
    # but what if here needs a pre value tag for long value
] (value tag)

Sample in JS parser:

function tagProcessorForEach(parent, keyOrIndex, keyOrIndexTag, value, valueTag) { }

Or, only these (key tags) are valid:

[ table (tag) ]

key (tag) = 'value'

array (tag) = [
    (tag) 'value' # tag at where index ("key" for array item) should be, which also nice for long value
]

# Or:

[ table ] (tag) # before content start; and "tag" is just like "comment", after the "[]" statement

key = (tag) 'value'

array = (tag) [
    (tag) 'value' # tag at the front of value always. same result, just another explanation of consistency
]

# Perhaps the post-tag is pretty for primitive values
# (including Boolean, Integer, Float, String, Date-time,
# because parentheses are usually post-tagged in natural language),
# but it can be overlooked for longer strings,
# so should we disable post-tags for strings?
# Or limit the length?
# Or is it better to just allow post-tags for Integer?

key = 1 (s)

# Or the easiest way to do is to leave it out,
# because combinatorial usage is a little less O&M?
# After all, in the first explanation, it's almost always the latter.

And then the order is always from inner to outer.

eksortso commented 5 years ago

@LongTengDao You gave me a lot to think about. Here's my take on the subjects that you raised.

TL;DR

A. How do we tag the array itself, not the item table?

My preference is to bind a key tag to its key's name. The syntax [table-name (with-tag)] does just that.

That said, an exception needs to be made for table arrays. The table array syntax defines the name of the array, but there's no simple way to explicitly separate the array from its elements. The [[]] line starts an element table belonging to the array, and it's expected that the line will appear more than once.

If it's needed, perhaps we allow a tag after the double brackets on the first element table of the array. It would be invalid after any element table line beyond the first one.

# My preference:
[[array-of-tables]] (array-tag) #permitted on 1st [[array-of-tables]]
#...
[[array-of-tables]] #tag not permitted here
#...
[[array-of-tables (table-tag)]] #we can still tag individual elements
#...

This is very much an exception to the norm, as you're about to see.
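The table-array rule stated above (an array tag is allowed only after the first [[...]] header for a given name) could be checked along these lines. This Python sketch is hypothetical: the regular expression, the function name, and the choice to ignore trailing comments are all assumptions made for illustration.

```python
import re

# [[name (table-tag)]] (array-tag), where both tags are optional.
HEADER = re.compile(
    r"\[\[\s*(?P<name>[A-Za-z0-9_.-]+)"
    r"(?:\s*\((?P<table_tag>[A-Za-z0-9_-]+)\))?\s*\]\]"
    r"(?:\s*\((?P<array_tag>[A-Za-z0-9_-]+)\))?"
)


def find_misplaced_array_tags(header_lines):
    """Return indexes of headers that carry an array tag but are not the first element."""
    seen = set()
    misplaced = []
    for index, line in enumerate(header_lines):
        match = HEADER.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"not a table-array header: {line!r}")
        name = match.group("name")
        if match.group("array_tag") and name in seen:
            misplaced.append(index)
        seen.add(name)
    return misplaced


headers = [
    "[[array-of-tables]] (array-tag)",
    "[[array-of-tables]]",
    "[[array-of-tables (table-tag)]]",
]
print(find_misplaced_array_tags(headers))  # []
```

Element tags inside the brackets are accepted on any header, while an array tag on a later header is reported by its index.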

B. How do we tag a table that does not appear directly?

My preference is, if we can't refer to the key, we can't tag it. So something like a.b.x = 1 would not permit b to be tagged. Something like a.b (plan c) = { x = 1, y.z = 2 } ought to be fine.

By the way, despite my earlier slightly enthusiastic comments, I've come to prefer single bare-key-like tag names. So I'd actually prefer a.b (plan-c) over a.b (plan c).

C. How do we tag on the root table?

My preference: We can't. The root table is never explicitly specified, so per B., it can't be tagged. Besides, the application defines the top level's significance, so that shouldn't change.

D. What's the order of tag processing?

I would prefer key (1st) = 'value' (2nd). The example timeout (seconds) = ... requires timeout to expect either a numeric value representing seconds, or a number tagged with a unit of time measurement. I would rather have that expectation in place before the value is processed. If you can think of examples where it would make more sense to handle the value's tag first, definitely share them.

Collection types may affect how the contents are processed. Consider the example [dictionary (sorted)], which implies that if the keys of dictionary were read into an array, they'd be in order, so they would need to be sorted as they're added. The parser would need to know that ahead of time. Also consider the example of the hypothetical mixed-type array (m)[(and-then-there_s) 'Maude', 58, 'female']. That (m) needs to be recognized first, because for a standard TOML array (as of v0.5.0), the 58 would throw an error.

So I would prefer array (1st) = [ (3rd) 'item', ] (2nd). Or rather, array (1st) = (2nd) ['item' (3rd),] which reads a little more nicely. The same would apply to tables, both standard and inline.

Tag ordering should not change depending on whether a tag appears before or after a value. Please recall: tags can only go on one side of a value. So this is invalid: key = (a) 'item' (b) # INVALID.

In any case, we handle, in the order that they appear, all the tags on each key-value pair or table header as they appear in the document.

[men-with-no-name (1st)]
man-afod = "Joe" (2nd)
man-fafdm (3rd) = "Monco"

[men-with-no-name.man-x (4th)]
who="Blondie" (5th)
whos_who (6th) = (7th) ["OK" (8th), "NG" (9th), "UGH" (10th)]

I think tags on non-collection values can be permitted to go on either side of the value (only on one side per value though). And I wouldn't want to exclude their use.

reason = (good-point) "Like you said, pre-tags make sense when you're dealing with long strings."

But I do think tags on collection values ought to come before the collection, and that's easy to show why.

reason-with-commentary = [
    "There's value in readability.",
    "We wouldn't want to obscure the tag.",
    # Though the reasons aren't always obvious at first glance
    "But when you're dealing with fine details",
    "which may be altered by the presence of the tag,",
    "It's important to put that up front",
    "because you'd need to go through the entire collection",
    "before you realize that there is a tag",
    "and discover that you have to deal with it in a particular",
    "way.",
    "This isn't for the parser so much as it's for the human beings",
    "who have to read the TOML code, even if they don't need to.",

] (TL_DR)    # IF THAT POST-TAG ISN'T INVALID, IT OUGHTA BE!

Other Stuff

Returning to an aside that I made earlier, I mentioned the idea of a tag-set registry. This would include, among other things, an online reference of the meanings of various related tags, blessed by the TOML community for each set's merits. Such a registry would value obviousness, minimalism, clean syntax, a high degree of usability, and very little screwing with stuff that doesn't need screwed with.

Such a registry would use URLs, and I do advocate for bare-key-like tag names, which would require little conversion if they're typed in blindly by human beings, or by IDEs that are just trying to be useful.

Thoughts on any of these things?

LongTengDao commented 5 years ago

@eksortso If discussing just one feature is this complex, how difficult it must have been for Tom to invent TOML! XD

  1. [ ] Tag key names and values only, but allow a special exception for table arrays' tag syntax.
  2. [x] No tagging allowed on the insides of dotted keys.
  3. [x] No tagging allowed on the root table.
  4. [ ] Handle key tags before value tags.
  5. [ ] Handle tags on collections, like arrays and inline tables, before handling what's inside them.
  6. [x] Tags on collections ought to come before, not after, they are specified.
  7. [x] For all other value types, put tags before or after the value. But not both.
  8. [ ] Put tags after keys. (I don't discuss that here, but just remember this.)
  9. [ ] I prefer tags that look like single bare keys, for a few reasons.
  10. [ ] A tag-set registry would still be a good idea.

The items I checked look good to me.

4 & 5

Handling tags in forward order may not be possible, as I found when trying to implement it in my parser (ltd/j-toml/xOptions/tag)... Consider this (I break lines in the example inline table to make it easier to read):

grand (tag-a) = {
  parent (tag-b) = {
    child (tag-c) = 'value'
  }
}

When a tag is processed at each level, an inner level cannot read information from more than one layer out, or at least not easily (it would need something like a dom.parentNode API...), while an outer level can easily reach any deeper inner level if it needs to. So I think an inner tag is just preliminary preparation, and tags should be handled from the later (inner) ones back to the earlier (outer) ones.
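A minimal Python sketch of this inner-to-outer order is shown below. It is not how j-toml or any other parser works; the `Tagged` wrapper and the `resolve` function are hypothetical stand-ins for whatever a parser would build internally.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Tagged:
    """A value with a custom tag attached (hypothetical parser output)."""
    tag: str
    value: Any


def resolve(node, handlers):
    """Apply tag handlers from the innermost value outward (post-order)."""
    if isinstance(node, Tagged):
        inner = resolve(node.value, handlers)  # children first
        return handlers[node.tag](inner)       # then this node's own tag
    if isinstance(node, dict):
        return {key: resolve(child, handlers) for key, child in node.items()}
    if isinstance(node, list):
        return [resolve(child, handlers) for child in node]
    return node


order = []


def recorder(name):
    """Build a handler that records when it runs and passes the value through."""
    def handler(value):
        order.append(name)
        return value
    return handler


handlers = {name: recorder(name) for name in ("tag-a", "tag-b", "tag-c")}
doc = {
    "grand": Tagged("tag-a", {
        "parent": Tagged("tag-b", {
            "child": Tagged("tag-c", "value"),
        }),
    }),
}
resolve(doc, handlers)
print(order)  # ['tag-c', 'tag-b', 'tag-a']
```

Each outer handler receives a subtree whose inner tags have already been resolved, which is the "preliminary preparation" described above.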

1 & 8 & 9

[[ a (do-a-to-first_do-b-to-first) ]] (do-x-to-array)
[[ a (do-c-to-second) ]]
[[a]] (do a to first, do b to first, do x to every item)
[[a]] (do c to second)

I think the latter looks clearer (it avoids overwhelming the conspicuousness of the []), which is what you suggested earlier... and the array and the table are sent to the processor at the same time, with the target depending on the tag content and the parser.

10

Did you mention the idea of a tag-set registry before? Sorry, I didn't see it and can't find it...

I'm not sure what you mean. If it's meant for parsers, I think that's good; if it's meant for writing .toml files, I think that may not be good... Tags are a syntax invented to keep custom extensions from conflicting with future standard syntax, so if using tags still requires a spec, they're not really tags any more... Could you explain it more?

drunkwcodes commented 5 years ago

I would like to have a symbol or a term to determine which lines are manually defined in the case of designing a type syntax.

LongTengDao commented 5 years ago

I would like to have a symbol or a term to determine which lines are manually defined in the case of designing a type system.

What do you mean? Currently, all the examples in this issue, use ( ) as the symbol.

drunkwcodes commented 5 years ago

I mean it would be good to see hand-written types are distinguishable from annotation types which are auto generated and be written in place.

In that case, we can simplify the generated types more easily. The conflicts are also easier to be resolved when doing a revision.

LongTengDao commented 5 years ago

I mean it would be good to see hand-written types are distinguishable from annotation types which are auto generated and be written in place.

In that case, we can simplify the generated types more easily. The conflicts are also easier to be resolved when doing a revision.

@drunkwcodes Sorry, I think I need some help.

@eksortso Hi, could you understand what these above mean?

drunkwcodes commented 5 years ago

Parentheses have too many useful meanings besides denoting types.

I got an idea. : for hand-written types and :-> for those calculated types.

LongTengDao commented 5 years ago

Parentheses have too many useful meanings besides denoting types.

I got an idea. : for hand-written types and :-> for those calculated types.

Personally, it's hard for me to distinguish between a type and a calculation, as in the example below:

[a] (table)
head = ['name','age','sex']
body = [
  ['Tom', '19', 'male'],
  ['Jack', '20', 'male'],
]

It's a type, and also a calculation.

drunkwcodes commented 5 years ago

It would be something like this after the first pass.

[a] : table  # unnecessary for it's built-in.
head :-> 1×3 str array = ['name','age','sex']
body :-> 2×3 str matrix = [
  ['Tom', '19', 'male'],
  ['Jack', '20', 'male'],
]

It has canonical types to describe the data. Having a type declaration (on separate lines) is even better.

So we know at first glance that it's a 2-by-3 string table with a one-line header.

A delimiter like , for appending units after the types sounds sufficient.

TheElectronWill commented 5 years ago

Couldn't we just use (customType)? I fail to see the point of using the uncommon :-> symbol.

In TOML, parentheses don't have many meanings, so something like this wouldn't be ambiguous.

[a] # table because of the brackets

head (1x3 string array) = [...]
body (2x3 string matrix) = [...]
LongTengDao commented 5 years ago

@drunkwcodes Hi

The colon is very close to the semantic status of the equal sign, and data file formats generally avoid using both as much as possible, such as YAML and JSON with the colon and INI/TOML with the equal sign.

But it reminded me of TypeScript, which might help https://github.com/toml-lang/toml/pull/116#issuecomment-473600873:

config.js
config.d.ts

config.toml
config.d.toml

But it also means that the colon reads to me like a validator comment expressing "equivalence" rather than an "extra transform", similar to the comments below but with grammatical effect:

[a] # table

head = [ ] # 1×3 str array
body = [ ] # 2×3 str matrix
drunkwcodes commented 5 years ago

Exactly. It's all about readability. Not like parentheses, which make me wonder how much meaning it would be in a markup language.

LongTengDao commented 5 years ago

@drunkwcodes

Exactly. It's all about readiness.

I think you want to write "readness" which maybe means "readability"?

Not like parentheses, which make me wondering how much meaning it would be in a markup language.

Currently, it's mainly for exploring new types that may not be good to add wholesale into the spec.

The four date-time types are examples: they obviously differ from the other types (primitive types and structure types). They are useful, but a time duration is useful too, and many other types are useful in various situations. These are more like syntactic sugar (for { date = 'yyyy-mm-dd', time = 'hh:mm:ss.ddd', offset = 0 } and '5h'), and both supporting them and not supporting them cause problems. So compare these:

TOML v0.5—without any sugar:

# a moment
attendance = { time = '09:00:00', offset = '+08:00' }

# a duration
rest.from = { time = '12:00:00', offset = '+08:00' }
rest.to = { time = '13:00:00', offset = '+08:00'  }

# a period
work = 28800

TOML v0.5—with custom type syntax:

attendance (moment) = '09:00:00' (+08:00)

rest (duration) = '12:00:00 ~ 13:00:00' (+08:00)

work (period) = '8h' (s)

TOML v50—add all into spec:

attendance = 09:00:00+08:00

rest = 12:00:00+08:00 ~ 13:00:00+08:00

work = 8h0s
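To make the middle option concrete, here is a hedged Python sketch of what a handler for work (period) = '8h' (s) might do: read the string and return seconds. The accepted suffixes and the name `period_to_seconds` are assumptions made for illustration only.

```python
import re

# Hypothetical suffix table for the (period) tag.
SECONDS_PER_SUFFIX = {"s": 1, "min": 60, "h": 3600}

PERIOD = re.compile(r"\s*(?P<amount>\d+)\s*(?P<suffix>min|s|h)\s*")


def period_to_seconds(text):
    """Turn a string such as '8h' into a whole number of seconds."""
    match = PERIOD.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognized period: {text!r}")
    return int(match.group("amount")) * SECONDS_PER_SUFFIX[match.group("suffix")]


print(period_to_seconds("8h"))  # 28800
```

The result matches the work = 28800 value in the version without any sugar.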
Gin-Quin commented 5 years ago

Hello there. Here are my thoughts after all this reading. I agree with the comment above about bringing complexity to a language that is intended to be simple. I think tags can be an awesome idea, but the usefulness/complexity ratio should be considered.

Type inference

Adding types to a language is something that has been given a great deal of thought. The most recent programming languages, like Kotlin, Swift, and TypeScript, are all typed languages with type inference, and I think there is a good reason why. Types bring stability and clarity. Type inference brings ease of programming for humans. Writing myNumber = 121 (with type inference) is more readable than myNumber (Integer) = 121, because we are humans behind our computers, and we know what is obviously a number, or a URL, or a date, etc. The present 0.5.0 TOML version uses inference for numbers, dates, and strings, and I think that is an awesome feature. (I would think about extending it to URLs.)

About the syntax

What about using the same syntax as Swift, TypeScript and Kotlin instead of a C-like syntax?

myAge : Number  =  121
myName : String  =  "Zabu"
myBody : BodyPhysic  =  {
  weight = 12,
  height = 14,
  speed = 37,
  eyes = 'Huge'
}
myFriends: [String]  =  [ "Coco", "Bubu"]

Are explicit types necessary?

Kotlin and Swift have type inference, but also explicit types when necessary. But they are programming languages, not configuration/object notation languages. Does TOML need explicit types? Let's see some concrete examples:

work (period) = '8h' (s)
timeout (s) = 3 (min)

When I look at this code, I have a feeling of "That's cool" mixed with another feeling: "That's complicated" :p There is a left and a right type, which means the TOML parser must know both types and how to convert from one to the other. Plus, OK, that kind of double-type conversion is cool for seconds, and also temperatures (Kelvin/Celsius) and anything like that, but practically, I think it's too much work for just some syntactic sugar.

hoursOfWork = 8
timeout = 180  # seconds

That's not as cool, I agree. The human has to convert minutes to seconds himself. But it works fine. There is no ambiguity, thanks to the key name or the comment. And of course the parsing is a lot easier and faster to do.

There is another issue with that kind of conversion: if you create an object from TOML, and then convert the object back to TOML (with a stringify function), you will lose all your type information.

About that kind of code :

body :-> 2×3 str matrix = [
  ['Tom', '19', 'male'],
  ['Jack', '20', 'male'],
]

I think type inference is the best. For me, the (2×3 str matrix) part, however it is written, is not very useful. Humans know what they see, and so should the parser. Since all array elements must have the same type, it is not so hard to infer array types.
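As a small illustration of that last claim, the shape and element type of a homogeneous array can be recovered with a few lines of Python. The function name `infer_array_type` and its return format are assumptions for this sketch, not anything a TOML parser provides.

```python
def infer_array_type(value):
    """Return (shape, element type name) for a rectangular, homogeneous array."""
    if not isinstance(value, list):
        return (), type(value).__name__
    if not value:
        raise ValueError("cannot infer the element type of an empty array")
    inferred = [infer_array_type(item) for item in value]
    # Every item must agree on both shape and element type.
    if any(item != inferred[0] for item in inferred):
        raise ValueError("array elements do not share one type and shape")
    inner_shape, element_type = inferred[0]
    return (len(value),) + inner_shape, element_type


body = [
    ["Tom", "19", "male"],
    ["Jack", "20", "male"],
]
print(infer_array_type(body))                    # ((2, 3), 'str')
print(infer_array_type(["name", "age", "sex"]))  # ((3,), 'str')
```

The "2×3 str matrix" annotation is therefore derivable from the data, which is the argument for leaving it out of the document.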

Now, my favorite point :

[guys] : Csv
header = ['name', 'age', 'sex']
rows = [
  ['Manny', '100', 'male'],
  ['Moe', '100', 'male'],
  ['Jack', '100', 'male'],
]

Ok. Here I see true potential for tags.

User-defined classes

In this example, the user has a CSV object that he wants to convert to/from TOML. Maybe this CSV object has methods, like addRow or something. Maybe it has a special data organization (maybe not header and rows properties). Plus, not all users would need a CSV converter imported with their TOML parser. The idea is to use the CSV constructor (defined by the user) to create a CSV object. Then, instead of having this standard map object as a result:

{
  "guys": {
    "header": ["name", "age", "sex"],
    "rows": [...]
  }
}

we would get a true CSV object by passing the resulting Map object to the CSV constructor. Then the user could also call CSV methods on the result object :

data = TOML.parse(tomlContent)
// data.guys is now a CSV object, so we can call methods :
data.guys.addRow('Hello', '100',  'Toml')

It can work not only with CSV, but with any objects you use in your project, if you've defined a valid constructor. This constructor just has to be accessible by the parser.
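Here is one way the constructor idea might look, sketched in Python. Everything in it is hypothetical: the `Csv` class, the `construct` helper, and the assumption that the parser hands back the plain data together with a mapping from keys to type names.

```python
class Csv:
    """Hypothetical user-defined class built from a parsed table."""

    def __init__(self, table):
        self.header = list(table["header"])
        self.rows = [list(row) for row in table["rows"]]

    def add_row(self, *cells):
        if len(cells) != len(self.header):
            raise ValueError("row length does not match the header")
        self.rows.append(list(cells))


def construct(parsed, type_tags, constructors):
    """Replace each tagged top-level table with an instance of its class."""
    result = dict(parsed)
    for key, tag in type_tags.items():
        result[key] = constructors[tag](parsed[key])
    return result


# What a parser might produce for "[guys] : Csv": plain data plus the type tag.
parsed = {
    "guys": {
        "header": ["name", "age", "sex"],
        "rows": [["Manny", "100", "male"]],
    }
}
data = construct(parsed, {"guys": "Csv"}, {"Csv": Csv})
data["guys"].add_row("Hello", "100", "Toml")
print(data["guys"].rows)
```

The parser itself needs no knowledge of CSV here; it only looks up the constructor registered under the tag name.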

Another example, with a user who needs to work with Books:

books : [Book] = [
  { title = "Bees are cool", author = "BeeLover", chapters = 121 },
  { title = "Dogs+me", author = "DogLiker", chapters = 2 },
]

or...

books : [Book]

[[books]]
title = "Bees are cool"
author = "BeeLover"
chapters = 121

Advantages of this idea :

drunkwcodes commented 5 years ago

@Lepzulnag I like this comparison.

TOML is familiar and formal by now, and this is a type syntax which will be superior to those in programming languages right here. But I like it.

I just googled those type syntax. I may be mistaken.

  1. Type annotation mark in Kotlin is as pure as a : as for returned types.

  2. Swift has innovative ? and ! for optional types. (I think it should be just ? in here.)

  3. Typescript is... nothing special.



The unnecessary parts are for having a room to write down the details.



We are going to pick those symbols up, as formal, informative, and readable as possible.

Because it is TOML.

LongTengDao commented 5 years ago

Indeed, "type inference" and "custom type" (or "user-defined classes") are two things.

Whether (*) or :*, this issue is discussing "custom type".

BTW: "type inference" is intended for variable reassignment, computation, and passing values to APIs. These only happen in programming languages, because a configuration language is static (and its complexity is exactly the same as that of type syntax; they are no different to the computer):

let a :string = 'abc'; // Without the actions and possible errors below,
                       // there is no need to hand-write a type,
                       // because type syntax is almost the same as value syntax
a = {}; // reassignment error
a = a+1; // compute error
! function (p :boolean) { }(a) // passing error

@Lepzulnag So at present, the problem may be: whether we call it "custom type", "tag", or "user-defined classes", is it available for inline elements? Inline tables and inline arrays are also objects, and even string literals like date-time and URL will become objects, so how do we express them if using :?

[block-table] (classA)
inline-array (classB) = [
  (classC) { }
]
inline-primitive (classD) = '/path/to/something'

Would you want one of the options below?

[block-table] : classA
inline-array :classB = [
  classC: { }
]
inline-primitive :classD = '/path/to/something'

[block-table] : classA
inline-array = classB: [
  classC: { }
]
inline-primitive = classD: '/path/to/something'

[block-table] : classA
inline-array = classB: [
  classC: { }
]
inline-primitive = '/path/to/something' : classD

And should we require that a custom type is returned as an object type by the parser's plugins (which are mainly used for configuration formats), in exchange for better support for stringification (which still has many other unresolved problems, such as whether an object is inline or not, and how to preserve dotted keys)?

drunkwcodes commented 5 years ago

@LongTengDao It's a breakthrough from the beginning.

@Lepzulnag A table will not be a list.

So it is clear that it will not have more than one pair of brackets in that expression.

What is the point to annotate type next to the tables?

To have arbitrary table names, having the same type, and to be distinguishable from other mysterious table types in one file?

Gin-Quin commented 5 years ago

@LongTengDao I think Custom types should definitely be available for inline elements. There will be cases where users will need custom inline elements.

Here are some syntax proposals!

(For me, the ' : type' or the '( type )' syntaxes are about the same, except maybe for arrays (see below), where I have a preference for the colon)

Simple values

Nothing too complicated here.

hello : MagicString = "world"  

Tables

Here is a first proposal for tables, sticking to the syntax of other languages:

blob : Blob = {
  innerBlob : InnerBlob = {
    id = 51312
    size : MagicNumber = 3212
  }
}

[blob] : Blob  # With no type, this line would be useless
[blob.innerBlob] : InnerBlob
id = 51312
size : MagicNumber = 3212

Here is another proposal, that I think is very elegant :

blob = Blob {
  innerBlob = InnerBlob {
    id = 51312
    size : MagicNumber = 3212   # The ' : ' syntax is still necessary here
  }
}

But it does not completely remove the need to use a colon (or parenthesis) in some cases.

Arrays

For Arrays, things can get more complex. An Array can have a custom type itself (like, in C++, a class extending the vector class), with its own user-defined methods. This case is easy, because they then work like tables :

myArray : MagicArray = ["Hello", "World"]  # MagicArray has its own methods

# .. or, with the second proposal ..

myArray = MagicArray ["Hello", "World"]  # MagicArray has its own methods

But it can also be a standard array with custom objects. Or a standard array of standard arrays of a custom object. In C++ or Java languages, it is written this way :

Array<Blob>  # an array of blobs
Array<Array<Blob>>  # an array of arrays of blobs

But Swift brings this very intelligent notation :

[Blob]  # an array of blobs
[[Blob]]  # an array of arrays of blobs

If we use the Swift syntax :

myCuteBlobs : [Blob] = [
  { *blob1* },
  { *blob2* },
  ...
]

# ... or ... (using parentheses)

myCuteBlobs ([Blob]) = [...]

# ... or ... (second proposal)

myCuteBlobs = [Blob] [
  { *blob1* },
  { *blob2* },
  ...
]

What do you think?

I feel like in that case, the use of parentheses is a bit weird.

But I mostly feel there are too many [ and ] characters. Users could get confused about the two [[arrayOfArraysOfSomething]] and [[newTableOfArray]] syntaxes.

So here is a last proposal for arrays, using chevrons instead of square brackets for better readability:

myCuteBlobs : <Blob> = [
  { *blob1* },
  { *blob2* },
  ...
]

# ... or ...

myCuteBlobs = <Blob> [
  { *blob1* },
  { *blob2* },
  ...
]

drunkwcodes commented 5 years ago

It's not chevrons. ^ is.

I have an idea: There are three types of blobs.

Blobs, Blobs with inner blobs, and formatted blobs.

myCuteBlobs : ^Blob^ = {
  { *blob1* },
  { *blob2* },
  ...
}

# ... or ...

myCuteBlobs : ^^Blob^^ = [
  { *blob1* },
  { *blob2* },
  ...
]

# ... or ... auto inserts ',' between blobs

myCuteBlobs : ^^^Blob^^^ = [
  { *blob1* }
  { *blob2* }
  ...
]

LongTengDao commented 5 years ago

@Lepzulnag

The last one,

[key] : <tag>
key : <tag> = 'value'
key = <tag> [
    <tag> 'value',
    <tag> 'value',
]

This looks pretty good.

The weakness of : is that, people can't obviously figure out which is key and which is constructor and which is value in single key/value pair key:tag = tag:value. And : can't designate the right boundary of the tag literal, so we can't use complex characters. (^ has this weakness too @drunkwcodes )

The weakness of () is that it's a little strange to use it before a value.

I think using them together is pretty good. And once we use a paired symbol, escaping like <<name, a>b>> becomes possible (or maybe we should discuss the escaping syntax now? Is it right to allow complex characters in a tag? If yes, do we use the TOML string format, or even the inline value format to pass table arguments, or just <...> <<...>> <<<...>>>?)

drunkwcodes commented 5 years ago

@LongTengDao Hey we're on the same track now.

But also, using () here sounds like it's going to be a promise for a type casting while we're talking types.

It can't be guaranteed by the language now. And it misleads people to a certain approach.

The angle brackets are forbidden in here too. It will introduce escaping syntax certainly, e.g. <(`^´)> custom type.

Gin-Quin commented 5 years ago

Since tag names are constructors/classes, in my opinion they should be valid variable names with no special character allowed.

drunkwcodes commented 5 years ago

Ok ok, it's a deal. They should be good "variable names." And the diversity of types should be less than variable names, obviously.

Type Syntax 101:

Given a class named "Variable," and a variable named "class," which is "3", we would write class < Variable > = "3"

This makes me feel like a theoretical badass.

eksortso commented 5 years ago

Apologies for being away for so long. @LongTengDao, after I catch up on my reading, I'll provide you with a proper response. This may take a few days.

LongTengDao commented 5 years ago

Apologies for being away for so long. @LongTengDao, after I catch up on my reading, I'll provide you with a proper response. This may take a few days.

Nice to see you back! A short overview:

# formats that look good but do not cover every case (previous):
[section (tag)]
key (tag) = [
    [
        'value' (tag),
    ],
]

# formats that look good and cover every case (now):
[section] : <tag>
key : <tag> = 'value' # optional
key = <tag> [
    <tag> [
        <tag> 'value',
    ],
]

# do you think using both of them is good (next)?
[section] (tag)
key = <tag> [
   'value' (tag),
   'value' (tag),
]

eksortso commented 5 years ago

It seems like the discussion went way off-track during this thread, primarily because the discussion veered away from tag-like behavior, which is fairly simple in nature, to type-like behavior, which goes well beyond TOML's intentional simplicity. And I have to admit, I've only made this confusion worse with my examples.

Discussion of how TOML tags, after parsing, are interpreted as custom object types is better saved for those particular applications. I won't be discussing those subjects under this issue any longer than necessary. I'm just going to focus on how to do the tagging.

That said, I fully endorse some of @Lepzulnag's principles, restated here (and with my commentary):

There are plenty more topics to address, and I'd like to keep them separate, going forward.

eksortso commented 5 years ago

A few suggestions for alternatives to parentheses were made. I don't like the colon-based syntax; visually, it's too discreet. Because of their applications, tags ought to stand out!

I do like angle brackets, as it turns out. Something like timeout <seconds> = 180 is appealing to read. Plus, using angle brackets in place of parentheses makes the tags look like HTML or XML, which is widely recognized, though the usage is different.

Tags should follow the same format that keys follow. If you can use a tag's name as if it were a key, then it ought to be good. <bareword> tags would be better than <"complicated or ornate⚝"> tags.

Thoughts?

eksortso commented 5 years ago

@LongTengDao, let me split up your last post into a few different posts. A number of things that you mentioned need to be addressed.

With the mentioning of "optional" types, it may be worth revisiting the notion of a null value. Nulls hold no real meaning in configuration; either values are defined or they're not. But in general data description, null means that a value is not defined and that fact is stated explicitly. Since all languages effectively have a "null" object type, it would be painless (post v1.0) to bring null into TOML.

Also, this suggests that tag names ought to allow for a ? suffix, to suggest that the value can possibly be null. It doesn't need to be an explicit requirement for the use of null as a value, but it couldn't hurt.

[bikeshed]
paint <color?> = null

This begs a question of whether tags <color> and <color?> are related. We can leave that question to the application using the tags to answer.

LongTengDao commented 5 years ago

@eksortso

  • [x] spec and parser should not consider how the tags will be interpreted

Sure. It's designed for custom feature.

  • [ ] beyond suggesting fundamental types for well-defined syntaxes

I think the spec should never interfere with the tag feature, to promise that custom tags will never conflict with official features when the spec version is upgraded. Unless the spec states from the beginning which tag formats are reserved. But I still suggest official features to use 5h, not to touch <hour> 5.

  • [x] Possibility to easily implement plugins to deal with widely-used types (each such plugin would recommend its own set of tags for use, each defining their own application of TOML with tags)

Yes. In my parser's experimental implementation, it's easy to combine plugins:

const TOML = require('@ltd/j-toml');
const toml_plugin_a = require('...');
const toml_plugin_b = require('...');

const sourceContent = `
    x = <tag-x> 'value'
    y = <tag-y> 'value'
    z = <tag-z> 'value'
`;

const rootTable = TOML.parse(sourceContent, 0.5, '\n', true, {
    mix: true,
    tag ({ table, key, tag }) {
        switch (tag) {
            case 'tag-x':
            case 'tag-y':
                toml_plugin_a({ table, key, tag });
                break;
            case 'tag-z':
                toml_plugin_b({ table, key, tag });
                break;
            default:
                throw Error('Unknown TOML tag: <'+tag+'>.');
        }
    },
});

  • [x] not use :

Personally, I'm okay either way, whether we use it or not. I leave this point to the other participants.

  • [x] <tag> is better than (tag)

Yeah, if we pick only one. Because () is usually used after words, while in some cases the tag must come before (before array items; before a long value like a long string, inline array, or inline table). Unless we use both of them.

  • [ ] <tag> rule is the same as for bare keys

LGTM. In the future when attributes are necessary, it could be <"a {b:c}"> or <a b="c"> or something else that looks good. But for now, let's discuss: how do we write multiple tags ((tag-a, tag-b) in the previous discussion)?


  • [ ] about <tag?>

I can't see why we need this. Is there any relation to the tag topic? Tags are intended for type conversion, not validation.


One more question.

This rule is more unified (always before value):

key = <tag> [
    <tag> [
        <tag> 'value',
        <tag> 'value',
    ],
]

Do you still think tag after keys is better, when we stop using () (which is usually after things)?

key <tag> = [
    <tag> [
        <tag> 'value',
        <tag> 'value',
    ],
]

key.key<tag> also makes me think of tags being associated with each node (key<tag>.key<tag>), which has no meaning.

LongTengDao commented 5 years ago

Found just now and marked for reference:

Add simple tagging #346

@vagoff Welcome to join the discussion! Time flies and good days come~


Is it possible to append a previously defined value? #612

A reserved syntax for user extension in strings #445

Maybe the custom syntax is also suitable for variable reference requirement:

[tool]
basepath = "~/apps/mytool"
binpath = <mustache> "{{ tool.basepath }}/bin"

[tool]
basepath = "~/apps/mytool"
binpath = <es6> "${ tool.basepath }/bin"
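As a rough illustration of what an application might do with a <mustache>-style value once the parser hands it over, here is a minimal Python sketch. The function name `expand` and the already-parsed `config` dictionary are assumptions made for the example; no TOML parser is known to do this, and the tag syntax itself is only a proposal.

```python
import re

def expand(template, root):
    """Replace each {{ dotted.path }} with the value found in the parsed table."""
    def lookup(match):
        node = root
        # Walk the dotted path one key at a time, e.g. tool -> basepath.
        for part in match.group(1).split("."):
            node = node[part]
        return str(node)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

# Hypothetical result of parsing the [tool] table above.
config = {"tool": {"basepath": "~/apps/mytool"}}
print(expand("{{ tool.basepath }}/bin", config))  # ~/apps/mytool/bin
```

A missing path raises KeyError here, which is probably what a config loader wants, but that choice is also an assumption.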

Other related issues collection:

Feature request: Add a duration/timedelta type #514

Artihmetic expression as values #582

eksortso commented 5 years ago

Pardon me, @LongTengDao, for not responding earlier.

  • [ ] beyond suggesting fundamental types for well-defined syntaxes

I think the spec should never interfere with the tag feature, to promise that custom tags will never conflict with official features when the spec version is upgraded. Unless the spec states from the beginning which tag formats are reserved.

Well, the spec would set some expectations for how the parser handles the tags. In line with a minimal approach, a parser can assign a tag to a key or a value on a first pass, and then, later on, apply special typing or other features based on how the tags are interpreted by the parser plugin. The assignment part would be part of the TOML standard. The interpretation goes beyond the standard. (I said "plugin" because your experimental parser uses plugins, but a plugin isn't strictly necessary to offer special functionality.)
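To make the two-pass idea concrete, here is a small Python sketch of the first pass only: it strips key tags from the text and records them, leaving plain TOML behind for an ordinary parser. The `key <tag> = value` syntax is the proposal from this thread, not part of TOML, and `KEY_TAG` and `split_tags` are hypothetical names. It handles only bare keys on a single line and ignores which table a key belongs to.

```python
import re

# Hypothetical key-tag syntax from this thread: `key <tag> = value`.
# This pattern is a toy; it is not a TOML parser.
KEY_TAG = re.compile(r"^(\s*)([A-Za-z0-9_-]+)\s*<([A-Za-z0-9_-]+\??)>\s*=")

def split_tags(source):
    """First pass: remove key tags and record them, leaving plain TOML text."""
    tags = {}
    plain_lines = []
    for line in source.splitlines():
        match = KEY_TAG.match(line)
        if match:
            indent, key, tag = match.groups()
            tags[key] = tag
            line = f"{indent}{key} =" + line[match.end():]
        plain_lines.append(line)
    return "\n".join(plain_lines), tags

plain, tags = split_tags("timeout <seconds> = 180\nname = 'server1'")
print(plain)  # the same two lines, with the tag removed from the first
print(tags)   # {'timeout': 'seconds'}
```

Interpreting the recorded tags (the second pass) is left to the application, which matches the split between standard and interpretation described above.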

But I still suggest official features to use 5h, not to touch <hour> 5.

Agreed. That's another issue now.

I'd rather save multi-tagging for later discussion.

  • [ ] about <tag?>

I can't see why we need this. Is there any relation to the tag topic? Tags are intended for type conversion, not validation.

It wasn't my intention to use <tag?> to validate a tagged value. I was saying that a key tag with a question mark at the end of the tag's name could, by convention, suggest that the key can explicitly hold a null value, and in some cases may encourage it.

Some languages have "optional" types that are identical to a simpler type except that they allow null values. A tag application like paint <color?> would mean that a config setting paint would be declared to be an optional Color type. This could be made 100% implicit, and some languages (like SQL, for better or worse) just work that way.

But when someone writes and presents you with a configuration template like paint <color?> =, you can rest assured that you could ignore the question of what color your paint is and just supply a null, knowing the program that would use the filled-out config can handle paint being null.
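A sketch of how an application could honor that convention, in Python. Everything here is hypothetical: TOML has no null today, so `None` stands in for one, and `check_optional` is an invented helper that would run after parsing.

```python
def check_optional(key, tag, value):
    """Hypothetical convention: a trailing '?' on a tag permits a null value."""
    optional = tag.endswith("?")
    if value is None and not optional:
        raise ValueError(f"{key} <{tag}> must not be null")
    return value

# paint <color?> = null is accepted; paint <color> = null would raise.
print(check_optional("paint", "color?", None))  # None
```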

Do you still think tag after keys is better, when we stop using () (which is usually after things)?

key <tag> = [
    <tag> [
        <tag> 'value',
        <tag> 'value',
    ],
]

My conception of key tags is that they put expectations on the values assigned to those keys. So it's like paint <color?> = tells the user just before the equals sign that, hey, the paint option is gonna be a color, however you define it, or it could be nothing. The phrasing <color?> paint doesn't carry that same semantic weight.

key.key<tag> also makes me think of tags being associated with each node (key<tag>.key<tag>), which has no meaning.

I don't see that at all. Maybe I'm revisiting the same example too much, but bikeshed.paint <color> makes paint colorful, even if the bikeshed isn't.

jeff-hykin commented 3 years ago

I think the conversation on this topic is going well.

Just some feedback from an outsider to confirm some things and point out some other things (hopefully this will be helpful).

arp242 commented 2 years ago

There are two things here:

  1. Attach a tag to keys (key (tag) = value)
  2. Attach a tag to values (key = value (tag) or k = (tag) value).

I think the second item makes things too complex, stuff like this:

array (5th) = (4th) [
    (3rd) 'item' (2nd),
] (1st)

Is pretty hard to understand.

Even things like this seem too complex to me:

timeout (s) = 3 (m)

And you can just use a different min tag for the key instead:

timeout (min) = 3

Which I find much more obvious.
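For example, an application that wants seconds internally could normalize a key-tagged value like this. This is a Python sketch under the assumption that the parser exposes the key's tag as a plain string; the table of units and the name `to_seconds` are made up for illustration.

```python
# Hypothetical unit tags an application might accept on a key.
SECONDS_PER_UNIT = {"s": 1, "min": 60, "h": 3600}

def to_seconds(value, tag):
    """Convert a tagged number to seconds; unknown tags raise KeyError."""
    return value * SECONDS_PER_UNIT[tag]

# timeout (min) = 3
print(to_seconds(3, "min"))  # 180
```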


Personally I think the key (tag) = value syntax might be a good thing to add, but nothing more. This is:

  1. Still fairly obvious.
  2. Not too complex.
  3. Easy to implement.
  4. Provides enough flexibility for most cases.

Also maybe adding multiple tags might be a good idea:

img1 (base64,png) = '''[..]'''
img2 (base64,jpeg) = '''[..]'''

On the other hand, all of this will be implementation-defined, and an implementation can already do the same with just regular strings:

img = '''base64,png: [..]'''

key = 'compute: 5 * 60 * 60'

timeout = 'min: 3'

Which avoids having to add any syntax. I'm not so sure if the k (t) = v syntax really has much advantage over this, other than looking a bit nicer.

jeff-hykin commented 2 years ago

Personally I think the key (tag) = value syntax might be a good thing to add, but nothing more. This is:

1. Still fairly obvious.

2. Not too complex.

3. Easy to implement.

4. Provides enough flexibility for most cases.

I agree with all this.

On the other hand, all of this will be implementation-defined, and an implementation can already do the same with just regular strings:

img = '''base64,png: [..]'''

key = 'compute: 5 * 60 * 60'

timeout = 'min: 3'

But this creates a big string escaping problem. What if we want the value to literally be the string 'compute: 5 * 60 * 60'? We then have to define an escaping mechanism like 'string: "compute: 5 * 60 * 60"', and now basically all strings that contain colons need to use that syntax, which can be a painfully sharp non-standard edge case.

That problem^ is the main reason I'm advocating for tags. Because without tags there are hard-coded assumptions and painful edge cases, on top of a lack of standards and custom/manual parsing.
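A toy Python reader for the 'prefix: payload' string convention shows the ambiguity. The set of known prefixes and the name `interpret` are assumptions for the example, not anything an existing parser does.

```python
def interpret(value):
    """Toy reader for the hypothetical 'prefix: payload' string convention."""
    prefix, sep, payload = value.partition(": ")
    if sep and prefix in {"compute", "min", "string"}:
        return prefix, payload
    return "string", value

# The author wanted this literal text, but it is read as a computation:
print(interpret("compute: 5 * 60 * 60"))          # ('compute', '5 * 60 * 60')
# So the literal has to be wrapped in an escape prefix:
print(interpret("string: compute: 5 * 60 * 60"))  # ('string', 'compute: 5 * 60 * 60')
```

Every string that happens to start with a known prefix needs the escape, and each application would invent its own.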

arp242 commented 2 years ago

But this creates a big string escaping problem. What if we want the value to literally be the string 'compute: 5 * 60 * 60'? We then have to define an escaping mechanism like 'string: "compute: 5 * 60 * 60"', and now basically all strings that contain colons need to use that syntax, which can be a painfully sharp non-standard edge case.

Ah yeah, that's a good point; and every application that wants something like this will have to figure out escaping as well.

tintin10q commented 1 year ago

Nearly all security problems come from user input.

I don't think the ability for the TOML parser to parse further according to tags should be in the specification.

In theory the syntax is obvious and minimal but in real life it won't be. In my opinion, after the parser has parsed the TOML file it should simply return primitive values native to most programming languages to the application. The application can then decide how to interpret and use these values. The application will know what it needs to do. If a TOML parser implementation is allowed to automatically parse input further based on tags, you are taking control away from the main application and tie/lock it to the implementation of the parser. This actually makes everything infinitely more complex and brings potential security vulnerabilities.

With tags, you could trick parsers into parsing complicated things, and every parser's implementation will support different tags. For example, with a #png tag you could force a parser to start parsing a PNG image, or HTML with #html, and these are not trivial at all to parse safely. To parse these formats, TOML parser implementations will likely use other people's parsers. This means that the TOML parser function could actually call many different parsers for different languages, even if the application doesn't expect a #png value but one was supplied anyway. All these other parsers for other languages have to be trusted.

What if the parser implementation decides that if you put a #png tag that it has to read it as a filename and then read a file from disk? Someone could put the name of a password file? You could say parser implementers won't do this. But they could if the tags exist because it allows for it. The implementers and users of TOML have to be protected from this. TOML doesn't know enough about where it is used to make this decision, but that doesn't mean people won't attempt to make these decisions.

Tags will have different behaviours between different programming languages and different implementations. A #base64 tag might produce bytes in Python, but in JavaScript it might produce a byte array. One is mutable, the other is not. This means you get different behaviour and outcomes for the same tags, and for this to be secure in your application you have to look at the documentation of the parser for what could happen. You really should only have to look at the TOML spec to know what your TOML file will do.

This proposal opens up the ability for the TOML parser to arbitrarily call other parsers (or any code) based on what the TOML parser you're using has implemented. That is a terrible and dangerous idea.

What all these points have in common is undefined behaviour. Adding tags by definition adds undefined behaviour to the spec because the implementations can do whatever they want when they encounter a tag. Adding undefined behaviour is a really, really bad idea.

eksortso commented 1 year ago

@tintin10q At the heart of your criticisms is your sentiment, which I find myself agreeing with wholeheartedly:

You really should only have to look at the TOML spec to know what your TOML file will do.

This is appealing because it enforces the principles of obviousness and minimalism. (This actually states your case more strongly than bringing up "undefined behavior," which is arguably bad practice in programming language specs, but TOML is not a programming language. But I digress.)

In our discussions, we talked about arbitrary things that tags could do, which certainly falls outside the scope of TOML and which I must admit I was speculating about at length without security concerns. So I am changing my tone, but I have a different approach now, which I will elaborate on below. But first, let me see if I understand your point of view.

Allowing the possibility for arbitrary behavior in the specification could make it seem like we encourage abuses of the syntax. We certainly don't. However, violations are already possible without changes to the spec, because some parsers read and preserve comments, and some consumers may read those comments and make changes to their configurations. Currently, we do not make explicit that comments ought to be ignored by parsers, because format-preserving parsers and TOML document encoders need that room to maneuver.

I'll be opening a PR which is intended to curb this abuse. But I'll also open a separate issue around the syntax that's been discussed here, because the idea of parenthetical comments may be worth considering. Neither of these things is intended to address notions of type syntax (which in TOML is completely determined by existing value syntaxes), but I will still refer to this issue when I make them since these proposals stem from the exploration conducted here.

tintin10q commented 1 year ago

TOML is not a programming language

I agree, TOML is not a programming language and it should not be. It is an input language. However, the parsers are written in programming languages. I believe that if arbitrary tags were added, TOML could become a programming language, because with arbitrary behavior it would be essentially undefined what a parser should do when encountering a tag, which means it would be up to the parser to decide, and the parser could decide to do anything. But perhaps this is not the same definition of "undefined" as the one on the Wikipedia page I linked.

Allowing the possibility for arbitrary behavior in the specification could make it seem like we encourage abuses of the syntax.

You understand my view, although I don't think "encourage" is the right word. Even if you explicitly discouraged abuse in the spec, the ability to do so would be there, and that will go wrong at some point with people wanting to do 'clever things' with their parsers. Then we get back to "You really should only have to look at the TOML spec to know what your TOML file will do."

About the comment violations: I agree that somehow preserving the comments is nice. Otherwise they would all be removed from the file when you read and write back a TOML file. However, this is less of an issue than the arbitrary tags, because comments do not have to be parsed any further: they are just strings and should stay strings. But clearly defining how parsers should deal with comments is of course a good idea.

and some consumers may read those comments and make changes to their configurations

By consumers, do you mean a parser implementation or an application? I think that if you mean an application then this is not that bad. Although I wouldn't say that it is a good idea, it is still the application making the choices, not the parser. If you do mean the parser, then I would say that that parser is just not compliant with the TOML spec and is being too clever.

With something like a #html tag or #base64 tag or #url tag, the parser will probably call other people's parsers to do this. This increases dependencies and the attack surface, as people could also supply tags that the application doesn't actually expect, which could be a security issue. Unless specified in the TOML spec, it should be up to the application how to parse things further, not up to the parser implementation.
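One way to picture "the application decides" is an allowlist on the application side. The Python sketch below assumes a hypothetical parser that returns plain values plus a map of tags as data only, without decoding or fetching anything; `values`, `tags`, `HANDLERS`, and `resolve` are all invented for the example.

```python
import base64

# Hypothetical parser output: plain values plus tags, nothing interpreted yet.
values = {"icon": "aGVsbG8=", "avatar": "/etc/passwd"}
tags = {"icon": "base64", "avatar": "png"}

# The application, not the parser, lists the tags it is willing to act on.
HANDLERS = {"base64": base64.b64decode}

def resolve(key):
    """Return the value for key, applying a handler only for allowed tags."""
    tag = tags.get(key)
    if tag is None:
        return values[key]
    if tag not in HANDLERS:
        raise ValueError(f"unexpected tag <{tag}> on key {key!r}")
    return HANDLERS[tag](values[key])

print(resolve("icon"))  # b'hello'
# resolve("avatar") raises ValueError, because this application never opted in to png.
```

With this shape, an unexpected tag in user input is rejected instead of silently triggering another parser or a file read.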