toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

Array syntax idea #1031

Open MarcusJohnson91 opened 1 month ago

MarcusJohnson91 commented 1 month ago

Name = PNGTests
NumCases = 1

[Case.0]
State = TestState_Enabled
Outcome = Outcome_Passed
Size = 138 # Num hexadecimal digits, divide by 2 to get the number of bytes.
Data = 0x[89504E470D0A1A0A0000000D4948445200000001000000010001000000376EF9240000000A49444154000078016360000000020001737501180000000049454E44AE426082]

Nothing except hexadecimal digits is allowed after 0x[ and before the closing ]: no whitespace, no commas, nothing else.

It makes parsing binary data stored in a toml file MUCH easier and faster to process.

I don’t see why binary and octal couldn’t have the same syntax, 0b[ and 0o[ respectively, but I have no need for them myself.
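For illustration, here is a minimal Python sketch of how a parser might accept the proposed literal under the strict rule above. The helper name `parse_hex_array` is hypothetical, and rejecting an odd digit count is an assumption based on the "divide by 2" comment in the example; nothing here is part of TOML or of any existing parser.

```python
import re

# Hypothetical pattern for the proposed literal, e.g. 0x[89504E47].
# Only hexadecimal digits may appear between the brackets.
_HEX_ARRAY = re.compile(r"0x\[([0-9A-Fa-f]*)\]")


def parse_hex_array(literal: str) -> bytes:
    """Convert a strict 0x[...] literal into a bytes object."""
    match = _HEX_ARRAY.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a valid 0x[...] literal: {literal!r}")
    digits = match.group(1)
    # Assumption: two digits per byte, so an odd count is an error.
    if len(digits) % 2 != 0:
        raise ValueError("odd number of hexadecimal digits")
    return bytes.fromhex(digits)


png_signature = parse_hex_array("0x[89504E470D0A1A0A]")
print(png_signature)  # b'\x89PNG\r\n\x1a\n'
```

Under this rule an input such as `0x[89 50]` raises `ValueError`, because the space is not a hexadecimal digit.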

arp242 commented 1 month ago

I assume what you want is that this:

data = 0x[89 50 4e]

will be treated like:

data = "\x89\x50\x4e"

Or something along these lines.

If I had a lot of binary data, I'd just put it in a string like:

data = '''
89 50 4E 47 0D 0A 1A 0A 00 00 00 0D
49 48 44 52 …
'''

And then write a custom parser in your language to deal with that. Should be easy enough in most languages.

Or use an array of numbers, or escapes like above.

This seems far too rare a use case to add to TOML.

At the very least we'd need a few examples of real-world TOML files that see actual use where this feature would be useful.

eksortso commented 4 weeks ago

Remember that parsers treat things like "\x89\x50\x4e" as strings, not as byte arrays. In fact, "\x89" on its own is two bytes when encoded as UTF-8.
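The difference is easy to check. The short Python snippet below is only an illustration of the encoding point, not of any TOML parser's behavior: the escape produces the code point U+0089, which UTF-8 encodes as the two bytes C2 89, whereas a byte array would hold the single byte 89.

```python
text = "\x89"                   # one code point, U+0089
encoded = text.encode("utf-8")  # two bytes: 0xC2 0x89
raw = bytes([0x89])             # the single byte a byte array would hold

print(len(text), len(encoded), encoded != raw)  # 1 2 True
```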

The custom parser idea makes more sense in context, but to be honest, I need more context. We do see more computer-generated values, like hashes, in use cases where TOML is not touched by humans. But invariably that binary data is expressed as a hex string. It's similar to the proposed 0x[] syntax in that way. I'm not opposed to a byte array value type, but I'm skeptical that we need it everywhere.

Send us more use cases where byte arrays need their own special syntax. Why would a human-centric format need a value type that emits blobs of arbitrary binary data and skips the checks available to an intermediate string format?

MarcusJohnson91 commented 4 weeks ago

Where are you guys getting the idea that it’s a string?

Granted, technically the entire TOML file is a string.

But beyond that, it’s just back-to-back hex digits after the 0x[ and before the ].

No spaces, no escapes. Just hexadecimal digits.

arp242 commented 4 weeks ago

> Remember that parsers treat things like "\x89\x50\x4e" not as byte arrays. They're strings. In fact, "\x89" on its own is two bytes when encoded as UTF-8.

Oh yeah, of course 🤦 I've been dealing with a lot of these kinds of (non-UTF-8) strings this week, and in context switching I just forgot TOML doesn't work like that.

eksortso commented 4 weeks ago

> Where are you guys getting the idea that it’s a string?

I did know that what you are talking about are byte arrays. My point was that in TOML, string values are quoted, so custom parsing would still be required to turn such strings into proper byte arrays. This was in response to @arp242, who already replied.

> No spaces, no escapes. Just hexadecimal digits.

We need more use cases for binary arrays before we could proceed with creating new syntax, and I implore you and others to provide those use cases.

But even if we proceeded, I would not sign off on such a limited expression set. I would also want to allow ignorable newlines and whitespace between the brackets: human beings might want to use the syntax, and they would most likely want to avoid long lines and unbroken sequences of hex digits. Correct me if I'm wrong.
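To make the more permissive variant concrete, here is a hedged Python sketch that ignores spaces, tabs, and newlines between the brackets. The function name `parse_hex_array_lenient` is hypothetical. Whether whitespace may also split the two digits of a single byte is left open in the discussion; this sketch allows it purely for simplicity.

```python
import re

# Hypothetical pattern: hex digits plus ignorable spaces, tabs and newlines.
_LENIENT = re.compile(r"0x\[([0-9A-Fa-f \t\r\n]*)\]")


def parse_hex_array_lenient(literal: str) -> bytes:
    """Convert a 0x[...] literal to bytes, ignoring whitespace inside."""
    match = _LENIENT.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a valid 0x[...] literal: {literal!r}")
    # Drop all whitespace, then require whole bytes (an assumption).
    digits = "".join(match.group(1).split())
    if len(digits) % 2 != 0:
        raise ValueError("odd number of hexadecimal digits")
    return bytes.fromhex(digits)


wrapped = "0x[\n  89 50 4E 47\n  0D 0A 1A 0A\n]"
print(parse_hex_array_lenient(wrapped))  # b'\x89PNG\r\n\x1a\n'
```

Commas and other separators are still rejected, so the only relaxation relative to the strict proposal is the ignorable whitespace.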