toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

Reconsider hex and/or octal integer formats #409

Closed rmunn closed 6 years ago

rmunn commented 8 years ago

Issue #53 was closed in June 2014, because the decision at the time was to prefer simplicity of implementation. So because 0xff00ff or 0o755 were slightly harder to write parsers for than 16711935 or 493, the choice at the time was not to allow hex or octal numbers in TOML.

However, since that time, issue #263 has been decided the other way. Datetime values are non-trivial to parse, but are highly useful in some scenarios. So the decision was made to keep them in, because they are useful to some real users.

These two decisions are inconsistent. If datetimes are going to be in TOML, the same arguments can be (and have been) made for hex and octal representations of numbers, which are a lot easier to write a parser for than datetimes. Most languages already have a hex parser implementation that TOML parsers could take advantage of. And in any language that doesn't, parsing hex values is not complex. It's a problem with "Coding 101 homework" levels of difficulty, not "doctoral thesis" levels of difficulty.

And hex and octal values are useful in many scenarios that TOML is intended for, such as config files. Unix permissions use octal values: 0o755 is much easier to mentally translate to u+rwx, g+rx, o+rx than 491. Or was it 493 or 495? Quick, can you tell which of those three decimal values is the correct conversion of 0o755? I can't without a calculator, and I'd much rather see 0o755 in config files. Hex, of course, is highly useful when dealing with colors or bit flags. Neither is as common in config files as octal, but if we allow octal there's no good reason not to allow hex.
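To make the arithmetic concrete, here is a minimal Python sketch that decodes the three candidate decimal values. The helper name `rwx` is made up for illustration; it is not part of any TOML library.

```python
# Hypothetical helper: render the low nine permission bits of a mode as rwx text.
def rwx(mode: int) -> str:
    flags = "rwxrwxrwx"
    # Bit 8 is the owner's read bit; bit 0 is the "other" execute bit.
    return "".join(
        flag if mode & (1 << (8 - i)) else "-"
        for i, flag in enumerate(flags)
    )

# The three decimal candidates from the paragraph above.
for value in (491, 493, 495):
    print(value, oct(value), rwx(value))
# 491 0o753 rwxr-x-wx
# 493 0o755 rwxr-xr-x
# 495 0o757 rwxr-xrwx
```

Only 493 corresponds to 0o755, and the octal spelling shows that at a glance.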

Therefore, I would ask that #53 be revisited, either by reopening that issue and having the discussion there, or by starting a new discussion here. The reason for closing #53, to keep things simple for TOML implementations, has been abandoned by now, and there's no longer any reason not to allow hex and octal values.

rmunn commented 8 years ago

Also, I'll repeat the comment I made on issue #53 last month: if octal values are included, PLEASE don't repeat C's mistake, as so many programming languages have. A leading 0 in an integer should not change its meaning. The format for octal should parallel the format for hex: 0x123 for hex, and 0o123 for octal. (And 0b101 for binary, if binary is allowed.)

Additionally, I would recommend that the only integer format markers allowed be lowercase: 0x, 0o, and possibly 0b. In particular, I suggest that 0O (digit zero followed by capital letter O) be forbidden by the spec. It's too easy to mistake those two characters for each other in many fonts.

As for hex digits, either they should be lowercase-only (to match how 0x is the only format marker allowed) or they should allow mixed-case; I haven't made up my mind which would be better. I prefer to use lowercase hex digits myself, but some people prefer to see 0xDEADBEEF rather than 0xdeadbeef, so if I were making the decision, I would choose to allow mixed-case in hex digits. It doesn't complicate implementations much, and it lets people follow their preference.

Finally, the question of negative numbers comes up. What should -0x123 mean? What about -0xffff? What about -0xffffffffffffffff? I would recommend that negative numbers should only be allowed in decimal representation, and should be forbidden in hex, octal, and binary representations. So -0x123 would be an error, rather than converting to -291 in decimal.
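The rules suggested in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `parse_proposed_int` is a hypothetical name, underscores are not handled, and rejecting leading zeros in decimal integers follows the existing TOML rule rather than anything new in this proposal.

```python
import string

# Lowercase-only prefixes, as proposed above; 0X, 0O and 0B are deliberately absent.
BASES = {"0x": 16, "0o": 8, "0b": 2}
DIGITS = {
    "0x": set(string.hexdigits),   # mixed-case hex digits are allowed
    "0o": set(string.octdigits),
    "0b": set("01"),
}
DECIMAL = set(string.digits)

def parse_proposed_int(text: str) -> int:
    """Parse an integer under the proposed rules (hypothetical sketch)."""
    prefix, digits = text[:2], text[2:]
    if prefix in BASES:
        if not digits or not set(digits) <= DIGITS[prefix]:
            raise ValueError(f"bad digits in {text!r}")
        return int(digits, BASES[prefix])
    # Anything else must be plain decimal; only decimal may carry a sign.
    sign = -1 if text.startswith("-") else 1
    body = text[1:] if text[:1] in ("+", "-") else text
    if not body or not set(body) <= DECIMAL:
        raise ValueError(f"not a valid integer: {text!r}")
    if len(body) > 1 and body[0] == "0":
        raise ValueError(f"leading zero does not mean octal: {text!r}")
    return sign * int(body)

print(parse_proposed_int("0xDEADbeef"))  # 3735928559
print(parse_proposed_int("0o755"))       # 493
print(parse_proposed_int("-17"))         # -17
```

Under this sketch, -0x123, 0X1f, 0O17 and 0755 all raise ValueError, because a signed or uppercase prefix never matches the prefix table and then fails the decimal check.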

acasajus commented 8 years ago

Also #54 got accepted

FranklinYu commented 8 years ago

I think this is exactly what @BurntSushi was worried about when he hesitated to agree to #54: allowing any one of these features is unfair to all the others, but allowing all of them would make this language no longer "minimal". I would suggest that this feature be included in the standard after most of the available parsers have implemented it, not the other way around; at that point it would be easier for @BurntSushi to decide to merge this feature into the standard.

rmunn commented 8 years ago

Letting the parser implementations drive the spec might be a good idea, but OTOH, that's how we got the mess that is JavaScript. So while I agree that it would be good for parsers to implement this proposal (and it should be easy to implement since most languages have a built-in ability to parse ints in bases other than 10), I think we should also have a discussion about how the spec should specify it. In particular, I think it's VERY important to hash out how octal should be represented -- should a leading 0 signify octal, as it does in C? Or should the 0o755 syntax be the only way to specify octal numbers? That is a discussion that needs to happen in the spec, so that we don't have parsers implementing this in two mutually incompatible ways. (I have a STRONG preference for the 0o syntax, as I've long felt that it was a major mistake in the C language to have leading zeroes in code change the meaning of an int literal. But if lots of people feel otherwise, then we should go with the Principle of Least Surprise.)

Also, as of this writing, a total of 9 unique people have reacted with thumbs-up emoji on either this proposal, or on my March 30th comment on #53. So far I have not seen a thumbs-down emoji or a "We shouldn't do this" response to me. Both @BurntSushi and @mojombo said "Not sure we'll need this, and it would complicate parsers" to the original proposal, but haven't yet responded to this one.

And since their original "Not sure we'll need this" response was a good one, and there does need to be a use-case to justify the extra work for parser implementors, here's a summary of the use cases:

Hex - colors (#ff00ff) and bit flags (0x01, 0x02, 0x04, 0x08, 0x10, 0x20...) seem to be the most likely use cases in configuration files. I could maybe also see specifying "magic numbers" via an array of byte values, e.g. utf8bom = [ 0xef, 0xbb, 0xbf ]. That one's less likely -- but colors and bit flags are probably going to be used a lot, and they really NEED hex to be comprehensible.

Octal - Unix file permissions (0o755, 0o640) really need octal to be comprehensible. Those tend to show up pretty often in configuration files, which seem to be one of the uses TOML is intended for.

Binary - No obvious use case. MAYBE some utility for bitmasks: 0b11111100 is slightly more obvious about which bits it masks out than 0xfc. But if a parser has implemented hex and octal (both of which do have genuine use cases that need them), then the extra cost to implement binary is trivial (in nearly every language, binary-parsing will be basically a copy and paste of octal-parsing, with 8 changed to 2).
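As a quick illustration of these use cases, here is a small Python sketch. The variable names are made up for the example, and Python's own literals stand in for what the TOML values would look like.

```python
import codecs

# Hex color: each pair of hex digits is one channel.
color = 0xff00ff
red, green, blue = (color >> 16) & 0xff, (color >> 8) & 0xff, color & 0xff
print(red, green, blue)  # 255 0 255

# Bit flags: combining 0x01, 0x04 and 0x10 is readable in hex, opaque as 21.
flags = 0x01 | 0x04 | 0x10
print(flags)  # 21

# "Magic number" as an array of byte values: the UTF-8 byte order mark.
utf8bom = [0xef, 0xbb, 0xbf]
print(bytes(utf8bom) == codecs.BOM_UTF8)  # True

# Binary bitmask: the same value as 0xfc, with the masked-out bits visible.
mask = 0b11111100
print(mask == 0xfc)  # True

# Parsing octal and binary digit strings differs only in the base argument.
print(int("755", 8), int("101", 2))  # 493 5
```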

lmna commented 8 years ago

Hex could be a poor man's surrogate for MAC addresses, IPv6 addresses, public key / certificate fingerprints, and RFC 4122 UUIDs. Underscores could be put in place of colons and dashes. Note that IPv6 addresses and UUIDs are 128 bits long.
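The 128-bit note matters because such values would not fit in a 64-bit signed integer, which is the range TOML integers are expected to cover. A rough Python sketch, using arbitrary example values (a documentation-range IPv6 address, a sample UUID and a made-up MAC address):

```python
import ipaddress
import uuid

INT64_MAX = 2**63 - 1  # assumed upper bound for a TOML integer

# Arbitrary example values, chosen only for illustration.
addr = int(ipaddress.IPv6Address("2001:db8::1"))
ident = uuid.UUID("123e4567-e89b-12d3-a456-426614174000").int
mac = int("00_1a_2b_3c_4d_5e".replace("_", ""), 16)  # underscores instead of colons

for name, value in [("ipv6", addr), ("uuid", ident), ("mac", mac)]:
    print(name, value.bit_length(), "bits, fits in int64:", value <= INT64_MAX)
```

A 48-bit MAC address fits comfortably, while the IPv6 address and the UUID exceed the 64-bit range, so they would need strings or some other representation.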

FranklinYu commented 8 years ago

How about having a "lowest standard" without any advanced features, while keeping a "suggested standard" for all the advanced features? Something like "we do not impose this requirement on your implementation, but if you do want this feature, then implement it as below..."? Most of the advanced features, including hex/octal literals and date/datetime literals, serve as extensions to the standard: anything satisfying the "lowest standard" will still be parsed as expected even by a parser supporting such advanced features.

It might be a bad example, but it reminds me of C vs C++ (a bad example because not all C code actually compiles as C++ code).

I am wondering how @BurntSushi and @mojombo like this idea.

update

A better example is the Scheme specification branching into two specifications: R7RS (small) to keep minimalism and R7RS (large) for more functionality.

tshepang commented 7 years ago

So, to summarise, add these 3 ways to represent numbers:

Have these rules:

Am I missing anything?

rmunn commented 7 years ago

That's all I wrote. I just noticed that I didn't mention underscores between digits, which the spec allows for decimal integers. For consistency's sake, I think underscores between digits should be allowed in hex, octal and binary as well, especially since that is what is allowed in languages like Java and F#. If underscores are allowed in decimal numbers but not in hex/octal/binary, that will violate the principle of least surprise.

timbunce commented 7 years ago

Having just come across TOML I was delighted by everything until I noticed the very odd omission of hex literals (and octal and binary by extension). In cases where such values are natural, trying to use anything else goes directly against TOML's "easy to read" objective. Well, to be sure, 16711935 is easy to read but the meaning is much more clearly expressed as 0xff00ff.

rmunn commented 7 years ago

One further refinement of my design suggestion: underscores should be allowed between digits, but NOT inside the 0x / 0o / 0b prefix of a hex, octal or binary number. I.e., 0_xdeadbeef is not acceptable, but 0xdead_beef is allowable. Allowing 0_x, etc., would just make parsing harder for no benefit.

I have not yet decided whether underscores should be allowed between the prefix and the first digit of the number; technically, the x or o or b of the prefix is not a digit, so if we want consistency with the decimal-numbers rule "each underscore must be surrounded by at least one digit", then we wouldn't allow that. But once the 0x has been parsed, the rest is unambiguous, and I could see someone wanting to write 0x_ff_00_ff_80 to represent the RGBA color "magenta, 50% transparent". So I lean towards allowing underscores between the 0x / 0o / 0b prefix and the first digit of the number.

rmunn commented 7 years ago

I've looked further at two existing languages that allow underscores in number literals (Java and F#). In both of these languages, as in TOML, the underscore may appear ONLY between digits, and they do not count a base prefix (0x, etc.) as being a digit for this purpose. In other words, 0x_12_34 is a syntax error in both F# and Java since an underscore may not appear immediately after the base prefix.

To follow the principle of least surprise, I have therefore decided that my TOML spec proposal will use the same rule as Java and F#. So underscores MUST NOT appear immediately after the base prefix. The 0x_ff_00_ff_80 example I gave in my previous comment will be invalid, and would have to be written as 0xff_00_ff_80 to be valid.
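The underscore rule as settled here can be expressed as a regular expression. The following Python sketch is an assumption about how a parser might check it, not text from any spec; `is_valid_literal` is a hypothetical name.

```python
import re

HEX = r"[0-9a-fA-F]"

# After the prefix there must be a digit; each underscore must sit between two digits.
STRICT = re.compile(
    rf"0x{HEX}+(?:_{HEX}+)*"
    r"|0o[0-7]+(?:_[0-7]+)*"
    r"|0b[01]+(?:_[01]+)*"
)

def is_valid_literal(text: str) -> bool:
    """Return True if text is a prefixed literal obeying the proposed underscore rule."""
    return STRICT.fullmatch(text) is not None

for literal in ["0xff_00_ff_80", "0xdead_beef", "0x_ff_00_ff_80", "0_xdeadbeef"]:
    print(literal, is_valid_literal(literal))
# 0xff_00_ff_80 True
# 0xdead_beef True
# 0x_ff_00_ff_80 False
# 0_xdeadbeef False
```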

guai commented 7 years ago

Octals are useless. I see no use for them except in one single case: Unix file permissions. But even Unix has a more user-friendly option with the u, g, o mnemonics. Modern languages tend not to have octals because we don't have 16-bit platforms anymore.

rmunn commented 7 years ago

Octals are almost useless for everything except Unix file permissions, yes — but that's a major use case, and sufficient justification all by itself for including them. The letter-based permissions can be easier to read in some cases (especially for people who don't use Unix very much), but experienced Unix admins find 0o644 to be perfectly easy to read. In fact, I personally would rather express permissions as 0o644 than the equivalent u=rw,g=r,o=r, and plenty of experienced Unix users feel the same way. Note, for example, how the examples for setting file permissions in Ansible did not feel the need to explain that mode 0644 means u=rw,g=r,o=r — but they did feel the need to explain that the textual mode u=rw,g=r,o=r was equivalent to 0644, because the octal notation is more familiar than the textual representation to anyone who uses Unix a lot.

(There's actually another decent reason to use octal, and that's to more easily spot UTF-8 multi-byte sequences in Unicode data, but that's not a use case for TOML. I'm just mentioning it for curiosity's sake.)

guai commented 7 years ago

@rmunn, how about expressing this single case with just strings like "0644"? Octals in the form 0NNN tend to cause problems when people know nothing about octals but know math :) where leading zeros can be omitted. And 0oNNN is poorly recognizable for people who have problems with their sight.

wbober commented 6 years ago

Any progress on this?

@rmunn I'd like to comment on the use cases from a hardware developer perspective. I'd like to use toml as a configuration language for a test rig. Hex and binary are very useful when you deal with hardware, for example, hex is used to refer to memory or register addresses. Binary is very useful when you deal with register values.

guai commented 6 years ago

On octals: here are some relatively new langs that decided not to have them.

tshepang commented 6 years ago

@guai Octals should not be ambiguous if you prefix them with 0o, not just 0.

BurntSushi commented 6 years ago

As requested by https://github.com/toml-lang/toml/issues/330#issuecomment-347202526, I'll weigh in.

First and foremost, this is a backwards compatible addition, since all conforming parsers today will return an error if a user types a hex/octal literal as proposed here. Therefore, there is no particular reason to render a verdict now.

Secondly, I'd personally be in favor of adding at least hex. Octal seems useful for file permissions. @mojombo what do you think?

rmunn commented 6 years ago

Many other new languages have decided to allow octal, but to settle on the 0o prefix instead of the confusing leading zero. https://en.wikipedia.org/wiki/Octal lists Haskell, OCaml, Perl 6, Python 3, Ruby, Tcl 9, and ECMAScript 6 as all supporting octal written in the 0o syntax. So 0o is basically the standard way of doing octal these days.

BurntSushi commented 6 years ago

I agree. If we do octal, we should use a 0o prefix.

guai commented 6 years ago

I definitely agree that 0o is way better than just a 0 prefix, in most sane fonts at least. But I still don't see much use for it at all, because there are no 16-bit platforms out there. The only usable example mentioned in this thread is Unix filesystem access rights. And if there is exactly one use case, isn't it better to support it explicitly in the form of mnemonics?

tshepang commented 6 years ago

@guai what are mnemonics?

pradyunsg commented 6 years ago

I'd like to see hex and octal literals. The former is common for representing multi-bit values and the latter is used for single-bit values (like Unix permissions).

0xDead_Beef and 0o644_000 look good to me too. :)

guai commented 6 years ago

@tshepang, it would be something like flags = rwxr-xr-x or flags = u=rwx,go=rx instead of hm... flags= 0o755 I guess

tshepang commented 6 years ago

I find mnemonics clumsier, and I don't think dedicated support for them is justified (use strings). OTOH, octals are more general, and there is probably some other use for them beyond Unix file permissions.

rmunn commented 6 years ago

@guai - For the Unix access permissions use case, octal numbers are more widely used than the mnemonics, and especially in config files. I can't point you to any evidence for this assertion, since AFAIK nobody has done a statistical analysis. But in my experience, you'll see a lot more chmod 0755 commands than chmod u=rwx,g=rx,o=rx. It's more compact, just as easy to read for any experienced Unix user (and if they're setting permissions in config files, they're probably experienced) and it's what Unix users have come to expect.

And I'd be against a special-case flags = rwxr-xr-x or flags = u=rwx,go=rx example that would translate into numbers. Far too specialized; if a config file wants to allow that, strings are a much better use case there.

guai commented 6 years ago

@rmunn, it's just the sort of craziness everyone got used to.

experienced Unix user

And who would be an average TOML user? In a neighboring thread I was told that the concept of an empty path is not obvious enough for TOML, but that is also something well known to every user familiar with any filesystem.

rmunn commented 6 years ago

@guai - The fact that you said on March 30 that "octals are useless" when they're used in just one use case (Unix file permissions) makes me think that you do most of your development on Windows. Is that correct? If so, you have relatively little experience with Unix, so you wouldn't know just how much more often the octal-number format is used in Unix permissions than the text format. But here's one data point to help convince you: I've been using Linux for about 20 years now, and I can look at permissions like 775 or 644 and tell you exactly what they mean. But every time I try to write the mnemonic permissions, I have to stop and say "Does the o in o=rx mean 'owner', or 'other'?" And then I have to look it up.

Anyway, I've made my point so it's time to move on to a different topic: should binary numbers (0b1101) be included as well?

Pro: Consistency, a.k.a. the "why not?" argument. If you've already written code to handle hex and octal numbers in your parser, handling binary numbers is trivial to add. Con: Not often needed, a.k.a. the "why?" argument.

I was thinking that binary should be dropped from the proposal, but then @wbober mentioned an actual use case: config files for driving a hardware test rig. When you're writing a file to send a specific set of binary digits to a connector, and the connector's pins are numbered from (say) 0 to 15, it's easier to use pinout = 0b1101_0011_0111_0010 than pinout = 0xd372. The binary version of that number will let you see at a glance whether pin 12 has a high signal (1) or a low signal (0), whereas the hex version requires you to do a conversion in your head.
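The pin example can be checked with a short Python sketch. It assumes pin 0 is the least significant bit, which the comment does not state explicitly, and `pin_is_high` is a hypothetical helper name.

```python
# Python 3.6+ accepts underscores in numeric literals, so this mirrors the TOML value.
pinout = 0b1101_0011_0111_0010

def pin_is_high(value: int, pin: int) -> bool:
    """Return True if the given pin (bit index, 0 = least significant) is set."""
    return (value >> pin) & 1 == 1

print(pin_is_high(pinout, 12))  # True
print([pin for pin in range(16) if pin_is_high(pinout, pin)])
# [1, 4, 5, 6, 8, 9, 12, 14, 15]
```

With the binary literal, pin 12 is simply the fourth digit from the left; with hex, the same answer requires expanding the leading digit first.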

So since there's a real user with a real use case (and because the cost of implementing binary format is trivial once you've added hex and octal formats), I'm now in favor of saying "Yes, let's add binary as well".

BurntSushi commented 6 years ago

Folks, I think everything that is going to be said has been said. Let's just sit tight until @mojombo makes a decision.

On Nov 27, 2017 11:09 AM, "Robin Munn" notifications@github.com wrote:


guai commented 6 years ago

@rmunn, I have quite a lot of Unix experience, but I still hate converting those meaningless digits in my head all the time. It's just bad design, and it is still there for legacy reasons.

I think binary is more useful than octal

But there is still a question left: will our user be experienced enough? If they are at least an experienced Unix user, then the point of this topic is fine, but many other decisions were made with less experienced users in mind, I think.

pradyunsg commented 6 years ago

I agree with @BurntSushi. Let's just wait. :)

mojombo commented 6 years ago

Thank you all for your patience and the excellent arguments presented here! It's been a year and a half since this was opened, and as I hoped, time would bear out which features would turn out to be important to real TOML users. I think I've seen enough evidence now that hex, octal, and binary all have reasonable use cases and should be included in TOML as first class citizens. I'll draw up a PR for their inclusion with 0x, 0o, and 0b prefixes respectively.

pradyunsg commented 6 years ago

This issue can be closed. :)

tshepang commented 6 years ago

@pradyunsg why?

pradyunsg commented 6 years ago

Ah. My bad. I thought this was some other issue. :/

mojombo commented 6 years ago

See #507 for the proposal.

rmunn commented 6 years ago

One comment about underscores in numeric literals: my proposal so far has been that an underscore is not allowed between a hex/octal/binary prefix and the first digit of the number. That is, 0x_dead_beef is not allowed, and must be written as 0xdead_beef. I looked at Java and F#, and both of them followed that rule.

I have just learned that C# 7.2 will allow underscores between a prefix and the first digit, so that 0x_dead_beef would be a legal numeric literal in C# 7.2. At the moment, I'm inclined to not change my proposal, and have TOML forbid that syntax, because that's slightly easier for parsers to handle. If Java and F# follow C#'s lead and start allowing 0x_dead_beef literals, then I'd revise my proposal and suggest that the next version of TOML start allowing that as well, for the sake of least surprise.

But it's better to start out strict and then loosen restrictions later, because that keeps backward compatibility. I.e., if the original rule is that 0x_12_34 is not allowed, then everyone will write 0x12_34 in their TOML files, which will still be legal if the restriction is relaxed to allow 0x_12_34. However, if we start by allowing literals like 0x_12_34 and then wanted to shift to disallowing that syntax, then we'd end up invalidating existing config files.
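The backward-compatibility argument can be illustrated with two regular expressions, one per rule. This is only a sketch for hex literals, and the pattern names are made up for the example.

```python
import re

# Strict rule (this proposal): a digit must follow the prefix immediately.
STRICT_HEX = re.compile(r"0x[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*")
# Looser rule (the C# 7.2 style described above): one underscore may follow the prefix.
LOOSE_HEX = re.compile(r"0x_?[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*")

samples = ["0x1234", "0x12_34", "0x_12_34", "0x12__34"]
for text in samples:
    strict_ok = STRICT_HEX.fullmatch(text) is not None
    loose_ok = LOOSE_HEX.fullmatch(text) is not None
    print(f"{text:10} strict={strict_ok} loose={loose_ok}")
# 0x1234     strict=True loose=True
# 0x12_34    strict=True loose=True
# 0x_12_34   strict=False loose=True
# 0x12__34   strict=False loose=False
```

Every literal accepted by the strict pattern is also accepted by the loose one, so files written under the strict rule stay valid if the rule is relaxed later. The reverse does not hold.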

So I recommend keeping the proposal as-is with regard to the underscore rules, but if C# 7.2's slightly looser underscore rules make their way into Java and F# (and other languages that I haven't looked at yet), then we can loosen TOML's underscore restrictions as well, in whatever future version of the TOML spec would be appropriate.