toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

Add syntax for hexadecimal floating point values #562

Open nolange opened 6 years ago

nolange commented 6 years ago

C11/C++11 added hexfloat as an alternative floating-point representation. The underlying reason is that machines represent floats in base 2, while the other text formats are base 10. That is a significant issue, because a correct conversion has to calculate with huge integers to guarantee correct results. Look at implementations such as double-conversion for reference; some even use heap allocations for the large integers required. The new hexfloats, by contrast, are trivial to parse.

TOML would be a nice option for small and embedded systems as well, and being able to read float values easily would help a lot there. So please add them. I believe the addition would be conflict-free for existing TOML files (just as it was added conflict-free to the C/C++ standards).

Format description is taken from floating_literal.

4) Hexadecimal digit-sequence representing a whole number without a radix separator. The exponent is never optional for hexadecimal floating-point literals: 0x1ffp10, 0X0p-1
5) Hexadecimal digit-sequence representing a whole number with a radix separator. The exponent is never optional for hexadecimal floating-point literals: 0x1.p0, 0xf.p-1
6) Hexadecimal digit-sequence representing a fractional number with a radix separator. The exponent is never optional for hexadecimal floating-point literals: 0x0.123p-1, 0xa.bp10
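
To make the three forms above concrete, here is a minimal Python sketch that accepts exactly those shapes (with the mandatory exponent) and converts them. The helper name `parse_hexfloat` and the optional leading sign are assumptions of this example, not part of the proposal, and the result is only exact while the hex digits fit into a binary64 significand (53 bits).

```python
import math
import re

# Forms 4-6 above: hex digits, an optional "." with optional fraction digits,
# and a mandatory binary exponent introduced by "p" or "P".
_HEXFLOAT = re.compile(
    r"([+-]?)0[xX]([0-9a-fA-F]+)(?:\.([0-9a-fA-F]*))?[pP]([+-]?[0-9]+)"
)


def parse_hexfloat(text: str) -> float:
    """Parse a C-style hexadecimal float literal (hypothetical helper)."""
    match = _HEXFLOAT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a hexadecimal float literal: {text!r}")
    sign, int_digits, frac_digits, exponent = match.groups()
    frac_digits = frac_digits or ""
    # Read all digits as one integer, then shift by 4 bits per fraction digit.
    mantissa = int(int_digits + frac_digits, 16)
    value = math.ldexp(mantissa, int(exponent) - 4 * len(frac_digits))
    return -value if sign == "-" else value


for literal in ("0x1ffp10", "0X0p-1", "0x1.p0", "0xf.p-1", "0x0.123p-1", "0xa.bp10"):
    print(literal, "->", parse_hexfloat(literal))
```

The whole conversion is one integer parse plus one power-of-two scaling, which is the sense in which hexfloats are cheap to read compared with decimal floats.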

SiebrenW commented 6 years ago

I would like to second this. Rounding errors from decimal-to-float conversion and vice versa have bothered me for ages. This may also raise the question of whether we need a way to define decimals, rather than floats, through a decimal input for languages that support those.
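
As a small illustration of the rounding issue (a sketch in Python, chosen here only because it exposes both views of a float), the decimal literal `0.1` is stored as a nearby binary value, and its hexadecimal form shows the stored bits directly:

```python
from decimal import Decimal

x = 0.1
print(Decimal(x))        # the exact binary64 value behind the literal 0.1
print(x.hex())           # 0x1.999999999999ap-4
print(0.1 + 0.2 == 0.3)  # False: each literal was rounded before the addition
print((0.1 + 0.2).hex(), (0.3).hex())
```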

nolange commented 5 years ago

Ping?

Is there anything I can do to speed this up?

pradyunsg commented 5 years ago

As I noted in #617, we won't introduce new syntax in TOML 1.0 on top of TOML 0.5.0. While I do see the benefits to this, I'm going to defer this until after TOML 1.0.

graza commented 5 years ago

Copied from #617 since this comment is more appropriate in this issue.

Why not use decimal floating-point? It's kinder to humans (and isn't that the point of TOML?).

Modern processors are now starting to support the decimal floating-point standard defined in IEEE 754-2008. For example the Intel C compiler now has support for the data types and functions to check the support status.

The C data types in the standard are _Decimal32, _Decimal64, and _Decimal128. Constants/literals in C use a suffix to denote which data type the value is to be represented in. I believe the suffixes are as per this proposal in 2008:

  • f d l F D L
  • df dd dl DF DD DL

These denote 32/64/128 bit floating-point values, and with the 'd' or 'D', represent a decimal floating-point value.

For TOML, I think this could go one of two ways. Either:

  1. When parsing a TOML document, the parser is instructed to return decimal floating-point when a floating-point number is parsed in the document, or
  2. TOML format is updated to support the suffixes.

The first is backward compatible. The second is new-syntax.

nolange commented 5 years ago

Copied from #617 since this comment is more appropriate in this issue.

> Why not use decimal floating-point? It's kinder to humans (and isn't that the point of TOML?).

You can't represent the whole range of IEEE 754 floats accurately that way. I agree that it's nicer to humans, but you lose the ability to define numbers exactly. I would also like to be able to use TOML for some embedded projects, and the base-10 -> float conversion prevents that.

> Modern processors are now starting to support the decimal floating-point standard defined in IEEE 754-2008. For example the Intel C compiler now has support for the data types and functions to check the support status.

No desktop processor supports decimal floating point so far.

> The C data types in the standard are _Decimal32, _Decimal64, and _Decimal128. Constants/literals in C use a suffix to denote which data type the value is to be represented in. I believe the suffixes are as per this proposal in 2008:
>
>   • f d l F D L
>   • df dd dl DF DD DL
>
> These denote 32/64/128 bit floating-point values, and with the 'd' or 'D', represent a decimal floating-point value.
>
> For TOML, I think this could go one of two ways. Either:
>
>   1. When parsing a TOML document, the parser is instructed to return decimal floating-point when a floating-point number is parsed in the document, or
>   2. TOML format is updated to support the suffixes.
>
> The first is backward compatible. The second is new-syntax.

That's a very different use case from this issue, which is just another syntax for the already supported IEEE 754 floats: a program using a TOML parser would just get back a double and not a new type. And since a float is rather different from a "decimal", I would not pick option 1. Furthermore, you could also just pass decimals as string values instead of introducing another type.

graza commented 5 years ago

> No desktop processor supports decimal floating point so far.

I guess I come from a billing and revenue management system background where server class processors are the norm, decimal values are required, but it would be nice to use floating point for performance. Binary floating-point values don't suit because they can't accurately represent decimal values. Having to convert strings to decimals is a nuisance when the file format could just support it directly.

But I accept that my comment is off-topic. In domains other than my own, such as scientific processing, I'm sure hex floating-point would be very useful.

nolange commented 5 years ago

> No desktop processor supports decimal floating point so far.

> I guess I come from a billing and revenue management system background where server class processors are the norm, decimal values are required, but it would be nice to use floating point for performance. Binary floating-point values don't suit because they can't accurately represent decimal values.

I am not from this background, but I would have expected fractional fixed-point math (like using integers representing 1/1000 of a currency unit) to be used. You don't need huge exponents, so is there any real need for an exponential format? What server-class CPUs do you use? I just googled and found that POWER6 supports DFP in hardware, and I am actually surprised that there are any CPUs supporting it.
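
A minimal sketch of the fixed-point idea mentioned above, assuming non-negative amounts with at most three fractional digits (the helper names are hypothetical): amounts are kept as integer thousandths, so sums stay exact where binary floats drift.

```python
def to_thousandths(amount: str) -> int:
    """Convert a non-negative decimal string such as "19.99" to integer thousandths."""
    whole, _, frac = amount.partition(".")
    if len(frac) > 3:
        raise ValueError("more than three fractional digits")
    return int(whole) * 1000 + int(frac.ljust(3, "0"))


def format_thousandths(value: int) -> str:
    """Render integer thousandths back as a decimal string."""
    return f"{value // 1000}.{value % 1000:03d}"


price = to_thousandths("0.1")
total = price * 10
print(format_thousandths(total))  # 1.000
print(sum([0.1] * 10) == 1.0)     # False with binary floats
```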

> Having to convert strings to decimals is a nuisance when the file format could just support it directly.

That's nitpicking, but a conversion has to happen somewhere. The bigger issue is whether the conversion is lossy (which is generally the case when you convert between base 2 and base 10).

> But I accept that my comment is off-topic. In domains other than my own, such as scientific processing, I'm sure hex floating-point would be very useful.

Otherwise you can't directly specify base-2 values, which are used basically everywhere.

arp242 commented 1 year ago

There are some common languages that support this natively, and some that don't; it's about 50/50:

Support: C, C++, Java, Go, Zig, Swift, Lua, Perl, Python, Haskell (GHC extension)

No support: [C#], Ruby, Rust, PHP, JavaScript, TypeScript (no spec, but I tried it and it didn't work)

This means that implementations for those languages will have to parse these strings to a float type, and can't "just" use the stdlib parse_float() or whatnot (assuming the languages with native hex float literals have a stdlib function for that; I didn't check but I assume that most or all do). This is not hugely difficult, but not trivial either. It's also not clear to me if all languages use exactly the same syntax (I didn't investigate the specifications in detail, just a quick yes/no).

I think it's a rare enough feature that the added complexity for implementations isn't worth it, but I'm not dead-set against it.

[C#]: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/language-specification/lexical-structure#6454-real-literals

marzer commented 1 year ago

> but not trivial either. [...] the added complexity for implementations isn't worth it

TBH I'd say parsing hexfloats manually is trivial for people who are already writing a parser (given the necessary experience they'd already have); this is toml++'s implementation:

https://github.com/marzer/tomlplusplus/blob/e6d1958f923c16ee2b12510c16d7265d1e2e0d8e/include/toml%2B%2B/impl/parser.inl#L1957-L2131

It *looks* like there's a lot going on, but most of it is error handling - the actual hexfloat parse logic is only about 40 lines. I'd be happy for any implementers to use my implementation as a starting point.

arp242 commented 1 year ago

> the actual hexfloat parse logic is only about 40 lines

That's roughly the number I had in mind, so nice to have that confirmed. It's not hard-hard, but it's not "just a few lines added in 10 minutes" either, which is what I meant with trivial. For comparison, the parsing code (excluding lexer) is about 800 lines for BurntSushi/toml right now, so ~40 lines is comparatively a lot for a little-used feature (will also need to tweak the lexer a bit).

On the other hand: it's also the case that some values are hard or even impossible to express without this, but a number of popular programming languages seem to fare well enough without it, including some lower-level ones like Rust. Personally I've never needed this in my entire 25-year programming career, but I've also done limited low-level stuff.

It's a trade-off; personally I'd lean towards "it's not worth it", but it's a small lean.

Do you know of people using that feature in your library by the way? I couldn't really find anything on your issue tracker, or TOML files using it with GitHub code search: https://github.com/search?q=%2F0x%5Ba-fA-F0-9%5D%2B%3F%5C.%3Fp%5B-%2B%5D%2F&type=code

marzer commented 1 year ago

> Do you know of people using that feature in your library by the way

Me, in a private project :)

eksortso commented 1 year ago

> There are some common languages that support this natively, and some that don't; it's about 50/50:

You'll want to move Python to the Support camp, because float.fromhex is in the standard library.
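
For reference, a short sketch of the Python standard-library calls in question: `float.fromhex` parses a hexadecimal float string and `float.hex` is its inverse.

```python
value = float.fromhex("0x1.8p1")  # 1.5 * 2**1
print(value)                      # 3.0
print(value.hex())                # 0x1.8000000000000p+1
print(float.fromhex(value.hex()) == value)

# Unlike the C grammar quoted earlier, Python treats the exponent as optional.
print(float.fromhex("0x1.8"))     # 1.5
```

Because `float.fromhex` is more lenient than the C literal syntax, a TOML implementation would presumably still need its own validation in front of it.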

nolange commented 1 year ago

> I think it's a rare enough feature that the added complexity for implementations isn't worth it, but I'm not dead-set against it.

The smaller you get, the less likely it is that your library function will end up with the same value after parsing; see for example https://keithp.com/blogs/picolibc-string-float. I am not sure whether even standards compliance guarantees the same value after round trips, and "arbitrary-precision" implementations need a lot of code and RAM. Look at the popular STM32 MCUs (https://www.st.com/en/microcontrollers-microprocessors/stm32f3-series.html) for example: a performant FPU but only a couple of KB of ROM/RAM.

The feature you would be adding is lossless data exchange across platforms.

arp242 commented 1 year ago

@nolange the question isn't "is this useful in some cases?" but rather "is this useful enough"? Almost every feature is useful, but also comes with a cost for implementations, complexity, etc. TOML could easily be three times the size if we included every feature that was occasionally useful.

It's not really clear to me that many people want to use TOML on STM32 systems, and everyone else on a less constrained system can use a string if they really need this. It's not clear to me what @marzer's use case is exactly, but thus far it seems you're the only one who really needs this.

> You'll want to move Python to the Support camp, because float.fromhex is in the standard library.

Cheers, I updated it. Also turned out that Haskell supports it as a GHC extension.

marzer commented 1 year ago

@arp242

> It's not clear to me what @marzer's use case is exactly

It's a serialization problem. Sometimes floats need to be round-tripped exactly via a TOML file, and TOML (or, more specifically, the float<->string utilities provided to the TOML implementer by their language) won't necessarily guarantee that: the conversion to decimal form during serialization (i.e. formatting the data as TOML), and then from string to float at the other end during deserialization, can lead to precision loss. Keeping the value as a hexfloat at every step mitigates that.
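
To illustrate the round-trip concern, here is a hedged Python sketch. The `"%.15g"` format stands in for a hypothetical serializer that keeps too few decimal digits; it is not what any particular TOML library does.

```python
x = 0.1 + 0.2

short = "%.15g" % x            # stand-in for a lossy decimal formatter
print(short, float(short) == x)

full = repr(x)                 # Python's repr is designed to round-trip
print(full, float(full) == x)

hexed = x.hex()                # hexadecimal form carries the bits directly
print(hexed, float.fromhex(hexed) == x)
```

Whether the decimal path is lossless depends on both the writer and the reader being correctly rounded; the hexadecimal path only needs digit-by-digit copying.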

You could argue that TOML isn't really the right tool for this, and you'd be right in my particular case (I could have handled it differently but re-using my existing TOML infrastructure kept things very simple). It's a tricky case because the majority of people will never use it, but those who do, tend to really need it. TOML's inclusion of dates and times also falls prey to this dichotomy, methinks.

edit: wording

ChristianSi commented 1 year ago

@marzer's use case is interesting. Until yesterday I considered this "neat, but too specialized to warrant the additional complexity." However, while I know TOML is chiefly meant for configuration, I think it can be useful for other use cases too, and serialization is a compelling one. Moreover, it's a kinda logical extension that brings the ways of writing integers and floats into better alignment.

These are not all that compelling arguments, and the "additional complexity" argument still weighs against it, but you can now count me in the "slightly in favor" rather than the "slightly against" camp.

nolange commented 1 year ago

> @nolange the question isn't "is this useful in some cases?" but rather "is this useful enough"? Almost every feature is useful, but also comes with a cost for implementations, complexity, etc. TOML could easily be three times the size if we included every feature that was occasionally useful.

In terms of complexity, parsing/printing the "normal" float format will overshadow everything else. I get that this doesn't matter if those routines are already available, but if they aren't, the costs are high.

> It's not really clear to me that many people want to use TOML on STM32 systems, and everyone else on a less constrained system can use a string if they really need this. It's not clear to me what @marzer's use case is exactly, but thus far it seems you're the only one who really needs this.

Everyone who cares that a float can be stored and retrieved without loss. Naturally you would use a really simple format like ini files with your own conventions, because there aren't good alternatives. I see config files as pure data storage, and wouldn't want any lossy conversions. Whether your conversion is lossy depends on the underlying library: C/C++ is very lax about how the functions should be implemented (there are only recommendations).

This means you might get different results when reading a TOML file containing decimal floats, depending on the language/standard library used. I know you say this is an additional feature, but I am not sure it's understood that this is something functional and not just "cosmetic". (Somewhat more esoteric: you could then even store/retrieve all the bits of NaNs, in case you want to dump internal state.)

marzer commented 1 year ago

As if by divine providence, here's an article published today by the author of C++'s fmt library (and by extension, C++20's std::format) that details some of the pitfalls people can fall into when trying to serialize floating-point data to strings:

https://www.zverovich.net/2023/06/04/printing-double.html

(you'll note that towards the end of the article, hexfloats are listed as a more robust alternative in the absence of sensible float -> decimal string machinery)

arp242 commented 1 year ago

> @marzer's use case is interesting. Until yesterday I considered this "neat, but too specialized to warrant the additional complexity." However, while I know TOML is chiefly meant for configuration, I think it can be useful for other use cases too, and serialization is a compelling one. Moreover, it's a kinda logical extension that brings the ways of writing integers and floats into better alignment.

You can use string representations for this (e.g. f = "0x1.a8c1f14e2af5dp-145"), which should be "good enough" for almost all use cases except the ones with limited memory, which is what nolange mentioned.

Imaginary numbers can't be represented in TOML either for example, other than using a string (or an [int, int] array, I guess). While this is certainly useful in some contexts, using strings is "good enough" for most use cases IMHO.

My sense of "taste" says this doesn't need to be in TOML, which is not very tangible, and I can definitely understand the arguments in favour of it. If someone would make a concrete PR then I probably won't vote against it (probably! Not a promise!)

marzer commented 1 year ago

> You can use string representations for this (e.g. f = "0x1.a8c1f14e2af5dp-145"), which should be "good enough" for almost all use cases except the ones with limited memory, which is what nolange mentioned.

Yeah, I guess, but then why not just do that for everything? Why have a bool type, when enabled = "true" is 'good enough'?

Plus, from a user perspective it's inconsistent and error-prone; the neighbouring int, float and boolean KVPs are happily quote-free, but don't forget the quotes around this special magic one! =/

arp242 commented 1 year ago

> Yeah, I guess, but then why not just do that for everything? Why have a bool type, when enabled = "true" is 'good enough'?

Because lots more people want to use enabled = true and as far as I can tell a very small number of people want to use these hex literals. That's basically the difference. Obviously you need to draw a line somewhere, and it's a matter of "taste" where exactly to draw that.

eksortso commented 1 year ago

Personally, I would prefer to see literal decimal syntax be adopted first. But I wouldn't object to the introduction of hexfloats, given proper limits.

We can only guarantee so many hexadecimal places are preserved in practice. At the barest minimum, three hex digits would allow "thousandths" (the standard we use elsewhere for decimal accuracy) to be represented accurately, but I'm assuming most use cases would use many more than three digits.
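
One possible reading of the three-digit remark, sketched in Python: three hex fraction digits give a step of 1/4096, which is finer than 1/1000, so every thousandth lies within half a step of some three-digit hex fraction and can be recovered by rounding. Most thousandths are still not hit exactly, so "accurately" here means "to the nearest thousandth", which is an interpretation rather than something stated above.

```python
STEP = 1 / 16**3  # three hex fraction digits resolve multiples of 1/4096

worst = 0.0
for k in range(1000):
    target = k / 1000
    nearest = round(target / STEP) * STEP  # closest three-hex-digit fraction
    worst = max(worst, abs(nearest - target))
    assert round(nearest * 1000) == k      # the thousandth is still recoverable

print(worst <= STEP / 2)
```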

erbsland-dev commented 1 year ago

I personally view TOML more as a user-friendly configuration format, distinct from XML or JSON. These latter formats tend to be challenging for humans to read and write without the aid of special tools, making them less ideal for configuration.

While I appreciate the advantages of storing floating-point values, especially as a C++ and embedded developer, I believe this practice is not a common occurrence in a configuration file. Instead, it seems more applicable to data serialization.

As I thought about situations where I might employ hex floats, I found the following:

From this list, I personally feel that only the first point holds minor relevance in the context of software configuration. The last three scenarios seem more related to code execution or debugging rather than configuration.

I think rounding errors become highly significant when dealing with small floating-point types, such as a 32-bit float in C++. This is indeed a frequent occurrence in embedded development. However, the TOML specification dictates that floats should be realized as IEEE 754 binary64 values. Consequently, a floating-point value would inevitably need to be converted from a 64-bit to a 32-bit value, leading to potential issues. These problems could only be addressed if TOML provided support for floating-point values of different sizes.
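
The 64-bit to 32-bit narrowing can be sketched in Python with the standard `struct` module (the helper name `to_binary32` is made up for this example). A value written with at most 24 significant bits survives the narrowing unchanged, while a decimal literal such as 0.1 does not:

```python
import struct


def to_binary32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value and widen it again."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(to_binary32(0.1) == 0.1)          # False: 0.1 was rounded a second time
print(to_binary32(0.1).hex())

exact = float.fromhex("0x1.99999ap-4")  # fits in a binary32 significand
print(to_binary32(exact) == exact)      # True
```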

In conclusion, from my perspective, for the few cases where precise representation of floating-point values is essential, storing these values as strings seems acceptable.

Moreover, I believe it's crucial to remember the human-first nature of TOML. Personally, I find the hex float format challenging to comprehend. I struggle to convert the fractional part of a hex float into decimal form mentally. In my view, this lack of intuitive understanding contradicts the underlying premise of TOML, which prioritizes human-readability and interaction.

nolange commented 1 year ago

> Moreover, I believe it's crucial to remember the human-first nature of TOML. Personally, I find the hex float format challenging to comprehend.

Yet you support hex, octal and binary integers, and nan. What's "human-first" depends on what you consider the source: if one deals with a binary mantissa + exponent, then that's the most fitting representation. There's a certain unease if you have to consider whether your variable needs its source specified bit by bit, and is therefore forced into a string by the lack of spec support, versus just declaring your variable a float and choosing the most obviously fitting representation based on the value.

Neither the 32bit floats nor the 'small' floating-point values are an argument, the problem is always:

> In my view, this lack of intuitive understanding contradicts the underlying premise of TOML, which prioritizes human-readability and interaction.

I don't see how another way to write a float value is different from allowing an int to be specified as hex: you can still write your "readable" values, while also allowing values that are far easier to read as a base-2/base-16 mantissa and exponent. Conversely, I would now parse variables as strings instead of floats if I expect that precision could be an issue.

eksortso commented 1 year ago

We talked about this feature on and off for 5 years. Is it worthwhile? I could see a use for it. Is it minimal and obvious? For human beings, not so much, though for machine-generated TOML it's more obvious for accurate floats, lacking better alternatives.

Should we include it as part of TOML v1.1.0? No.

Let's defer it. Tag it "post-v1.1.0", and come back to it for a possible future version.

pradyunsg commented 1 year ago

We don't need a tag for this -- a PR for this would be welcome, but it likely won't be merged for a TOML 1.1.0 release. :)