Ambiguity in numeric literal formatting in test cases

thomasrstorey commented 2 years ago

Howdy! I am working on a Perl implementation of a KDL parser (which I would love to submit for the official 1.0 compliant list once it is closer to completion). One issue I am running into right now is, I am not clear on how to correctly and consistently handle numeric literals such that they comply with the expected output of the test cases. The only note provided regarding numeric formatting in tests/README.md is:

All numbers must be converted to their simplest decimal representation. That means that hex, octal, and binary must all be converted to decimals. All floats must be represented using E notation, with a single digit left of the decimal point if the float is less than 1. While parsers are required to consume different number syntaxes, they are under no obligation to represent numbers in any particular way.

So for example, why does:

node 1e10

Result in:

node 1E+10

instead of:

10000000000

? What is the definition of simplest decimal representation? If a parser is under no obligation to represent a number in any particular way, what is intrinsic to the value derived from 1e10 that means it should be represented with E notation instead of as a simple decimal integer?

Thanks!

zkat commented 2 years ago

I believe these instructions are for how to run things against the test suite, because of the specific output the test suite expects. You can represent things and even present them in your own API however you want. But your test-specific printer has to represent things in this way for the test suite to work. Does that make sense?

Likewise, there's a ton of flexibility in how your parser handles numbers. You can keep hex numbers as hex numbers (for example, if you're writing a parser that's meant to edit kdl files programmatically), etc.

thomasrstorey commented 2 years ago

Thanks for the quick response! That does make sense, but I am not sure if it answers my question... After studying the test cases and the wording a bit more closely I will try express my confusion more clearly. :)

for no_decimal_exponent.kdl, the input is:

node 1e10

and the expected output is

node 1E+10

However, for positive_exponent.kdl, the input is

node 1.0e+10

and the expected output is

node 1.0E+10

Now, my parser can parse both inputs fine, but they effectively result in identical internal values. So when they print, they both come out as 1.0E+10, which is wrong. Does my pretty printer need to remember that the original parsed string was formatted with a decimal point and use that to inform the output format? Or do I need to allow for controls in my pretty print API and choose to print certain values a certain way on a per-test case basis?

tabatkins commented 2 years ago

Yes, we could use a little more detail in precisely how the test suite expects numeric output to work; I had to do some slightly non-trivial stuff to hold onto enough information to output in the test's expected output, and put that output form behind a flag so the default is a little simpler.

It looks like I'm passing those two tests in particular by accident, just because I'm relying on Python's native number parsing and printing, and it distinguishes ints from floats and prints them differently. I'm wondering if we should canonicalize those in the test suite to one form or the other?

tabatkins commented 2 years ago

In particular, right now the test suite expects:

numbers to be canonicalized to Decimal
the presence or absence of a decimal part and an exponent part to be preserved
the precise value of the exponent, if present, to be preserved (no canonicalizing to, say, engineering notation's "divisible by 3" exponents)
the exponent, when printed, always uses "E", and has a "+" or "-" in front of it

Aside from the fact that 1.0 needs to be preserved in that way, insignificant zeros are never preserved. It probably makes sense to relax that final case and instead expect insignificant zeros to always be omitted?

thomasrstorey commented 2 years ago

Thanks @tabatkins those expectations and your explanation of your approach are very helpful!

I would like to avoid tightly coupling the behavior of my pretty-printer to the specific formatting observed by my parser, and would rather control the output of my pretty-printer explicitly by configuration. It sounds like there isn't any specific reason not to do so, as long as the parser can parse the input files and is capable of representing them in the way shown in the expected output.

tjol commented 1 year ago

FWIW, in ckdl I don't keep track of significant digits, or of whether scientific notation was used, but, for the purposes of running the test suite, configured float output to:

Always print a decimal point
Use a capital E for the exponent, if present
Always print a + or - after the E, if present
Use an exponent starting at E+/-4

This covers all tests except no_decimal_exponent.kdl, which happens to rely on keeping track of significant digits. (That test I've excluded from the ckdl test suite)

kdl-org / kdl

Ambiguity in numeric literal formatting in test cases #252