masaccio / numbers-parser

Python module for parsing Apple Numbers .numbers files
MIT License
201 stars 14 forks source link

cat-numbers outputs 0.00016450000000000001 instead of 0.0001645 #51

Closed safinaskar closed 11 months ago

safinaskar commented 1 year ago

I have one .numbers file. Its TileRowInfo.cell_storage_buffer = 6 ( https://github.com/psobot/keynote-parser/blob/7114e3b6594a68d6c6885f469c7b4b3bdc27eb86/protos/TSTArchives.proto#L128 ) for one particular cell contains this data:

[05, 02, 00, 00, 00, 00, 00, 00, 01, 30, 00, 00, 00, c8, 10, a4, 9c, 95, 00, 00, 00, 00, 00, 00, 00, 00, 1c, 30, 01, 00, 00, 00, 0d, 00, 00, 00]

I reverse-engineered this format and wrote Rust program for conversion from .numbers format. In my Rust program I decode this number as 0.0001645. I believe this is correct rendering of this number.

But the following two commands:

cat-numbers --formatting --brief a.numbers
cat-numbers              --brief a.numbers

both produce number 0.00016450000000000001. I believe this is incorrect answer

safinaskar commented 1 year ago

In that Rust program I convert numbers from TileRowInfo.cell_storage_buffer representation directly to my own homegrown fixed-point arithmetic library representation. And then directly convert this number to string. I. e. IEEE floating-point numbers are never involved

masaccio commented 1 year ago

You're using decimal128 types I assume? That's what the underlying representation is. I unpack like this: https://github.com/masaccio/numbers-parser/blob/main/src/numbers_parser/cell_storage.py#L345C14-L353 but as you guessed, go through floats after that.

safinaskar commented 1 year ago

You're using decimal128 types I assume?

Possibly. In my .numbers file these numbers are stored as a * 10**b

masaccio commented 1 year ago

I can workaround this now but I am not able to reproduce using this test file.

masaccio commented 1 year ago

@safinaskar in the absence of tests, I've published an update. The change will break the current API so I've not made this visible by default. To use it with cat-numbers, you will have to say cat-numbers --experimental. In this mode, decimals are decoded using decimal128. I've not benchmarked yet but unless you're doing lots of arithmetic with the results from the API, there should be no difference.

@SheetJSDev - FYI on a future change that will become the default and may change how some of your tests work.

safinaskar commented 1 year ago

Thanks a lot for your work! I don't have currently time. When I have time, I will test your new version on my data. I hope this will happen in 1-2 months. Please, don't make this experimental behavior to default until I will do my tests

safinaskar commented 11 months ago

I got time and so tested numbers-parser. I still don't like how it works. Also, I got access to actual Apple Numbers and was able to produce test cases using it.

So, here are steps to reproduce remaining bugs in numbers-parser.

Create file a.csv with this contents:

0.00016450000000000000
0.0000000000000001
0.0001645

Then open it with Apple Numbers and save as a.numbers. I used Apple Numbers 13.1 (7037.0.101) running on macOS Ventura. Resulting file is attached. Feel free to store it in your tests. As you can see, Numbers preserves trailing zeros, so 0.00016450000000000000 and 0.0001645 are two distinct numbers from point of view of Numbers. (But I think this is not important for numbers-parser to preserve them.)

finalt.numbers.gz

Now let's run numbers-parser 4.2.0 on this file with various flags.

<[sid]>root@0ae577710eac:~# ~/.local/bin/cat-numbers --version
4.2.0
<[sid]>root@0ae577710eac:~# ~/.local/bin/cat-numbers /tmp/finalt.numbers
/tmp/finalt.numbers: Sheet 1: finalt: 0.00016450000000000001
/tmp/finalt.numbers: Sheet 1: finalt: 1e-16
/tmp/finalt.numbers: Sheet 1: finalt: 0.0001645
<[sid]>root@0ae577710eac:~# ~/.local/bin/cat-numbers --formatting /tmp/finalt.numbers
/tmp/finalt.numbers: Sheet 1: finalt: 0.00016450000000000001
/tmp/finalt.numbers: Sheet 1: finalt: 0.0000000000000001
/tmp/finalt.numbers: Sheet 1: finalt: 0.0001645
<[sid]>root@0ae577710eac:~# ~/.local/bin/cat-numbers --experimental /tmp/finalt.numbers
/tmp/finalt.numbers: Sheet 1: finalt: 0.000164500000000000
/tmp/finalt.numbers: Sheet 1: finalt: 1E-16
/tmp/finalt.numbers: Sheet 1: finalt: 0.0001645
<[sid]>root@0ae577710eac:~# ~/.local/bin/cat-numbers --experimental --formatting /tmp/finalt.numbers
Traceback (most recent call last):
  File "/root/.local/bin/cat-numbers", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/.local/pipx/venvs/numbers-parser/lib/python3.11/site-packages/numbers_parser/_cat_numbers.py", line 128, in main
    print_table(args, filename)
  File "/root/.local/pipx/venvs/numbers-parser/lib/python3.11/site-packages/numbers_parser/_cat_numbers.py", line 97, in print_table
    cells = [cell_as_string(args, cell) for cell in row]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/pipx/venvs/numbers-parser/lib/python3.11/site-packages/numbers_parser/_cat_numbers.py", line 97, in <listcomp>
    cells = [cell_as_string(args, cell) for cell in row]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/pipx/venvs/numbers-parser/lib/python3.11/site-packages/numbers_parser/_cat_numbers.py", line 80, in cell_as_string
    elif args.formatting and cell.formatted_value is not None:
                             ^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/pipx/venvs/numbers-parser/lib/python3.11/site-packages/numbers_parser/cell.py", line 476, in formatted_value
    return self._storage.formatted
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/pipx/venvs/numbers-parser/lib/python3.11/site-packages/numbers_parser/cell_storage.py", line 284, in formatted
    return self.custom_format()
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/pipx/venvs/numbers-parser/lib/python3.11/site-packages/numbers_parser/cell_storage.py", line 349, in custom_format
    return format_decimal(self.d128, format)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/pipx/venvs/numbers-parser/lib/python3.11/site-packages/numbers_parser/cell_storage.py", line 715, in format_decimal
    if value.is_integer() and format.decimal_places >= DECIMAL_PLACES_AUTO:
       ^^^^^^^^^^^^^^^^
AttributeError: 'decimal.Decimal' object has no attribute 'is_integer'

Everything is wrong here. ~/.local/bin/cat-numbers /tmp/finalt.numbers and ~/.local/bin/cat-numbers --formatting /tmp/finalt.numbers are wrong, because they print first number as 0.00016450000000000001. This is wrong answer.

Also ~/.local/bin/cat-numbers /tmp/finalt.numbers prints 1e-16. I think cat-numbers should never print exponential form. ~/.local/bin/cat-numbers --experimental /tmp/finalt.numbers prints exponential form, too.

And ~/.local/bin/cat-numbers --experimental --formatting /tmp/finalt.numbers fails with backtrace.

Correct output should either be equal to original csv, i. e.

0.00016450000000000000
0.0000000000000001
0.0001645

either be this:

0.0001645
0.0000000000000001
0.0001645
masaccio commented 11 months ago

Thanks for the examples. Very helpful.

Replicating Numbers exactly could be difficult but I will take a look. It uses a non-standard number of bits of precision despite storing these as decimal128 in saved files. A recent example I got shared is how the value 2251799813685247 is rounded 2251799813685250 when you try and enter it.

The original example of 0.00016450000000000001 would be truncated by Numbers as it has too many digits after the first 1 so that may well be how it manages the rounding on imprecise floats.

safinaskar commented 11 months ago

When I open that original big file in Numbers, I see 0.00016450000000000000 there

masaccio commented 11 months ago

When I open that original big file in Numbers, I see 0.00016450000000000000 there

Try adding an extra digit to that decimal; it will be rounded. 0.000164500000000000001 rounds to 0.0001645 as soon as you enter It into a cell.

Numbers can add the zeroes at the end to give a number of decimal places which is what you see, but it cannot store more than 15 significant digits. The solution will be to truncate the number of digits to match the imprecision that Numbers has internally. Using decimal.Decimal is not the answer after all as that has more precision than Numbers itself.

I have formatting tests I can generate and can apply each of these to a number that fails to round precisely and see how Numbers presents it with its various formatting options. I've already replicated the output for the simple case by using significant figures rounding.

I believe you are looking for the output to match Numbers rather than representing values that Numbers cannot represent. I think matching the behaviour of Numbers is the right approach as well as ensuring that values round trip through load and save.

safinaskar commented 11 months ago

I believe you are looking for the output to match Numbers

Let me tell what I think numbers-parser should do. Let's assume we somehow got some input .numbers file. Let's open it in Numbers. We see that Numbers rendered some number as A. Then let's process the same .numbers file using cat-numbers. Let's assume cat-numbers printed the same number as B. I want these two properties to hold:

I don't need any additional properties. I. e. I don't need preserving trailing zeroes.

Also, let me remind that I already solved my original problem, so actually I don't need anything from numbers-parser anymore. I just share my thoughts to help you

The solution will be to truncate the number of digits to match the imprecision that Numbers has internally

I don't understand why you need this. This may be helpful when producing .numbers files from numbers-parser, but I thought we are talking about consuming them

masaccio commented 11 months ago

In 4.3.0:

% cat-numbers -b --formatting finalt.numbers 
0.00016450000000000000
0.0000000000000001
0.0001645

Regarding your truncation comment: Numbers does not store 0.0001645 as a string but instead converts its internal value to a decimal128 which results in imprecise values during floating point conversion. This is what you were seeing previously.

safinaskar commented 11 months ago

@masaccio, thanks! But this output still looks wrong to me:

# ~/.local/bin/cat-numbers -b /tmp/a.numbers
0.00016450000000000001
1e-16
0.0001645

, because first numbers is wrong and second number is exponential. But if you don't think this is bug, I don't, too

masaccio commented 11 months ago

Thanks again. The first number is a float conversion artefact (fixed). The CSV output has always been default python output so I will leave as-is since it's still correct in a machine-readable context. --formatting is expected to replicate the formatting of numbers including stuff like zero/space padding.