masaccio / numbers-parser

Python module for parsing Apple Numbers .numbers files
MIT License

Doesn't work if there are more than 65536 rows #50

Closed: safinaskar closed this issue 1 year ago

safinaskar commented 1 year ago

numbers-parser seems not to work if an input .numbers document has more than 65536 rows.

Someone gave me a .numbers document. I have a PC running Linux, so I don't have Numbers. I installed numbers-parser and converted the document to CSV using cat-numbers. But the resulting CSV has 65536 normal rows, and after that every row is ,,,,,,,,,,,,,,,, (or None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None in --formatting mode). I think the original document has additional data beyond that point.

Unfortunately, I don't want to provide the original document via a public GitHub issue, because it contains confidential data.

Also, I'm not sure this is numbers-parser's problem; it is possible that they created a broken document in the first place.
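
For anyone checking their own cat-numbers output for the same symptom, here is a minimal sketch that finds where the all-empty rows begin. It uses only the Python standard library; the function name `first_blank_row` is made up for this illustration, and it assumes the CSV text has already been read into a string.

```python
import csv
import io


def first_blank_row(csv_text):
    """Return the 0-based index of the first row whose cells are all empty
    (or the literal string "None"), or None if every row holds some data.

    first_blank_row is a hypothetical helper, not part of numbers-parser.
    """
    reader = csv.reader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        # ",,,," parses to a list of empty strings; "None,None" appears in
        # --formatting mode, as described in the report above.
        if all(cell.strip() in ("", "None") for cell in row):
            return index
    return None


sample = "a,b,c\n1,2,3\n,,\n,,\n"
print(first_blank_row(sample))
```

A result of exactly 65536 would match the behaviour reported here. Note that a genuinely empty row in the source table would also be reported, so this is only a rough check.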

masaccio commented 1 year ago

There shouldn't be a problem with that many rows; Numbers supports documents up to 1M rows.

I just tried an empty document with >180k rows, adding some text at the end, and cat-numbers dumped all rows as I expected. To see whether the document is corrupted, you can try SheetJS. It stores data in-browser, so you're not transferring the document to their server.

Failing that, if you trust me with the document, I am happy to share a Dropbox URL with you.

This also works as expected:

from numbers_parser import Document

# Start from an empty document and take its first table
doc = Document()
sheets = doc.sheets
tables = sheets[0].tables
table = tables[0]

# Fill 100000 rows x 4 columns with increasing integers
x = 0
for row in range(100000):
    for col in range(4):
        table.write(row, col, x)
        x += 1
doc.save("large.numbers")

% cat-numbers large.numbers | wc -l
  100000

safinaskar commented 1 year ago

One of my files is 99 MB. I tried to upload it to https://oss.sheetjs.com/ and to https://sheetjs.com/sql/ , but both sites hung.

Thanks anyway. I will try to parse the files some other way.

masaccio commented 1 year ago

There are a couple of suggestions here: https://askubuntu.com/questions/408306/is-there-a-way-to-read-osx-numbers-files

LibreOffice doesn't advertise .numbers support, but a random Internet person claims it works. Creating a free iCloud account, even if it's just a trial, is a good idea, as you'll be able to upload the file and then download it as Excel. It's also a sure-fire way to see if the file is corrupted. If iCloud can't load it, then it's borked.

safinaskar commented 1 year ago

@masaccio , I wrote a Rust program and was able to fully restore the data from that .numbers file with it. So, yes, the file is correct.

I still don't want to share the .numbers file itself, but if you want, I can share that Rust program.

Also, if you want, I can create an iCloud account, try to create a similar .numbers file with fake data using iCloud, reproduce the bug, and share that file.

Some hints about the file contents: it is a table with more than 65536 rows and 17 columns, full of varied data, with many different strings and many different numbers.

I can think of one possible source of the bug, but I'm not sure. It is an overflow in ListEntry.key ( https://github.com/psobot/keynote-parser/blob/7114e3b6594a68d6c6885f469c7b4b3bdc27eb86/protos/TSTArchives.proto#L227 ). In my files this key can exceed 65536, and thus the key embedded in TileRowInfo.cell_storage_buffer = 6 ( https://github.com/psobot/keynote-parser/blob/7114e3b6594a68d6c6885f469c7b4b3bdc27eb86/protos/TSTArchives.proto#L128 ) can occupy more than 2 bytes.
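
To make the hypothesis above concrete, here is a small sketch of what goes wrong if a reader assumes a key always fits in two bytes. It only demonstrates 16-bit arithmetic with Python's `struct` module. It does not decode the real cell storage buffer, and the helper names are made up for illustration.

```python
import struct


def fits_in_u16(key):
    """True if key can be stored in an unsigned 16-bit field."""
    return 0 <= key <= 0xFFFF


def truncate_to_u16(key):
    """What a reader sees if it keeps only the low two bytes of key."""
    return key & 0xFFFF


def pack_key_u16(key):
    """Pack key as little-endian uint16; raises struct.error if it does not fit."""
    return struct.pack("<H", key)


for key in (65535, 65536, 65537):
    print(key, fits_in_u16(key), truncate_to_u16(key))
```

Under that assumption, keys from 65536 upward would silently alias low-numbered keys when truncated, or fail outright when packed strictly.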

masaccio commented 1 year ago

That's an easy experiment to try. My example above used numbers rather than string keys. Thanks for the pointers.

masaccio commented 1 year ago

Yup, when I use strings in that example above, I get 2^10 strings dumped and then nothing. File creation actually works, so it's 'just' in reading. Will fix, and if you don't mind testing the fix, that would be great.

safinaskar commented 1 year ago

Yes, I will test it.

masaccio commented 1 year ago

@safinaskar it should be working in 3.10.1. Numbers breaks the row storage maps into chunks of 64k entries, which I was not supporting in read.
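
As a rough illustration of the chunking described above (not the actual numbers-parser code, whose internals are not shown in this thread), a global row index can be split into a chunk number and an offset within that chunk. `CHUNK_SIZE`, `locate_row` and `global_row` are hypothetical names, and the chunk size of 65536 is taken from the "64k" figure in the comment.

```python
CHUNK_SIZE = 1 << 16  # 65536 entries per chunk, assumed from "64k" above


def locate_row(row):
    """Split a global row index into (chunk index, offset within chunk)."""
    if row < 0:
        raise ValueError("row index must be non-negative")
    return divmod(row, CHUNK_SIZE)


def global_row(chunk, offset):
    """Inverse of locate_row: rebuild the global row index."""
    return chunk * CHUNK_SIZE + offset


for row in (0, 65535, 65536, 100000):
    print(row, locate_row(row))
```

A reader that only ever looks at chunk 0 would return data for rows 0 to 65535 and nothing afterwards, which matches the symptom in the original report.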

safinaskar commented 1 year ago

Yes, now it works.