tafia / calamine

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
MIT License

"Big-Data File" or "Database" based backend for very large spreadsheets #368

Closed RoloEdits closed 8 months ago

RoloEdits commented 8 months ago

Building off an idea that excelize uses, we could save really large datasets onto disk as a file.

Big Data

Some options:

pandas has documentation on working with large datasets and one of the solutions is to do something similar.

It could be behind a feature flag like storage, so as to keep the implementation out of the public API. Instead of building a structure in memory, we would start writing to a file in the OS's temp directory. Parsing the xml file probably takes longer than writing the file to disk, but performance would still decrease compared to an in-memory-only solution; if the data would never fit in memory in the first place, though, having a file format optimized for this kind of task is worth the cost.

Some formats need to be saved all at once, meaning the data would have to be built up in memory before it could be written. A workaround would be to save smaller chunks to file and then aggregate the data when needed. This idea of 'chunking' is also often mentioned as a solution in online forums when working with large amounts of data.
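A very rough sketch of what that chunked spill could look like, using only the standard library; the type, file names, and tab-separated layout are made up for illustration, and a real backend would want a compact binary format:

use std::env;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;

// Buffers parsed rows and flushes each chunk to its own file in the OS temp dir.
struct ChunkSpill {
    dir: PathBuf,
    chunk_rows: usize,
    buffer: Vec<Vec<String>>,
    chunk_id: usize,
}

impl ChunkSpill {
    fn new(chunk_rows: usize) -> Self {
        Self { dir: env::temp_dir(), chunk_rows, buffer: Vec::new(), chunk_id: 0 }
    }

    // Called once per row as the xml parser produces it.
    fn push_row(&mut self, row: Vec<String>) -> std::io::Result<()> {
        self.buffer.push(row);
        if self.buffer.len() == self.chunk_rows {
            self.flush_chunk()?;
        }
        Ok(())
    }

    // Writes the buffered rows out as one numbered chunk file and clears the buffer.
    fn flush_chunk(&mut self) -> std::io::Result<()> {
        let path = self.dir.join(format!("calamine_chunk_{}.tsv", self.chunk_id));
        let mut out = BufWriter::new(File::create(path)?);
        for row in self.buffer.drain(..) {
            writeln!(out, "{}", row.join("\t"))?;
        }
        self.chunk_id += 1;
        Ok(())
    }
}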

If/when calamine offers writing and editing, for the read-only formats we could write each updated chunk as a new file with the new values and invalidate the old one. Or we could simply not offer a writing mechanism when this is the backing storage.

The files would most likely only store the values themselves, so other cell metadata would still have to be kept in memory, if wanted, and that could still use a lot of memory for things like cell fill color, font, border, format, etc. So in that case, a styles flag would be incompatible with a storage flag.

SQLite

Outside of the "Big Data" formats, something like SQLite could be worth looking into. A single .db file would store the data, updates could be done in place, and cell metadata could be stored in columns alongside the value.

CREATE TABLE spreadsheet (
    row INT,
    column INT,
    value TEXT,
    color TEXT
    -- etc.
);

An index can be created over the (row, column) pair for better performance when iterating.
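As a rough sketch of that layout, assuming the rusqlite crate as the SQLite binding (nothing like this exists in calamine today, and the names are made up):

use rusqlite::{Connection, Result};

// Create the backing .db file with the schema above plus the (row, column) index.
fn create_backing_db(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE spreadsheet (
             row    INTEGER,
             column INTEGER,
             value  TEXT,
             color  TEXT
             -- etc.
         );
         CREATE INDEX idx_row_col ON spreadsheet (row, column);",
    )?;
    Ok(conn)
}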

Questions

Something to be aware of is that most, if not all, of the formats have datatypes. Matching cell values with the proper data type could be more challenging than it initially seems.

With SQLite as an example, the value column would have to be TEXT to store every kind of value in one place, and another column would need to be added to keep track of the type.
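For illustration, a possible mapping from a cell to a (type tag, text) pair, assuming the DataType cell enum from the calamine versions current at the time of this issue; the tag names are invented:

use calamine::DataType;

// Flatten a cell into a type tag plus its textual form, so both fit in TEXT columns.
fn to_db_row(cell: &DataType) -> (&'static str, String) {
    match cell {
        DataType::Int(i) => ("int", i.to_string()),
        DataType::Float(f) => ("float", f.to_string()),
        DataType::String(s) => ("string", s.clone()),
        DataType::Bool(b) => ("bool", b.to_string()),
        DataType::DateTime(d) => ("datetime", d.to_string()),
        DataType::Empty => ("empty", String::new()),
        // Errors and any newer variants would need their own tags.
        other => ("other", format!("{:?}", other)),
    }
}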

Another issue is that, in a worst-case scenario, the sharedStrings file could be huge, and with the wrong kind of dataset this alone could fill up memory. For the SQLite solution there could simply be a shared strings table with the ID as the index, and the spreadsheet table would have a foreign key pointing at that ID. This does mean we can't store every value in the value column, since some values don't reference the sharedStrings data. But with that figured out, it would allow asynchronous reads/writes, keeping memory usage to whatever the buffers are sized at.
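One possible shape for that split, again assuming rusqlite; the table and column names are invented, and the nullable string_id is what lets non-shared values keep living in the value column:

use rusqlite::{Connection, Result};

// Shared strings are stored once; a cell either carries an inline value or
// points at a shared string, never both.
fn create_shared_strings_schema(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE shared_strings (
             id  INTEGER PRIMARY KEY,
             val TEXT NOT NULL
         );
         CREATE TABLE cells (
             sheet     TEXT,
             row       INTEGER,
             column    INTEGER,
             type      TEXT,    -- 'int', 'float', 'string', ...
             value     TEXT,    -- inline values (numbers, bools, inline strings)
             string_id INTEGER, -- NULL unless the cell references a shared string
             FOREIGN KEY (string_id) REFERENCES shared_strings (id)
         );",
    )
}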

Another database option could be surrealdb, which has more flexibility in how data is represented and "joined". It also comes with a file-backed storage option.

Of all these suggestions, I must say I gravitate towards the database options the most. They seem the simplest to implement and offer the most straightforward way to get flexibility while keeping performance up.

The last main point to keep in mind is what the most likely use case for this big data actually is. Processing, analytics, saving to a new format, editing and resaving: these all imply different workloads that lend themselves to different implementations. Trying to pin down a use case to design against would probably be a good first step, if that's possible.

dimastbk commented 8 months ago

Maybe implement lazy iteration? Applications could then use any engine they like to save to disk. Two problems with it:

RoloEdits commented 8 months ago

Yeah, true lazy iteration is going to be a challenge. I don't think it's even possible to do a truly lazy implementation. I've been looking into other libs to see what they are doing, but it's taking me a while to navigate them and piece together how they handle it.

Even if we only allowed lazy iteration over cells, not rows or columns, the problem of having to scrape the shared strings and styles files doesn't go away.

I split the workload problems between wall-clock constraints and memory constraints.

For the wall-clock issue I have sketched out some subsets we could add to help with certain workloads. In the first subset, we load all the sheets and their information into memory, parsing them in parallel. We represent the range fully, empty-value cells included, so what you index into is what you would expect. This would be an open() function.

Another version of this could parse only specified sheets, so not all sheets go into memory, only the given ones. We would parse only the metadata of the sheets, like the name and index, and use that to initiate a full parse of the sheets passed in by the user. So a Workbook::open_lazy("WORKBOOK") would get the metadata, and a workbook.sheets_by_name(&["Sheet1", "Sheet5"]) would then parse those two in parallel for direct and immediate use.

For some potential memory savings, an optimization can be made when handling sparse data. The shared strings and sheets are all loaded into memory, but the backing container discards the empty-value cells, compacting the data together. There would be something like an open_sparse() and an open_sparse_lazy(). This saves some space, since only a spreadsheet's values are kept, but indexing becomes very constrained to the sheet itself. Besides some helper methods like a .last(), which gets the last column with a value for a given row, indexing needs prior knowledge of the sheet. But that could be what the workload calls for.

In both cases the expectation is that the use case requires multiple sheets of data and random access to cells directly at one's immediate disposal.
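Put together, the entry points sketched above might look something like the following. These are hypothetical signatures written as a trait purely for sketching; none of this is calamine's current API:

use std::path::Path;

trait BigWorkbook: Sized {
    type Sheet;
    type Error;

    // Parse every sheet (plus shared strings/styles) into memory, in parallel,
    // keeping empty-value cells so indexing matches the spreadsheet exactly.
    fn open(path: impl AsRef<Path>) -> Result<Self, Self::Error>;

    // Parse only workbook metadata (sheet names, indexes); sheet contents are
    // parsed later, on demand.
    fn open_lazy(path: impl AsRef<Path>) -> Result<Self, Self::Error>;

    // Like open(), but the backing container drops empty-value cells.
    fn open_sparse(path: impl AsRef<Path>) -> Result<Self, Self::Error>;

    // Metadata only up front, sparse storage once a sheet is requested.
    fn open_sparse_lazy(path: impl AsRef<Path>) -> Result<Self, Self::Error>;

    // Fully parse just the named sheets, in parallel, for immediate use.
    fn sheets_by_name(&mut self, names: &[&str]) -> Result<Vec<Self::Sheet>, Self::Error>;
}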

For pure memory optimizations, the options are much more constrained. Even opening only the sheet that is needed runs into very easily reachable limits once you take into account that the shared strings file still needs to be read. open_lazy() would still load the full thing into memory, and the full data from the requested sheets is also loaded, so it is not really a true lazy implementation.

But beyond that, the next solution I can think of goes straight to the filesystem. Borrowing some of open_lazy(), we could load only the specified sheets, but parse all of the shared strings into a shared_strings table (using a .db file as an example) and then parse the specified sheets into a sheet table with the info needed to reconstruct a sheet and its cells. Iterating over the rows would be some kind of SELECT value, type, column FROM sheets WHERE sheet = $1 AND row = $2. There could be an optimization here to fetch something like 10 rows at a time and keep track of the state in the iterator.
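A sketch of that batched fetch, reusing rusqlite and the table/column names from the SELECT above; a real iterator would own the connection, the current offset, and the fetched batch as its state:

use rusqlite::{params, Connection, Result};

// Fetch one batch of `batch` rows for `sheet`, starting at `start_row`. An
// iterator wrapper would call this repeatedly, bumping `start_row` by `batch`.
fn fetch_rows(
    conn: &Connection,
    sheet: &str,
    start_row: i64,
    batch: i64,
) -> Result<Vec<(i64, i64, Option<String>, Option<String>)>> {
    let mut stmt = conn.prepare(
        "SELECT row, column, value, type FROM sheets
         WHERE sheet = ?1 AND row >= ?2 AND row < ?3
         ORDER BY row, column",
    )?;
    let rows = stmt.query_map(params![sheet, start_row, start_row + batch], |r| {
        // (row, column, value, type); value and type may be NULL for some cells.
        Ok((r.get(0)?, r.get(1)?, r.get(2)?, r.get(3)?))
    })?;
    rows.collect()
}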

The two extremes of opening everything into memory all at once and writing everything to disk all at once cover the full range of the constraints I laid out. But none of these solutions makes accessing a random location, like the first row, any faster. That constraint sits in the middle of the two I described, but it is also kind of its own thing: lazy iteration improves speed, but really only time-to-first-row, and memory stays low, but only within the scope of the iteration. It's not that you can't fit the data into memory and therefore turn to it; the lower memory use is more of a byproduct, in my opinion, than the goal. The real benefit is that you could access earlier information without needing to parse all the data you won't use.

I am still working on a write-up about working towards a 1.0 API. I'm adding to it every day when I have some time to work on it. I'm trying to flesh out some of the main challenges and at least provide a direction to go in. Lazy iteration, unfortunately, is something I'm not really any closer to figuring out. I've only been programming for about a year (this project was actually my first open-source contribution), so I don't have much experience to lean on for these things. I'll try to get it up sometime before the end of this month though.

tafia commented 8 months ago

Thanks for writing this up.

While I am clearly not against it (I find it interesting), I'm afraid it'll be very complex. I want to explore streaming (by rows or cells) iterators first, even if that seems complex too. I also have the feeling that very large Excel files, while a reality, are pretty niche, as that type of workflow should by now have moved to other formats/databases (Excel itself struggles to open these files).

Shared strings in xlsx, but how big can they be?

For reference, the file used in the dataset has

In an ideal world (for xlsx at least; other file formats behave differently), there should be a "mode" where we can pull the sharedStrings directly from the zip, on demand. For instance, we could scan it first and only store string indexes, etc. It would then be much smaller.
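A naive sketch of that on-demand mode, assuming the zip and quick-xml crates that calamine already builds on. A real version would scan once, remember a byte offset per <si> entry and seek instead of re-streaming, and would concatenate rich-text runs; this only grabs the first <t> of the matching entry:

use std::fs::File;
use std::io::BufReader;

use quick_xml::events::Event;
use quick_xml::Reader;
use zip::ZipArchive;

// Stream xl/sharedStrings.xml straight out of the archive and return the text
// of the index-th <si> entry, without materialising the whole table in memory.
fn shared_string_at(path: &str, index: usize) -> Option<String> {
    let mut archive = ZipArchive::new(File::open(path).ok()?).ok()?;
    let entry = archive.by_name("xl/sharedStrings.xml").ok()?;
    let mut reader = Reader::from_reader(BufReader::new(entry));
    let mut buf = Vec::new();
    let mut current = None::<usize>; // index of the <si> we are currently inside
    let mut count = 0usize;
    loop {
        match reader.read_event_into(&mut buf).ok()? {
            Event::Start(e) if e.name().as_ref() == b"si" => {
                current = Some(count);
                count += 1;
            }
            Event::Text(t) if current == Some(index) => {
                return t.unescape().ok().map(|s| s.into_owned());
            }
            Event::End(e) if e.name().as_ref() == b"si" => current = None,
            Event::Eof => return None,
            _ => {}
        }
        buf.clear();
    }
}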

tafia commented 8 months ago

Closing this now in favor of #370, which will bring support for lazy cell reading. It does not work for xls and ods files though; feel free to reopen.