Add streaming of xlsx file support

jappeace commented 3 years ago

This goes a long way of implementing https://github.com/qrilka/xlsx/issues/132

It adds both a parser and a writer module for streaming xlsx files. Although inline with the library in general "only basic functionality at the moment".

The parser doesn't use conduit because we wanted to make it as fast as xlsx2csv, which we did (credits go to Tim). The writer is a lot slower and in the future we may need to change this API as well to speed it up. However, both are functional now and in production at supercede. This doesn't mean we're set on this particular approach and we welcome feedback.

I'm sorry for doing this as a massive code dump and I don't expect this to be accepted in a timely manner. However for us we had to get it done in a timely manner which is why we decided to work from a temporary fork. I hope these changes are still welcome in the upstream module even though it's a lot in one go.

jappeace commented 3 years ago

I've broken the microlens build because I use makePrisms. I'll define them in place instead.

qrilka commented 3 years ago

@jappeace thanks for contibuting this. I hope to find a window of time this weekend to read your PR. I have one question though - you talk about the solution being "fast" but what about having some at least basic benchmarks demonstrating that?

jappeace commented 3 years ago

Yes, parsing is as fast as the best substitute for streaming which is that xlsx2csv program. Compared to the existing code it's slower but it can handle 1 million row excell files now (which the existing implementation couldn't).

For a benchmark, do you mean to add it to benchmarks/Main and then compare it to the xlsx2csv program or the existing implementation? Or both?

qrilka commented 3 years ago

Just choose whatever you see more appropriate

jappeace commented 3 years ago

Sorry, haven't addressed all comments yet, nor did the benchmark, I'll see if I can get another window soon.

qrilka commented 3 years ago

No problem, there seem to be no particular rush

jappeace commented 2 years ago

These are the results of the benchmark, the parse stream solution is surprisingly competitive to the existing variant:

benchmarking readFile/with xlsx
time                 137.7 ms   (133.7 ms .. 140.8 ms)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 136.5 ms   (134.1 ms .. 138.8 ms)
std dev              3.580 ms   (2.321 ms .. 5.190 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 28.18 ms   (28.00 ms .. 28.48 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 28.33 ms   (28.10 ms .. 29.02 ms)
std dev              796.8 μs   (198.1 μs .. 1.473 ms)

benchmarking readFile/with stream (counting)
time                 13.57 ms   (13.49 ms .. 13.65 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.56 ms   (13.52 ms .. 13.61 ms)
std dev              120.9 μs   (96.27 μs .. 156.7 μs)

benchmarking readFile/with stream (reading)
time                 33.17 ms   (33.05 ms .. 33.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 33.11 ms   (32.93 ms .. 33.25 ms)
std dev              343.3 μs   (226.4 μs .. 545.2 μs)

benchmarking writeFile/stream
time                 88.02 ms   (87.62 ms .. 88.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.32 ms   (88.15 ms .. 88.49 ms)
std dev              290.4 μs   (181.4 μs .. 424.5 μs)

counting skips the parsing step and just counts the rows in an excel file. reading does a full parse of a row. Keep in mind the goal of streaming isn't speed but constant memory.

qrilka commented 2 years ago

@jappeace could you add non-streamed writing for comparison?

jappeace commented 2 years ago

Benchmarks for the existing writing function, I also added the streaming variant that creates a shared strings table:

benchmarking readFile/with xlsx
time                 130.6 ms   (127.9 ms .. 133.4 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 131.4 ms   (129.7 ms .. 133.3 ms)
std dev              2.832 ms   (1.906 ms .. 4.470 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking readFile/with xlsx fast
time                 27.49 ms   (27.29 ms .. 27.72 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.16 ms   (27.78 ms .. 28.97 ms)
std dev              1.159 ms   (298.6 μs .. 1.773 ms)
variance introduced by outliers: 10% (moderately inflated)

benchmarking readFile/with stream (counting)
time                 13.29 ms   (13.23 ms .. 13.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.32 ms   (13.28 ms .. 13.39 ms)
std dev              124.6 μs   (81.60 μs .. 214.2 μs)

benchmarking readFile/with stream (reading)
time                 32.86 ms   (32.70 ms .. 32.97 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 32.83 ms   (32.59 ms .. 32.97 ms)
std dev              373.9 μs   (155.5 μs .. 655.5 μs)

benchmarking writeFile/with xlsx
time                 83.07 ms   (82.81 ms .. 83.30 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 82.68 ms   (82.33 ms .. 82.85 ms)
std dev              415.2 μs   (170.9 μs .. 677.9 μs)

benchmarking writeFile/with stream (no sst)
time                 88.15 ms   (87.88 ms .. 88.35 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 88.00 ms   (87.83 ms .. 88.12 ms)
std dev              248.2 μs   (176.6 μs .. 321.5 μs)

benchmarking writeFile/with stream (sst)
time                 89.90 ms   (89.71 ms .. 90.11 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 89.95 ms   (89.85 ms .. 90.05 ms)
std dev              168.1 μs   (132.2 μs .. 223.6 μs)

In future we may have a faster variant for writing.

qrilka commented 2 years ago

Greatt work @jappeace This change looks quite good to release version 1.0 of the library but I'd like also to have https://github.com/ocramz/xeno/issues/53 so it could support GHC 9.2 as well. Unfortunately I didn't have time yet to look into that problem.

jappeace commented 2 years ago

I'll see if I can fix it Wednesday or in the holiday

qrilka / xlsx

Add streaming of xlsx file support #144