Closed: jappeace closed this pull request 2 years ago
I've broken the microlens build because I use makePrisms. I'll define them in place instead.
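A minimal sketch (not the PR's actual code) of what defining the optics in place can look like: a prism over a single constructor can be written by hand as a `Traversal'`, which microlens supports without the Template Haskell `makePrisms` splice. The `Item` type and `_ItemText` name below are purely illustrative.

```haskell
module HandPrism where

import Lens.Micro (Traversal')
import Data.Text (Text)

-- Illustrative value type standing in for whatever the PR pattern-matches on.
data Item
  = ItemText Text
  | ItemNumber Double

-- Focus the Text inside ItemText, leaving other constructors untouched.
_ItemText :: Traversal' Item Text
_ItemText f (ItemText t) = ItemText <$> f t
_ItemText _ other        = pure other
```

Hand-written definitions like this compile against both lens and microlens, at the cost of spelling each optic out.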
@jappeace thanks for contributing this. I hope to find a window of time this weekend to read your PR. I have one question, though: you talk about the solution being "fast", but what about adding at least some basic benchmarks demonstrating that?
Yes, parsing is as fast as the best substitute for streaming, which is that xlsx2csv program. Compared to the existing code it's slower, but it can now handle 1-million-row Excel files (which the existing implementation couldn't).
For a benchmark, do you mean adding it to benchmarks/Main and then comparing it to the xlsx2csv program or to the existing implementation? Or both?
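As a point of reference, a rough sketch of what such an entry in benchmarks/Main could look like with criterion is below. The input path is a placeholder, the `nf` calls assume an `NFData` instance for `Xlsx`, and the streaming entries are only indicated as comments since that API isn't shown in this thread.

```haskell
module Main (main) where

import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.ByteString.Lazy as LB
import Codec.Xlsx (toXlsx, toXlsxFast)

main :: IO ()
main = do
  bs <- LB.readFile "data/testInput.xlsx"          -- illustrative input file
  defaultMain
    [ bgroup "readFile"
        [ bench "with xlsx"      $ nf toXlsx bs     -- assumes NFData Xlsx
        , bench "with xlsx fast" $ nf toXlsxFast bs
        -- , bench "with stream (counting)" $ ...   -- streaming entries would
        -- , bench "with stream (reading)"  $ ...   -- slot in alongside these
        ]
    ]
```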
Just choose whichever you find more appropriate.
Sorry, I haven't addressed all the comments yet, nor have I done the benchmark; I'll see if I can find another window soon.
No problem, there seems to be no particular rush.
These are the results of the benchmark; the streaming parse solution is surprisingly competitive with the existing variant:
benchmarking readFile/with xlsx
time 137.7 ms (133.7 ms .. 140.8 ms)
0.999 R² (0.996 R² .. 1.000 R²)
mean 136.5 ms (134.1 ms .. 138.8 ms)
std dev 3.580 ms (2.321 ms .. 5.190 ms)
variance introduced by outliers: 11% (moderately inflated)
benchmarking readFile/with xlsx fast
time 28.18 ms (28.00 ms .. 28.48 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 28.33 ms (28.10 ms .. 29.02 ms)
std dev 796.8 μs (198.1 μs .. 1.473 ms)
benchmarking readFile/with stream (counting)
time 13.57 ms (13.49 ms .. 13.65 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 13.56 ms (13.52 ms .. 13.61 ms)
std dev 120.9 μs (96.27 μs .. 156.7 μs)
benchmarking readFile/with stream (reading)
time 33.17 ms (33.05 ms .. 33.32 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 33.11 ms (32.93 ms .. 33.25 ms)
std dev 343.3 μs (226.4 μs .. 545.2 μs)
benchmarking writeFile/stream
time 88.02 ms (87.62 ms .. 88.33 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 88.32 ms (88.15 ms .. 88.49 ms)
std dev 290.4 μs (181.4 μs .. 424.5 μs)
Counting skips the parsing step and just counts the rows in an Excel file; reading does a full parse of each row. Keep in mind that the goal of streaming isn't speed but constant memory.
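A hedged illustration of those two benchmark modes, using a hypothetical row-at-a-time fold (none of these names are the PR's real API): because rows are handed to the consumer one at a time, memory stays constant no matter how many rows the file has.

```haskell
{-# LANGUAGE RankNTypes #-}
module StreamSketch where

import Data.Text (Text)

-- Hypothetical row type and a strict left fold over the rows of one sheet.
type Row = [(Int, Text)]                          -- column index with cell text
type FoldRows = forall a. (a -> Row -> IO a) -> a -> IO a

-- "counting": tally rows without inspecting them (with a lazy source the
-- per-row parse need never be forced).
countRows :: FoldRows -> IO Int
countRows foldRows = foldRows (\n _ -> pure $! n + 1) 0

-- "reading": force every cell of every row, still one row at a time.
readRows :: FoldRows -> IO Int
readRows foldRows = foldRows (\n row -> pure $! n + length row) 0
```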
@jappeace could you add non-streamed writing for comparison?
Benchmarks for the existing writing function; I also added the streaming variant that creates a shared strings table (see the sketch of the shared-strings idea after these numbers):
benchmarking readFile/with xlsx
time 130.6 ms (127.9 ms .. 133.4 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 131.4 ms (129.7 ms .. 133.3 ms)
std dev 2.832 ms (1.906 ms .. 4.470 ms)
variance introduced by outliers: 11% (moderately inflated)
benchmarking readFile/with xlsx fast
time 27.49 ms (27.29 ms .. 27.72 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 28.16 ms (27.78 ms .. 28.97 ms)
std dev 1.159 ms (298.6 μs .. 1.773 ms)
variance introduced by outliers: 10% (moderately inflated)
benchmarking readFile/with stream (counting)
time 13.29 ms (13.23 ms .. 13.35 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 13.32 ms (13.28 ms .. 13.39 ms)
std dev 124.6 μs (81.60 μs .. 214.2 μs)
benchmarking readFile/with stream (reading)
time 32.86 ms (32.70 ms .. 32.97 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 32.83 ms (32.59 ms .. 32.97 ms)
std dev 373.9 μs (155.5 μs .. 655.5 μs)
benchmarking writeFile/with xlsx
time 83.07 ms (82.81 ms .. 83.30 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 82.68 ms (82.33 ms .. 82.85 ms)
std dev 415.2 μs (170.9 μs .. 677.9 μs)
benchmarking writeFile/with stream (no sst)
time 88.15 ms (87.88 ms .. 88.35 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 88.00 ms (87.83 ms .. 88.12 ms)
std dev 248.2 μs (176.6 μs .. 321.5 μs)
benchmarking writeFile/with stream (sst)
time 89.90 ms (89.71 ms .. 90.11 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 89.95 ms (89.85 ms .. 90.05 ms)
std dev 168.1 μs (132.2 μs .. 223.6 μs)
In the future we may add a faster variant for writing.
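For context on the (sst) versus (no sst) rows above: a shared strings table stores each distinct cell text once and cells refer to it by index, so the writer has to intern strings as it goes. A minimal sketch of that interning step, with illustrative names rather than the PR's actual writer code:

```haskell
module SharedStrings where

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import Data.Text (Text)

-- Give a cell's text a stable index in the shared strings table, reusing the
-- existing index when the same text has been seen before.
intern :: Text -> Map Text Int -> (Int, Map Text Int)
intern t table =
  case Map.lookup t table of
    Just ix -> (ix, table)
    Nothing -> let ix = Map.size table
               in (ix, Map.insert t ix table)
```

That extra bookkeeping is presumably why the (sst) run comes out slightly slower than (no sst), where cell text is not deduplicated.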
Great work @jappeace! This change looks good enough to release version 1.0 of the library, but I'd also like to have https://github.com/ocramz/xeno/issues/53 resolved so it could support GHC 9.2 as well. Unfortunately I haven't had time yet to look into that problem.
I'll see if I can fix it on Wednesday or over the holiday.
This goes a long way toward implementing https://github.com/qrilka/xlsx/issues/132.
It adds both a parser module and a writer module for streaming xlsx files, although, in line with the library in general, they offer "only basic functionality at the moment".
The parser doesn't use conduit because we wanted to make it as fast as xlsx2csv, which we did (credit goes to Tim). The writer is a lot slower, and in the future we may need to change this API as well to speed it up. However, both are functional now and in production at supercede. This doesn't mean we're set on this particular approach; we welcome feedback.
I'm sorry for doing this as one massive code dump, and I don't expect it to be accepted quickly. We, however, had to get it done quickly ourselves, which is why we decided to work from a temporary fork. I hope these changes are still welcome upstream even though it's a lot in one go.