salvois / LargeXlsx

A .net library to write large Excel files in XLSX format with low memory consumption using streamed write.
Other
204 stars 34 forks source link

Optimize worksheet output. #21

Closed MarkPflug closed 1 year ago

MarkPflug commented 1 year ago

Hello, I was adding your library to a set of benchmarks that I maintain, and noted that the performance of your library was pretty impressive compared to most of the other libraries I've tested. Looking at your implementation, it is very similar to my own library: Sylvan.Data.Excel. This PR contains a few optimizations that you might consider that I discovered while working on my library. LargeXlsx is certainly more useful than mine, as mine is extremely minimal for a specific use-case. I don't deal with styles at all, for example.

The three optimizations are:

1) don't use any WriteLine calls, just use Write. The new lines make the xml easier to read for a human but are unnecessary for Excel and inflate the file size. 2) Excel doesn't require the row/cell reference (r attribute) and will default to the "next" cell. You only need to write the r attribute when there is a gap, after you skip rows or columns. 3) You don't need to write the style (s attribute) when it is the default of 0, which is extremely common in most spreadsheets.

All of these changes can significantly improve the speed of writing and reduce the output file size.

These are the benchmark results (SS == WriteSharedStrings instead of Write(String)) comparing before and after this change.

BEFORE:

Method Mean Allocated FileSize
LargeXlsxSS 765.9 ms 63977.16 KB 6158 KB
LargeXlsx 805.2 ms 99780.2 KB 6810 KB

AFTER

Method Mean Allocated FileSize
LargeXlsxSS 421.1 ms 30537.88 KB 3224 KB
LargeXlsx 464.6 ms 66340.59 KB 3747 KB
salvois commented 1 year ago

Hi @MarkPflug , thanks for your feedback and your pull request! I'm glad to have a look at the proposed changes. The WriteLine's were actually very intentional, because they has helped inspecting very large files for troubleshooting while not needing a full XML format to ease human readability. I assumed (and measured if I remember correctly, but I'll double check) the extra newlines would be negligible. Out of curiosity, what does the "allocated" column in your benchmark represent? I guess it's not RAM, as LargeXlsx allocates basically "none" of it. I'm trying it all and let you know, thanks! Salvo

MarkPflug commented 1 year ago

Understand about the newlines. Hard to deal with large xml if it isn't formatted. I have a tool that I use to apply formatting in such cases for diagnosing issues, since Visual Studio refuses to handle files past some modest limit. If you want to keep them, you might explicitly set the TextWriter.NewLine to '\n', to save a character on windows systems.

The allocated column is memory allocations. I suspect these are almost entirely from the implicit allocations converting values (int double etc) to strings when writing. I wouldn't worry too much about them, since they're going to be gen-0 and very cheap.

On my phone right now, so sorry for any typos.

salvois commented 1 year ago

The performance improvement is impressive @MarkPflug ! I think this is yet another demonstration that the best optimization is not doing a useless job :) I'm merging your pull request and thinking about the newlines afterwards. Thanks, Salvo