Optimize worksheet output.

MarkPflug commented 1 year ago

Hello, I was adding your library to a set of benchmarks that I maintain, and noted that the performance of your library was pretty impressive compared to most of the other libraries I've tested. Looking at your implementation, it is very similar to my own library: Sylvan.Data.Excel. This PR contains a few optimizations that you might consider that I discovered while working on my library. LargeXlsx is certainly more useful than mine, as mine is extremely minimal for a specific use-case. I don't deal with styles at all, for example.

The three optimizations are:

1) don't use any WriteLine calls, just use Write. The new lines make the xml easier to read for a human but are unnecessary for Excel and inflate the file size. 2) Excel doesn't require the row/cell reference (r attribute) and will default to the "next" cell. You only need to write the r attribute when there is a gap, after you skip rows or columns. 3) You don't need to write the style (s attribute) when it is the default of 0, which is extremely common in most spreadsheets.

All of these changes can significantly improve the speed of writing and reduce the output file size.

These are the benchmark results (SS == WriteSharedStrings instead of Write(String)) comparing before and after this change.

BEFORE:

Method	Mean	Allocated	FileSize
LargeXlsxSS	765.9 ms	63977.16 KB	6158 KB
LargeXlsx	805.2 ms	99780.2 KB	6810 KB

AFTER

Method	Mean	Allocated	FileSize
LargeXlsxSS	421.1 ms	30537.88 KB	3224 KB
LargeXlsx	464.6 ms	66340.59 KB	3747 KB

salvois commented 1 year ago

Hi @MarkPflug , thanks for your feedback and your pull request! I'm glad to have a look at the proposed changes. The WriteLine's were actually very intentional, because they has helped inspecting very large files for troubleshooting while not needing a full XML format to ease human readability. I assumed (and measured if I remember correctly, but I'll double check) the extra newlines would be negligible. Out of curiosity, what does the "allocated" column in your benchmark represent? I guess it's not RAM, as LargeXlsx allocates basically "none" of it. I'm trying it all and let you know, thanks! Salvo

MarkPflug commented 1 year ago

Understand about the newlines. Hard to deal with large xml if it isn't formatted. I have a tool that I use to apply formatting in such cases for diagnosing issues, since Visual Studio refuses to handle files past some modest limit. If you want to keep them, you might explicitly set the TextWriter.NewLine to '\n', to save a character on windows systems.

The allocated column is memory allocations. I suspect these are almost entirely from the implicit allocations converting values (int double etc) to strings when writing. I wouldn't worry too much about them, since they're going to be gen-0 and very cheap.

On my phone right now, so sorry for any typos.

salvois commented 1 year ago

The performance improvement is impressive @MarkPflug ! I think this is yet another demonstration that the best optimization is not doing a useless job :) I'm merging your pull request and thinking about the newlines afterwards. Thanks, Salvo

salvois / LargeXlsx

Optimize worksheet output. #21