Closed jmcnamara closed 1 month ago
First, thank you for all the work you put into this project! rust_xlsxwriter is a fantastic crate.
Most of the discussion here is over my head, but I'd prefer the simplest solution.
Your workbook.register_format(&format1); method seems like it would be okay but error prone, with users likely to miss registering a format at times. Would not registering a format result in some form of error/feedback or just not apply the format in the file?
What if all formats were registered at one spot where the workbook is saved? Including the formats in the save call would remind users about this step and keep it confined to one location. Something like:
// If not using/registering any formats
workbook.save_constant_memory(path, None);
// If using formats, register them all at once
workbook.save_constant_memory(path, Some(vec![&format1, &format2, ...]));
Or, what if the Workbook object did keep track of used Format objects so that whenever a Format is applied, the Workbook internally registers it and uses the same id for matching formats?
@zachbateman Thanks for the feedback and the questions. First let me clarify one important part that I left out of the explanation above. I'll update the introduction with the new information.
Each unique format in the workbook (with a unique combination of properties such as bold, italic, number format, color) must have a unique ID. And, in the "constant_memory" mode case that unique ID must be known at the time of writing the data so that it can be added to the cell XML element like the s="1" (style == 1) in the above example:
<c r="D3" s="1">
<v>1.23456</v>
</c>
Therefore in the case where the data is written when the user calls write() the format ID can't be generated when the Format is created (since the final properties of the format and its uniqueness are unknown) or at save() since that is too late. It is also worth noting that calculating the IDs at save() time is what the library does at the moment in the non "constant memory" case.
Would not registering a format result in some form of error/feedback or just not apply the format in the file?
Yes! Trying to call write_with_format() with an unregistered format would raise an error.
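To make the registration idea concrete, here is a minimal sketch of a registry that hands out one ID per unique format and fails on lookup when a format was never registered. It is only an illustration under stated assumptions: `Format` is a cut-down stand-in rather than the crate's type, and `FormatRegistry`, `style_id` and `FormatError` are hypothetical names, not part of rust_xlsxwriter.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's `Format`: only a few of the
/// properties that make a format unique are modelled.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Format {
    bold: bool,
    italic: bool,
    num_format: String,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum FormatError {
    Unregistered,
}

/// Hypothetical workbook-level table mapping formats to style IDs.
#[derive(Default)]
struct FormatRegistry {
    ids: HashMap<Format, u32>,
}

impl FormatRegistry {
    /// Return the ID of an identical, already registered format, or assign
    /// the next free one. ID 0 is left for the implicit default format.
    fn register_format(&mut self, format: &Format) -> u32 {
        let next_id = self.ids.len() as u32 + 1;
        *self.ids.entry(format.clone()).or_insert(next_id)
    }

    /// Look up the ID at write time; fail if the format was never registered.
    fn style_id(&self, format: &Format) -> Result<u32, FormatError> {
        self.ids.get(format).copied().ok_or(FormatError::Unregistered)
    }
}

fn main() {
    let number = Format { bold: false, italic: false, num_format: "0.00".to_string() };
    let bold = Format { bold: true, italic: false, num_format: "General".to_string() };

    let mut registry = FormatRegistry::default();
    registry.register_format(&number);

    println!("{:?}", registry.style_id(&number)); // prints "Ok(1)"
    println!("{:?}", registry.style_id(&bold)); // prints "Err(Unregistered)"
}
```

Because the ID is known as soon as the format is registered, it can be written into the s attribute of a cell at the time the row is emitted.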
I'll update the discussion above with some examples to help clarify how it would work in practice.
The registration method is good. If you don't need to order and the key is a number, you can use nohash-hasher, which should be faster.
@han1548772930
If you don't need to order and the key is a number, you can use nohash-hasher, which should be faster.
Thanks for that. The hash key needs to be made up from all the format properties so the simple no_hash method won't work. Anyway, there is already a Format hash function in the code and it is fairly efficient (even though it won't be called very often in this use case):
https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/format.rs#L454
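As a rough illustration of why the key has to cover every property, the sketch below derives a hash from a whole property struct using only the standard library. `FormatProps` and `format_key` are hypothetical names and this is not the hash function linked above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical subset of format properties. Every field takes part in the hash.
#[derive(Hash)]
struct FormatProps {
    bold: bool,
    italic: bool,
    num_format: &'static str,
    font_color: u32,
}

/// Combine all properties into one 64-bit key.
fn format_key(props: &FormatProps) -> u64 {
    let mut hasher = DefaultHasher::new();
    props.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let a = FormatProps { bold: true, italic: false, num_format: "0.00", font_color: 0x000000 };
    let b = FormatProps { bold: true, italic: false, num_format: "0.00", font_color: 0x000000 };
    let c = FormatProps { bold: true, italic: true, num_format: "0.00", font_color: 0x000000 };

    // Identical property sets give the same key; changing one property changes it.
    println!("a == b: {}", format_key(&a) == format_key(&b));
    println!("a == c: {}", format_key(&a) == format_key(&c));
}
```

A pass-through hasher such as nohash-hasher only helps when the key is already a single integer, which is not the case here.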
The register_format() method was also in the code base but I removed it last year. So that part is also ready to go.
In fact most of the internal plumbing is there. However, before going down that route (which will work since the Python/C/Perl versions use it) I'm sounding people out in case there is a better way of handling this.
Using register_format() and raising an error that clearly describes the issue while in constant_memory mode seems good.
For me, this would be an optimization that I'd use if specifically needed for a large file, and in that case I'd already be looking deeper into the docs. Adding register_format() calls that are checked with clear error messages wouldn't be a problem.
I have an initial working version of the "constant memory" mode on the constant_memory branch. It currently has limited functionality but there is enough to allow me to benchmark the potential savings.
The memory usage profile is effectively flat (as designed):
Cells | Standard - Size (MB) | Constant Memory - Size (MB) | Standard - Time (s) | Constant Memory - Time (s) |
---|---|---|---|---|
100,000 | 16.213 | 0.021 | 0.101 | 0.088 |
200,000 | 32.405 | 0.021 | 0.214 | 0.179 |
300,000 | 52.794 | 0.021 | 0.335 | 0.276 |
400,000 | 64.793 | 0.021 | 0.443 | 0.369 |
500,000 | 76.792 | 0.021 | 0.564 | 0.468 |
600,000 | 105.569 | 0.021 | 0.673 | 0.564 |
700,000 | 117.567 | 0.021 | 0.768 | 0.669 |
800,000 | 129.567 | 0.021 | 0.874 | 0.799 |
900,000 | 141.566 | 0.021 | 1.002 | 0.862 |
1,000,000 | 153.565 | 0.021 | 1.081 | 1.022 |
Which looks like this:
Similarly to the Python version the performance is also slightly better (5-15%) in this mode:
The tests were run like this:
./target/release/examples/app_memory_test 4000
./target/release/examples/app_memory_test 4000 --constant-memory
So the initial results are good. I'll continue with the functionality.
I've got initial format support working using the "register format" method proposed above. It is in the latest version on the branch.
However, it is brittle and there are a lot of edge cases to work around. So I'm going to take a detour and see if I can implement the shared format ids via an Arc<RwLock<HashMap>> or similar.
I got the shared Format lookup functionality working via an Arc<RwLock<HashMap>>.
Now it isn't required to "register" the formats. The only change needed to a standard program is to change workbook.add_worksheet() to workbook.add_worksheet_with_constant_memory() and everything else happens automatically.
This is a better user experience and also less error prone internally.
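For readers unfamiliar with the pattern, a shared lookup of this kind can be sketched as follows. This is not the code on the branch: `SharedFormats`, `format_id` and the string key are hypothetical, and the real key would be derived from the format properties.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical handle that a workbook and each of its worksheets could clone.
#[derive(Clone, Default)]
struct SharedFormats {
    table: Arc<RwLock<HashMap<String, u32>>>,
}

impl SharedFormats {
    /// Return the ID for `key`, assigning a new one the first time it is seen.
    fn format_id(&self, key: &str) -> u32 {
        // Fast path: a shared read lock, dropped at the end of this block.
        {
            let table = self.table.read().expect("format table lock poisoned");
            if let Some(&id) = table.get(key) {
                return id;
            }
        }
        // Slow path: an exclusive write lock. `entry` checks again in case
        // another thread inserted the key between the two locks.
        let mut table = self.table.write().expect("format table lock poisoned");
        let next_id = table.len() as u32 + 1;
        *table.entry(key.to_string()).or_insert(next_id)
    }
}

fn main() {
    let workbook_formats = SharedFormats::default();
    let worksheet1 = workbook_formats.clone();
    let worksheet2 = workbook_formats.clone();

    // Both worksheets see the same ID for the same format key.
    println!("{}", worksheet1.format_id("bold")); // prints "1"
    println!("{}", worksheet2.format_id("bold")); // prints "1"
    println!("{}", worksheet2.format_id("num:0.00")); // prints "2"
}
```

The read guard is released before the write lock is requested, so a single thread never holds both locks at once.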
@adriandelgado when you get a chance could you check my RwLock implementation here: https://github.com/jmcnamara/rust_xlsxwriter/blob/constant_memory/src/workbook.rs#L2118C1-L2139C1
The function can't (as far as I can see) be called twice from the same thread so it is unlikely to deadlock. If there are a lot of read() calls it could starve a write() call. However, if that is an issue it could be fixed/worked around.
Available upstream in v0.78.0.
Constant Memory mode
The Python version of xlsxwriter has a constant_memory mode that limits the amount of data stored in memory.
The optimization works by flushing each row to disk after a subsequent row is written. In this way the largest amount of data held in memory for a worksheet is the amount of data required to hold a single row of data.
When the overall worksheet file is written the on-disk data is copied in at the correct point in the file.
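The row flushing described above can be sketched in a few lines. This is a simplified illustration, not the library's writer: `RowWriter`, `write_number`, `flush_row` and `cell_ref` are hypothetical names, only number cells are handled, and the sketch assumes rows are written in increasing order.

```rust
use std::io::{self, Write};

/// Convert zero-indexed (row, col) to an A1-style reference, e.g. (2, 3) -> "D3".
fn cell_ref(row: u32, col: u16) -> String {
    let mut n = u32::from(col) + 1;
    let mut letters = Vec::new();
    while n > 0 {
        letters.push(b'A' + ((n - 1) % 26) as u8);
        n = (n - 1) / 26;
    }
    letters.reverse();
    format!("{}{}", String::from_utf8_lossy(&letters), row + 1)
}

/// Holds at most one row of cells in memory.
struct RowWriter<W: Write> {
    out: W,
    current_row: Option<u32>,
    cells: Vec<(u16, f64)>,
}

impl<W: Write> RowWriter<W> {
    fn new(out: W) -> Self {
        RowWriter { out, current_row: None, cells: Vec::new() }
    }

    /// Buffer a cell. Starting a different row flushes the previous one first.
    fn write_number(&mut self, row: u32, col: u16, value: f64) -> io::Result<()> {
        if self.current_row != Some(row) {
            self.flush_row()?;
            self.current_row = Some(row);
        }
        self.cells.push((col, value));
        Ok(())
    }

    /// Write the buffered row as XML and clear the buffer.
    fn flush_row(&mut self) -> io::Result<()> {
        if let Some(row) = self.current_row.take() {
            write!(self.out, "<row r=\"{}\">", row + 1)?;
            for (col, value) in self.cells.drain(..) {
                write!(self.out, "<c r=\"{}\"><v>{}</v></c>", cell_ref(row, col), value)?;
            }
            write!(self.out, "</row>")?;
        }
        Ok(())
    }
}

fn main() -> io::Result<()> {
    // A Vec<u8> stands in for the temporary on-disk file.
    let mut writer = RowWriter::new(Vec::new());
    writer.write_number(0, 0, 1.5)?;
    writer.write_number(0, 1, 2.5)?;
    writer.write_number(1, 0, 3.5)?; // flushes row 1
    writer.flush_row()?; // flushes the final row
    println!("{}", String::from_utf8_lossy(&writer.out));
    Ok(())
}
```

Memory use is bounded by the widest row because the cell buffer is emptied each time a new row is started.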
The tables below show the advantage of this approach.
Constant Memory performance data
XlsxWriter in normal operation mode: the execution time and memory usage increase more or less linearly with the number of rows:
Note, the rust_xlsxwriter memory usage is lower but follows a similar linear trend.
XlsxWriter in constant_memory mode: the execution time still increases linearly with the number of rows but the memory usage remains small and constant:
In constant_memory mode the performance should be approximately the same as normal mode.
Limitation of Constant Memory mode
The constant_memory mode is intended for writing large data sets efficiently but it has some limitations:
How it works in practice
Consider the following worksheet with 3 strings, one of which (Pear) is in bold, and 2 numbers. The numbers are both 1.23456 but the second one is formatted to 0.00:
Here is what the XML of the worksheet looks like. Horizontal and vertical whitespace has been added for clarity.
Some things to notice here:
- The data is laid out in row and column format.
- Each row has <c> cell elements. The cell has attributes:
  - r: The cell location.
  - t: The cell type. t="s" is a "shared string" and no attribute means a "number" cell. The value for the type is stored in the <v> sub-element.
  - s: The style/format of the cell, for cells with a format. There are two in this example: s="1" for the number format in cell "D3" and s="2" for the bold string format in cell "B4". (There is also an implicit s="0" format for all other cells.)
- The strings aren't visible in this file. Instead an index 0, 1, 2 in the <v> elements is used to refer to a "Shared string" table. This is an optimization to avoid storing repeated string data.
- The numbers are stored at their original precision and the format is only applied when the file is displayed.
The issue with writing this information in a row by row method is that there are two "global" pieces of information that need to be known at the time of writing: the string <v> index and the formatted cells' s="" attribute format index.
For example, each unique format in the workbook (with a unique combination of properties such as bold, italic, number format, color) must have a unique ID (1 and 2 in the example above). This must be known at the time of writing the data so that it can be added to the cell XML element like the s="1" (style == 1) in the above example:
The Python library resolves these issues in two ways. For the string data it uses another Excel cell type called inlineStr to store string data without a lookup. The output XML in that case would look like this:
The format index is obtained via a workbook global lookup when the format is created like this:
Note that the formats are created by the "Workbook" and therefore their index can be generated uniquely at creation time.
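To show the difference between the two string encodings discussed above, here is a small sketch that renders the same cell as a shared string reference and as an inline string. The helper names are hypothetical and the output is the bare cell element only. The element layout follows the SpreadsheetML cell format, where an inline string is stored in an <is><t> child instead of a <v> index.

```rust
/// Minimal XML escaping for text content.
fn escape(text: &str) -> String {
    text.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}

/// Render the optional style attribute. Style 0 is the implicit default.
fn style_attr(style: u32) -> String {
    if style > 0 {
        format!(" s=\"{}\"", style)
    } else {
        String::new()
    }
}

/// Shared string cell: needs the global string table index at write time.
fn shared_string_cell(cell: &str, string_index: usize, style: u32) -> String {
    format!("<c r=\"{}\"{} t=\"s\"><v>{}</v></c>", cell, style_attr(style), string_index)
}

/// Inline string cell: carries the text itself, so no global lookup is needed.
fn inline_string_cell(cell: &str, text: &str, style: u32) -> String {
    format!(
        "<c r=\"{}\"{} t=\"inlineStr\"><is><t>{}</t></is></c>",
        cell,
        style_attr(style),
        escape(text)
    )
}

fn main() {
    // The bold "Pear" cell from the example, in both encodings.
    println!("{}", shared_string_cell("B4", 2, 2));
    println!("{}", inline_string_cell("B4", "Pear", 2));
}
```

Only the format index remains a global dependency once strings are written inline.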
As an aside, the string indexes could also be handled like this but since the inlineStr cell type was available that was used instead.
The problem in rust_xlsxwriter
The inline string method described in the previous section can also be used with rust_xlsxwriter. However, Format objects aren't created or owned by a Workbook object so that presents a challenge.
For comparison the file above would be created using the current API like this:
The scheme I intend to use is to require that formats are "registered" with a workbook, which will give them a unique ID that can be referred to when writing the format attribute.
Something like this:
Trying to write data with an unregistered format would raise an XlsxError.
Help needed
If you have read this far, thanks. I'm looking for input for other ways of handling this.
One possible method would be to use an on-disk version of the internal BTreeMap container that stores all the worksheet cell information. This would probably be the least disruptive solution and would eliminate the need to work around the Table/Merged Range limitations listed above. There seem to be some existing implementations such as:
Another solution might be storing the cell information in an SQLite database.
Any other thoughts, suggestions?