jmcnamara / rust_xlsxwriter

A Rust library for creating Excel XLSX files.
https://crates.io/crates/rust_xlsxwriter
Apache License 2.0
348 stars 25 forks source link

Help needed: input on "constant_memory" mode #111

Closed jmcnamara closed 1 month ago

jmcnamara commented 2 months ago

Constant Memory mode

The Python version of xlsxwriter has a constant_memory mode that limits the amount of data stored in memory.

The optimization works by flushing each row to disk after a subsequent row is written. In this way the largest amount of data held in memory for a worksheet is the amount of data required to hold a single row of data.

When the overall worksheet file is written the on-disk data is copied in at the correct point in the file.

The tables below shows the advantage of this approach.

Constant Memory performance data

XlsxWriter in normal operation mode: the execution time and memory usage increase more or less linearly with the number of rows:

Rows Columns Time (s) Memory (bytes)
200 50 0.43 2346728
400 50 0.84 4670904
800 50 1.68 8325928
1600 50 3.39 17855192
3200 50 6.82 32279672
6400 50 13.66 64862232
12800 50 27.60 128851880

Note, the rust_xlsxwriter memory usage is lower but follows a similar linear trend.

XlsxWriter in constant_memory mode: the execution time still increases linearly with the number of rows but the memory usage remains small and constant:

Rows Columns Time (s) Memory (bytes)
200 50 0.37 62208
400 50 0.74 62208
800 50 1.46 62208
1600 50 2.93 62208
3200 50 5.90 62208
6400 50 11.84 62208
12800 50 23.63 62208

In constant_memory mode the performance should be approximately the same as normal mode.

Limitation of Constant Memory mode

The constant_memory mode is intended for writing large data sets efficiently but it has some limitations:

How it works in practice

Consider the following worksheet with 3 strings, one of which (Pear) is in bold, and 2 numbers. The numbers are both 1.23456 but the second one is formatted to 0.00:

screenshot

Here is what the XML of the worksheet looks like. Horizontal and vertical whitespace has been added for clarity.

<worksheet xmlns="..." xmlns:r="...s">
  <dimension ref="B2:D6"/>
  <sheetViews>
    <sheetView tabSelected="1" workbookViewId="0"/>
  </sheetViews>
  <sheetFormatPr defaultRowHeight="15"/>

  <sheetData>
    <row r="2" spans="2:4">
      <c r="B2" t="s">
        <v>0</v>
      </c>
      <c r="D2">
        <v>1.23456</v>
      </c>
    </row>

    <row r="3" spans="2:4">
      <c r="D3" s="1">
        <v>1.23456</v>
      </c>
    </row>

    <row r="4" spans="2:4">
      <c r="B4" s="2" t="s">
        <v>1</v>
      </c>
    </row>

    <row r="6" spans="2:4">
      <c r="B6" t="s">
        <v>2</v>
      </c>
    </row>

  </sheetData>

  <pageMargins/>
</worksheet>

Some things to notice here:

The issue with writing this information in a row by row method is that there are two "global" pieces of information that need to be known at the time of writing: the string <v> index and the formatted cells s="" attribute format index.

For example, each unique format in the workbook (with a unique combination of properties such as bold, italic, number format, color) must have a unique ID (1 and 2 in the example above). This must be known at the time of writing the data so that the it can be added to the cell XML element like the s="1" (style == 1) in the above example:

      <c r="D3" s="1">
        <v>1.23456</v>
      </c>

The Python library resolves these issues in two ways. For the string data it uses another Excel cell type called inlineStr to store string data without a lookup. The output xml in that case would look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="..." xmlns:r="...">
  <dimension ref="B2:D6"/>
  <sheetViews>
    <sheetView tabSelected="1" workbookViewId="0"/>
  </sheetViews>
  <sheetFormatPr defaultRowHeight="15"/>

  <sheetData>
    <row r="2">
      <c r="B2" t="inlineStr">
        <is>
          <t>Apple</t>
        </is>
      </c>
      <c r="D2">
        <v>1.23456</v>
      </c>
    </row>

    <row r="3">
      <c r="D3" s="1">
        <v>1.23456</v>
      </c>
    </row>

    <row r="4">
      <c r="B4" s="2" t="inlineStr">
        <is>
          <t>Pear</t>
        </is>
      </c>
    </row>

    <row r="6">
      <c r="B6" t="inlineStr">
        <is>
          <t>Orange</t>
        </is>
      </c>
    </row>
  </sheetData>

  <pageMargins/>
</worksheet>

The format index is obtained via a workbook global lookup when the format is created like this:

bold = workbook.add_format({"bold": True})
num_format = workbook.add_format({"num_format": "0.00"})

Note that the formats are created by the "Workbook" and therefore their index can be generated uniquely at creation time.

As an aside, the string indexes could also be handled like this but since there was the inlineStr cell type was available that was used instead.

The problem in rust_xlsxwriter

The inline string method described in the previous section can also be used with rust_xlsxwriter. However, Format objects aren't created or owned by a Workbook object so that presents a challenge.

For comparison the file above would be created using the current API like this:

// Current API and behaviour.
use rust_xlsxwriter::{Format, Workbook, XlsxError};

fn main() -> Result<(), XlsxError> {
    // Create a new Excel file object.
    let mut workbook = Workbook::new();

    // Add some formats.
    let bold = Format::new().set_bold();
    let num_format = Format::new().set_num_format("0.00");

    // Add a worksheet to the workbook.
    let worksheet = workbook.add_worksheet();

    // Write some data to the worksheet.
    worksheet.write(1, 1, "Apple")?;
    worksheet.write(1, 3, 1.23456)?;

    worksheet.write_with_format(2, 3, 1.23456, &num_format)?;

    worksheet.write_with_format(3, 1, "Pear", &bold)?;

    worksheet.write(5, 1, "Orange")?;

    // Save the file to disk.
    workbook.save("test.xlsx")?;

    Ok(())
}

The scheme I intend to use is to require that formats are "registered" with a workbook, which will give them a unique ID that that can referred to when writing the format attribute.

Something like this:

// PROPOSED OPTIONAL API.
use rust_xlsxwriter::{Format, Workbook, XlsxError};

fn main() -> Result<(), XlsxError> {
    // Create a new Excel file object.
    let mut workbook = Workbook::new();

    // Add some formats.
    let bold = Format::new().set_bold();
    let num_format = Format::new().set_num_format("0.00");

    // Add a worksheet to the workbook.
    let mut worksheet = workbook.add_worksheet();

    // NEW.
    // Set the worksheet constant memory mode.
    worksheet.use_constant_memory_mode();

    // NEW.
    // Register the formats before using them.
    workbook.register_format(&num_format); // Gets an ID of 1.
    workbook.register_format(&bold);       // Gets an ID of 2.

    // Write some data to the worksheet.
    worksheet.write(1, 1, "Apple")?;
    worksheet.write(1, 3, 1.23456)?;

    worksheet.write_with_format(2, 3, 1.23456, &num_format)?;

    worksheet.write_with_format(3, 1, "Pear", &bold)?;

    worksheet.write(5, 1, "Orange")?;

    // Save the file to disk.
    workbook.save("test.xlsx")?;

    Ok(())
}

Trying to write data with an unregistered format would raise an XlsxError.

Help needed

If you have read this far, thanks. I'm looking for input for other ways of handling this.

One possible method would be to use an on-disk version of the internal BTreeMap container that stores all the worksheet cell information. This would probably be the least disruptive solution and would eliminate the need to work around the Table/Merged Range limitations listed above. There seems to be some existing implementations such as:

Another solution might be storing the cell information in an SQL/LITE DB.

Any other thoughts, suggestions?

zachbateman commented 2 months ago

First, thank you for all the work you put into this project! rust_xlsxwriter is a fantastic crate.

Most of the discussion here is over my head, but I'd prefer the most simple solution.

Your workbook.register_format(&format1); method seems like it would be okay but error prone with users likely to miss registering a format at times. Would not registering a format result in some form of error/feedback or just not apply the format in the file?

What if all formats were registered at one spot where the workbook is saved? Including the formats in the save call would remind users about this step and keep it confined to one location. Something like:

// If not using/registering any formats
workbook.save_constant_memory(path, None);

// If using formats, register them all at once
workbook.save_constant_memory(path, Some(vec![&format1, &format2, ...]));

Or, what if the Workbook object did keep track of used Format objects so that whenever a Format is applied, the Workbook internally registers it and uses the same id for matching formats?

jmcnamara commented 2 months ago

@zachbateman Thanks for the feedback and the questions. First let me clarify one important part that I left out of the explanation above. I'll update the introduction with the new information.

Each unique format in the workbook (with a unique combination of properties such as bold, italic, number format, color) must have a unique ID. And, in the "constant_memory" mode case that unique ID must be known at the time of writing the data so that the it can be added to the cell XML element like the s="1" (style == 1) in the above example:

      <c r="D3" s="1">
        <v>1.23456</v>
      </c>

Therefore in the case where the data is written when the user calls write() the format ID can't be generated when the Format is created (since the final properties of the format and it uniqueness are unknown) or at save() since that is too late. It is also worth noting that calculating the IDs at save() time is what the library does at the moment in the non "constant memory" case.

Would not registering a format result in some form of error/feedback or just not apply the format in the file?

Yes! Trying to call write_with_format() with an unregistered format would raise an error.

I'll update the discussion above with some examples to help clarify how it would work in practice.

han1548772930 commented 2 months ago

The registration method is good. If you don't need to order and the key is a number, you can use nohash-hasher, which should be faster.

jmcnamara commented 2 months ago

@han1548772930

If you don't need to order and the key is a number, you can use nohash-hasher, which should be faster.

Thanks for that. The hash key needs to be made up from all the format properties so the simple no_hash method won't work. Anyway, there is already a Format hash function in the code and it is fairly efficient (even though it won't be called very often in this use case):

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/format.rs#L454

The register_format() was also in the code base but I removed it last year. So that part is also ready to go:

https://github.com/jmcnamara/rust_xlsxwriter/commit/dcc3a3d084efd46e316188c80ef63226c7d67a42#diff-4ea682f1327d5715d346afbd6ceac182cb6c8c92ada03a8cc995255b7253164dL715-L737

In fact most of the internal plumbing is there. However, before going down that route (which will work since the Python/C/Perl version use it) I'm sounding people out in case there is a better way of handling this.

zachbateman commented 2 months ago

Using register_format() and raising an error that clearly describes the issue while in constant_memory mode seems good.

For me, this would be an optimization that I'd use if specifically needed for a large file, and in that case I'd already be looking deeper into the docs. Adding register_format() calls that are checked with clear error messages wouldn't be a problem.

jmcnamara commented 2 months ago

I have an initial working version of the "constant memory" mode on the constant_memory branch. It currently has limited functionality but there is enough to allow me to benchmark the potential savings.

The memory usage profile is effectively flat (as designed):

Cells Standard - Size (MB) Constant Memory - Size (MB) Standard - Time (s) Constant Memory - Time (s)
100,000 16.213 0.021 0.101 0.088
200,000 32.405 0.021 0.214 0.179
300,000 52.794 0.021 0.335 0.276
400,000 64.793 0.021 0.443 0.369
500,000 76.792 0.021 0.564 0.468
600,000 105.569 0.021 0.673 0.564
700,000 117.567 0.021 0.768 0.669
800,000 129.567 0.021 0.874 0.799
900,000 141.566 0.021 1.002 0.862
1,000,000 153.565 0.021 1.081 1.022

Which looks like this:

memory

Similarly to the Python version the performance is also slightly better (5-15%) in this mode:

perf

The tests were run like this:

./target/release/examples/app_memory_test 4000
./target/release/examples/app_memory_test 4000 --constant-memory

So the initial results are good. I'll continue with the functionality.

jmcnamara commented 2 months ago

I've got initial format support working using the "register format" method proposed above. It is in the latest version on the branch.

However, it is brittle and there are a lot of edge cases to work around. So I'm going to take a detour and see if I can implement the shared format ids via an Arc<RwLock<HashMap>> or similar.

jmcnamara commented 2 months ago

I got the shared Format lookup functionality working via a Arc<RwLock<HashMap>>.

Now it isn't required to "register" the formats. The only change needed to a standard program is the to change workbook.add_worksheet() to workbook.add_worksheet_with_constant_memory() and everything else happens automatically.

The is a better user experience and also less error prone internally.

@adriandelgado when you get a chance could you check my RwLock implementation here: https://github.com/jmcnamara/rust_xlsxwriter/blob/constant_memory/src/workbook.rs#L2118C1-L2139C1

The function can't (as far as I can see) be called twice from the same thread so it is unlikely to deadlock. If there are a lot of read() calls it could starve a write() call. However, if that is an issue it could be fixed/worked around.

jmcnamara commented 1 month ago

Available upstream in v0.78.0.