jmcnamara / rust_xlsxwriter

A Rust library for creating Excel XLSX files.
https://crates.io/crates/rust_xlsxwriter
Apache License 2.0
324 stars 26 forks source link

Help needed: input on "constant_memory" mode #111

Open jmcnamara opened 5 days ago

jmcnamara commented 5 days ago

Constant Memory mode

The Python version of xlsxwriter has a constant_memory mode that limits the amount of data stored in memory.

The optimization works by flushing each row to disk after a subsequent row is written. In this way the largest amount of data held in memory for a worksheet is the amount of data required to hold a single row of data.

When the overall worksheet file is written the on-disk data is copied in at the correct point in the file.

The tables below shows the advantage of this approach.

Constant Memory performance data

XlsxWriter in normal operation mode: the execution time and memory usage increase more or less linearly with the number of rows:

Rows Columns Time (s) Memory (bytes)
200 50 0.43 2346728
400 50 0.84 4670904
800 50 1.68 8325928
1600 50 3.39 17855192
3200 50 6.82 32279672
6400 50 13.66 64862232
12800 50 27.60 128851880

Note, the rust_xlsxwriter memory usage is lower but follows a similar linear trend.

XlsxWriter in constant_memory mode: the execution time still increases linearly with the number of rows but the memory usage remains small and constant:

Rows Columns Time (s) Memory (bytes)
200 50 0.37 62208
400 50 0.74 62208
800 50 1.46 62208
1600 50 2.93 62208
3200 50 5.90 62208
6400 50 11.84 62208
12800 50 23.63 62208

In constant_memory mode the performance should be approximately the same as normal mode.

Limitation of Constant Memory mode

The constant_memory mode is intended for writing large data sets efficiently but it has some limitations:

How it works in practice

Consider the following worksheet with 3 strings, one of which (Pear) is in bold, and 2 numbers. The numbers are both 1.23456 but the second one is formatted to 0.00:

screenshot

Here is what the XML of the worksheet looks like. Horizontal and vertical whitespace has been added for clarity.

<worksheet xmlns="..." xmlns:r="...s">
  <dimension ref="B2:D6"/>
  <sheetViews>
    <sheetView tabSelected="1" workbookViewId="0"/>
  </sheetViews>
  <sheetFormatPr defaultRowHeight="15"/>

  <sheetData>
    <row r="2" spans="2:4">
      <c r="B2" t="s">
        <v>0</v>
      </c>
      <c r="D2">
        <v>1.23456</v>
      </c>
    </row>

    <row r="3" spans="2:4">
      <c r="D3" s="1">
        <v>1.23456</v>
      </c>
    </row>

    <row r="4" spans="2:4">
      <c r="B4" s="2" t="s">
        <v>1</v>
      </c>
    </row>

    <row r="6" spans="2:4">
      <c r="B6" t="s">
        <v>2</v>
      </c>
    </row>

  </sheetData>

  <pageMargins/>
</worksheet>

Some things to notice here:

The issue with writing this information in a row by row method is that there are two "global" pieces of information that need to be known at the time of writing: the string <v> index and the formatted cells s="" attribute format index.

For example, each unique format in the workbook (with a unique combination of properties such as bold, italic, number format, color) must have a unique ID (1 and 2 in the example above). This must be known at the time of writing the data so that the it can be added to the cell XML element like the s="1" (style == 1) in the above example:

      <c r="D3" s="1">
        <v>1.23456</v>
      </c>

The Python library resolves these issues in two ways. For the string data it uses another Excel cell type called inlineStr to store string data without a lookup. The output xml in that case would look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="..." xmlns:r="...">
  <dimension ref="B2:D6"/>
  <sheetViews>
    <sheetView tabSelected="1" workbookViewId="0"/>
  </sheetViews>
  <sheetFormatPr defaultRowHeight="15"/>

  <sheetData>
    <row r="2">
      <c r="B2" t="inlineStr">
        <is>
          <t>Apple</t>
        </is>
      </c>
      <c r="D2">
        <v>1.23456</v>
      </c>
    </row>

    <row r="3">
      <c r="D3" s="1">
        <v>1.23456</v>
      </c>
    </row>

    <row r="4">
      <c r="B4" s="2" t="inlineStr">
        <is>
          <t>Pear</t>
        </is>
      </c>
    </row>

    <row r="6">
      <c r="B6" t="inlineStr">
        <is>
          <t>Orange</t>
        </is>
      </c>
    </row>
  </sheetData>

  <pageMargins/>
</worksheet>

The format index is obtained via a workbook global lookup when the format is created like this:

bold = workbook.add_format({"bold": True})
num_format = workbook.add_format({"num_format": "0.00"})

Note that the formats are created by the "Workbook" and therefore their index can be generated uniquely at creation time.

As an aside, the string indexes could also be handled like this but since there was the inlineStr cell type was available that was used instead.

The problem in rust_xlsxwriter

The inline string method described in the previous section can also be used with rust_xlsxwriter. However, Format objects aren't created or owned by a Workbook object so that presents a challenge.

For comparison the file above would be created using the current API like this:

// Current API and behaviour.
use rust_xlsxwriter::{Format, Workbook, XlsxError};

fn main() -> Result<(), XlsxError> {
    // Create a new Excel file object.
    let mut workbook = Workbook::new();

    // Add some formats.
    let bold = Format::new().set_bold();
    let num_format = Format::new().set_num_format("0.00");

    // Add a worksheet to the workbook.
    let worksheet = workbook.add_worksheet();

    // Write some data to the worksheet.
    worksheet.write(1, 1, "Apple")?;
    worksheet.write(1, 3, 1.23456)?;

    worksheet.write_with_format(2, 3, 1.23456, &num_format)?;

    worksheet.write_with_format(3, 1, "Pear", &bold)?;

    worksheet.write(5, 1, "Orange")?;

    // Save the file to disk.
    workbook.save("test.xlsx")?;

    Ok(())
}

The scheme I intend to use is to require that formats are "registered" with a workbook, which will give them a unique ID that that can referred to when writing the format attribute.

Something like this:

// PROPOSED OPTIONAL API.
use rust_xlsxwriter::{Format, Workbook, XlsxError};

fn main() -> Result<(), XlsxError> {
    // Create a new Excel file object.
    let mut workbook = Workbook::new();

    // Add some formats.
    let bold = Format::new().set_bold();
    let num_format = Format::new().set_num_format("0.00");

    // Add a worksheet to the workbook.
    let mut worksheet = workbook.add_worksheet();

    // NEW.
    // Set the worksheet constant memory mode.
    worksheet.use_constant_memory_mode();

    // NEW.
    // Register the formats before using them.
    workbook.register_format(&num_format); // Gets an ID of 1.
    workbook.register_format(&bold);       // Gets an ID of 2.

    // Write some data to the worksheet.
    worksheet.write(1, 1, "Apple")?;
    worksheet.write(1, 3, 1.23456)?;

    worksheet.write_with_format(2, 3, 1.23456, &num_format)?;

    worksheet.write_with_format(3, 1, "Pear", &bold)?;

    worksheet.write(5, 1, "Orange")?;

    // Save the file to disk.
    workbook.save("test.xlsx")?;

    Ok(())
}

Trying to write data with an unregistered format would raise an XlsxError.

Help needed

If you have read this far, thanks. I'm looking for input for other ways of handling this.

One possible method would be to use an on-disk version of the internal BTreeMap container that stores all the worksheet cell information. This would probably be the least disruptive solution and would eliminate the need to work around the Table/Merged Range limitations listed above. There seems to be some existing implementations such as:

Another solution might be storing the cell information in an SQL/LITE DB.

Any other thoughts, suggestions?

zachbateman commented 4 days ago

First, thank you for all the work you put into this project! rust_xlsxwriter is a fantastic crate.

Most of the discussion here is over my head, but I'd prefer the most simple solution.

Your workbook.register_format(&format1); method seems like it would be okay but error prone with users likely to miss registering a format at times. Would not registering a format result in some form of error/feedback or just not apply the format in the file?

What if all formats were registered at one spot where the workbook is saved? Including the formats in the save call would remind users about this step and keep it confined to one location. Something like:

// If not using/registering any formats
workbook.save_constant_memory(path, None);

// If using formats, register them all at once
workbook.save_constant_memory(path, Some(vec![&format1, &format2, ...]));

Or, what if the Workbook object did keep track of used Format objects so that whenever a Format is applied, the Workbook internally registers it and uses the same id for matching formats?

jmcnamara commented 4 days ago

@zachbateman Thanks for the feedback and the questions. First let me clarify one important part that I left out of the explanation above. I'll update the introduction with the new information.

Each unique format in the workbook (with a unique combination of properties such as bold, italic, number format, color) must have a unique ID. And, in the "constant_memory" mode case that unique ID must be known at the time of writing the data so that the it can be added to the cell XML element like the s="1" (style == 1) in the above example:

      <c r="D3" s="1">
        <v>1.23456</v>
      </c>

Therefore in the case where the data is written when the user calls write() the format ID can't be generated when the Format is created (since the final properties of the format and it uniqueness are unknown) or at save() since that is too late. It is also worth noting that calculating the IDs at save() time is what the library does at the moment in the non "constant memory" case.

Would not registering a format result in some form of error/feedback or just not apply the format in the file?

Yes! Trying to call write_with_format() with an unregistered format would raise an error.

I'll update the discussion above with some examples to help clarify how it would work in practice.

han1548772930 commented 4 days ago

The registration method is good. If you don't need to order and the key is a number, you can use nohash-hasher, which should be faster.

jmcnamara commented 4 days ago

@han1548772930

If you don't need to order and the key is a number, you can use nohash-hasher, which should be faster.

Thanks for that. The hash key needs to be made up from all the format properties so the simple no_hash method won't work. Anyway, there is already a Format hash function in the code and it is fairly efficient (even though it won't be called very often in this use case):

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/format.rs#L454

The register_format() was also in the code base but I removed it last year. So that part is also ready to go:

https://github.com/jmcnamara/rust_xlsxwriter/commit/dcc3a3d084efd46e316188c80ef63226c7d67a42#diff-4ea682f1327d5715d346afbd6ceac182cb6c8c92ada03a8cc995255b7253164dL715-L737

In fact most of the internal plumbing is there. However, before going down that route (which will work since the Python/C/Perl version use it) I'm sounding people out in case there is a better way of handling this.

zachbateman commented 3 days ago

Using register_format() and raising an error that clearly describes the issue while in constant_memory mode seems good.

For me, this would be an optimization that I'd use if specifically needed for a large file, and in that case I'd already be looking deeper into the docs. Adding register_format() calls that are checked with clear error messages wouldn't be a problem.