More ideas to improve performance

adriandelgado commented 1 year ago

Using a real world app (used in my job), I identified some functions that can be optimized. These problems don't show up in the current benchmark because it doesn't use these parts of the codebase.

`*_key` functions

A way of optimizing those functions cloud be to remove them and then I following the compiler errors. The Format struct can be refactored to group the "alignment" fields into a separate struct, same with "borders" and "font". Then we can implement the Hash trait. On those structs, including the Format struct, we can then directly use those structs as keys of HashMaps. The current system needs to format and allocate huge strings composed of other huge strings.

I did it all of this on my local fork but I know this is something that you would want to do personally because it needs changes across several files. It also required some manual deriving of Hash because of an f64. Some parts of the ordered-float crate can be vendored to solve that issue.

`Worksheet::set_name`

Constructing the regex takes a lot of time. This is a very quick fix, we can use lazy_static to "memoize" the regex. However, an even better change would be to remove the regex altogether. We can use something like:

let invalid_characters = br"[]:*?/\";
if name
    .as_bytes()
    .iter()
    .any(|ch| invalid_characters.contains(ch))
{
    // ...
}

Maybe invalid_characters should be a constant, that's an stylistic choice.

adriandelgado commented 1 year ago

In the app I am working, Those two functions take approx. 60% of the time before Worksheet::save.

jmcnamara commented 1 year ago

I'll address these two separately below.

jmcnamara commented 1 year ago

`Worksheet::set_name`

At a minimum this should be enclosed in a lazy_static (side-rant: regex or rust should support a compile option for regexes). I think this is the only one I missed outside of test code (which I should probably fix as well).

However, as you point out, this doesn't need to be a regex and contains() would work just as well/better. In this case I would probably write it something like this:

        if name.contains(['[', ']', ':', '*', '?', '/', '\\']) {
            return Err(XlsxError::SheetnameContainsInvalidCharacter(
                name.to_string(),
            ));
        }
}

I'll go ahead and make that change if you don't see anything wrong with it.

BTW, how many worksheets were you adding that this showed up in your performance testing, or was it just the raw unwrapped regex that caused the issue even for a small number of worksheets.

jmcnamara commented 1 year ago

`Format::*_key` functions

This one I did think about in advance.

Some context to help the discussion. You are probably aware of some of this already from looking at the code.

When you write formatted data to an xlsx worksheet file it contains a reference to the "style" id, which is 2 in this example:

      <c r="A1" s="2">
        <v>123</v>
      </c>

This id needs to be shared across the workbook (i.e., all worksheets) to indicate a unique format. Excel also does further size/repeating optimizations to divide out shared fonts, borders, alignments so that they are only stored once.

So in summary each unique format needs to have a unique id and the sub properties need to be unique as well.

Anyway, in relation to the implementation, in all the other language implementations Python/Perl/C the Format objects are created from and owned by the parent Workbook. Whenever a format object is used its workbook-wide unique id is calculated and stored so that any subsequent call will just return the id without calculation.

However, that wasn't possible in the Rust version without severely limiting the borrowing of formats. As such in the Rust version the formats are independent (this is also to keep the door open for applying the formats separately to the data which is a frequently requested feature).

For the rust version the scheme is to convert the format to a unique worksheet-local id and then at save time remap them to worksheet-unique ids. I also need to allow for the use case when a format is used in more than one workbook and has a different format id for each.

I did envisage that this might cause a performance overhead further down the line and I did have some ideas. Also, at some point when/if people run into memory issues with large worksheets I'll need to add a contant_memory mode/feature to reduce it and that will need to have a scheme of working out formatting ids as the data is written/stored and not when the entire files is saved. This is also why there is this undocumented register_format() workbook method lying in wait:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/workbook.rs#L717

Then we can implement the Hash trait.

Performance aside I think I would like to do that anyway for cleaner/idiomatic code. There may be fields that would have to be omitted from the hash but I presume that is possible (?).

It also required some manual deriving of Hash because of an f64.

That value could be converted to and stored as a string once it passed the API. It's f64 value isn't actually required.

Some ideas I had to amortize the hash calculation:

My initial idea (in very early implementations) was to have each Worksheet have a UUID and for the format to store ids relative to that worksheet. I still may have to do something like that for the contant_memory mode feature.
Only recalculate the format object hash value if one of the properties has changed. Otherwise return the previous/current calculated value. This would mean adding something ugly like self.has_changed = true to each property setter but it would effectively reduce the hash calculations down to one per unique format. I'll probably implement something like that anyway.

BTW, what performance improvements did you get, roughly, for your changes?

adriandelgado commented 1 year ago

I'll go ahead and make that change if you don't see anything wrong with it.

+1

BTW, how many worksheets were you adding that this showed up in your performance testing, or was it just the raw unwrapped regex that caused the issue even for a small number of worksheets.

The program writes between 6 and 10 worksheets per file but it writes hundreds of files. I use 6 unique formats in total in the whole app.

adriandelgado commented 1 year ago

Performance aside I think I would like to do that anyway for cleaner/idiomatic code. There may be fields that would have to be omitted from the hash but I presume that is possible (?).

If you want to omit fields you would have to manually derive the trait (not hard to do), otherwise you can just use #[derive(Hash, PartialEq, Eq)]. Using the Hash trait is indeed the idiomatic way to use a struct as a key in a HashMap. Check out how I implemented these changes: https://github.com/adriandelgado/rust_xlsxwriter/commit/c9655f77bd74665d00e2bd45adcff002330f40d2

That value could be converted to and stored as a string once it passed the API. It's f64 value isn't actually required.

+1

Another thing to add is that I have a strong suspicion that you can join Workbook::xf_formats and Workbook::xf_indices into a single data structure after these changes, but I'm not sure about that (not familiar with the codebase enough).

adriandelgado commented 1 year ago

I didn't use hyperfine because this program does a lot of things, so I used good old Instant::elapsed and I looped around some functions 100 times (the functions before saving). I get improvements from 35% to 40% by removing the *_key functions (I didn't remove the Regex).

You can do your own testing by using this in your Cargo.toml

rust_xlsxwriter = { git = "https://github.com/adriandelgado/rust_xlsxwriter", branch = "remove-key-functions" }

jmcnamara commented 1 year ago

Worksheet::set_name

This part is fixed on main.

jmcnamara commented 1 year ago

If you want to omit fields you would have to manually derive the trait (not hard to do), otherwise you can just use #[derive(Hash, PartialEq, Eq)]. Using the Hash trait is indeed the idiomatic way to use a struct as a key in a HashMap. Check out how I implemented these changes: adriandelgado@c9655f7

Will do. Thanks.

Only recalculate the format object hash value if one of the properties has changed.

I'll probably make this change anyway and check what the performance delta is. I'll make the hash change after that.

jmcnamara commented 1 year ago

Check out how I implemented these changes: adriandelgado@c9655f7

I reviewed it and I don't think I can improve on it so please submit this as a PR and I'll merge it.

jmcnamara commented 1 year ago

Closing. Both changes are now on main. Thanks.

jmcnamara / rust_xlsxwriter