jmcnamara / rust_xlsxwriter

A Rust library for creating Excel XLSX files.
https://crates.io/crates/rust_xlsxwriter
Apache License 2.0
316 stars · 25 forks

Ideas for perf. improvements #23

Closed adriandelgado closed 1 year ago

adriandelgado commented 1 year ago

I spent the weekend benchmarking, refactoring, and optimizing the crate without changing the public API and got improvements of around 10%. However, I didn't split the changes into separate commits, so over the course of the following week I'm going to send PRs that are easier to review. Below are some of the changes I made. You can give feedback on any of them.

Throughout several files

xmlwriter.rs

worksheet.rs

format.rs

filter.rs

chart.rs

Other ideas

Using the deflate-zlib feature flag has huge performance benefits at the cost of not being pure Rust.

adriandelgado commented 1 year ago

Some of the changes aren't reflected in the benchmark because it only handles strings and numbers. It would be nice to have a more comprehensive example. This doesn't include using the deflate-zlib feature flag on the zip crate.

jmcnamara commented 1 year ago

I spent the weekend benchmarking, refactoring and optimizing the crate without changing the public API and got improvements of around 10%.

Thanks for the good work. Performance is very important to the library, and me, so I am open to these changes.

As you have probably already figured out, most of the performance gains come from the inner loop within Worksheet::write_data_table and anything in xmlwriter.rs. So those should probably be the focus of the first round of optimizations.

Some of the changes didn't get reflected on the benchmark because it only handles strings and numbers.

That is intentional because that is where performance affects most users. That is not to say that performance in other parts of the code shouldn't be fixed (especially where there are instances of genuinely bad performance), but in the first phase it is better to target optimizations that move the benchmark by ~1% or more.

One aspect that can affect users and that isn't covered by the above benchmark is writing formatted data. That would be worth looking at as well, but again after the first phase.

BTW, are you using any particular profiling tool?

I'll make some general comments on the other suggestions below.

One other comment: I prefer that a PR for anything other than a minor change starts as a conversation. That is why the contributors' guide asks that an Issue is opened before a PR. This is a good example of that; the previous clippy PR was a bad example (it contained good fixes mixed in with questionable ones and in the end was easier for me to fix myself, which was a loss of time for both of us). If you become a regular contributor/communicator then that requirement can be relaxed. But for now please start with some communication and discussion, and then the PR.

jmcnamara commented 1 year ago
  • Use the Entry API whenever possible. This avoids the .get(key) -> .insert(key, value) pattern, which looks up the key in the map twice.

Agreed. That is an idiom that I need to adopt.
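A minimal sketch of the pattern being discussed (the map and helper name here are illustrative, not the crate's actual code):

```rust
use std::collections::HashMap;

// Insert-if-absent with a single hash lookup via the Entry API,
// replacing the contains_key()/get() followed by insert() pattern,
// which hashes the key twice.
fn get_or_insert(map: &mut HashMap<String, u32>, key: &str, default: u32) -> u32 {
    *map.entry(key.to_string()).or_insert(default)
}

fn main() {
    let mut format_indices: HashMap<String, u32> = HashMap::new();

    let first = get_or_insert(&mut format_indices, "bold", 0);
    let second = get_or_insert(&mut format_indices, "bold", 99);

    assert_eq!(first, 0);
    // or_insert does not overwrite an existing value.
    assert_eq!(second, 0);
}
```

`or_insert_with(|| ...)` is the lazier variant, useful when computing the default is itself expensive.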

  • If a value's size is less than or equal to 8 bytes, it's cheaper to copy than to reference. So I removed those references.

clippy::pedantic flagged that, and there are some &u32 parameters in the inner loop. I presume this will help somewhat, but I'd like to see the perf delta.
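The point can be illustrated with a toy pair of functions (names are hypothetical): a reference to a `u32` is itself pointer-sized, so passing the reference saves nothing and adds an indirection.

```rust
// On a 64-bit target &u32 occupies 8 bytes, the same as the pointer
// it is; the referenced u32 is only 4. Passing small Copy types by
// value avoids the extra dereference in hot loops.
fn col_width_by_ref(width: &u32) -> u32 {
    *width * 7
}

fn col_width_by_value(width: u32) -> u32 {
    width * 7
}

fn main() {
    // A reference is always pointer-sized, regardless of the pointee.
    assert_eq!(std::mem::size_of::<&u32>(), std::mem::size_of::<usize>());
    assert_eq!(col_width_by_ref(&12), col_width_by_value(12));
}
```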

xmlwriter.rs

  • Changed the parameter type of attributes from &Vec<(&str, String)> to &[(&str, &str)]. This implied a LOT of changes throughout the codebase and a bit of borrow-checker fighting, but it removed a lot of unnecessary allocations in the form of .to_string(), vec! and attributes.push().

The codebase was actually like that until 4869d85ea020316c3a22f900b1af660479f24a29. I changed it because I was spending a lot of development time/code using as_str() for any attribute that could change. I didn't see any/much of a perf delta at the time.

  • Slight refactor of the escape_string function to use str::char_indices instead of .chars().enumerate().

Sounds okay.
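For context, a simplified sketch of why char_indices() matters here (this is not the crate's actual escape_string, just an illustration of the idea): char_indices() yields byte offsets that can slice the &str directly, whereas chars().enumerate() yields character counts, which can't. That lets the function return the input unchanged, with no allocation, until the first special character is found.

```rust
use std::borrow::Cow;

// Escape XML special characters, allocating only if one is present.
fn escape_xml(input: &str) -> Cow<'_, str> {
    for (i, ch) in input.char_indices() {
        if matches!(ch, '&' | '<' | '>' | '"') {
            let mut escaped = String::with_capacity(input.len() + 8);
            // `i` is a byte offset, so this slice is valid.
            escaped.push_str(&input[..i]);
            for ch in input[i..].chars() {
                match ch {
                    '&' => escaped.push_str("&amp;"),
                    '<' => escaped.push_str("&lt;"),
                    '>' => escaped.push_str("&gt;"),
                    '"' => escaped.push_str("&quot;"),
                    other => escaped.push(other),
                }
            }
            return Cow::Owned(escaped);
        }
    }
    // Common case: nothing to escape, no allocation.
    Cow::Borrowed(input)
}

fn main() {
    assert_eq!(escape_xml("plain"), "plain");
    assert_eq!(escape_xml("a<b"), "a&lt;b");
}
```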

  • Removed some unnecessary allocations inside the functions.

Sounds okay.

worksheet.rs

  • Changed Worksheet::table from a HashMap to a BTreeMap for noticeable performance improvements and needed almost no other changes to the rest of the codebase.

That is interesting. I did briefly look at doing that but forget why I didn't make the change. A BTreeMap is a better model of the worksheet data anyway, and I use it in several other places. So +1 for this.
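The key property is that a BTreeMap iterates in sorted key order for free, which matches how a worksheet is written out row by row (the value type here is a stand-in for the crate's cell data):

```rust
use std::collections::BTreeMap;

fn main() {
    // Rows keyed by row number, inserted out of order.
    let mut table: BTreeMap<u32, &str> = BTreeMap::new();
    table.insert(2, "third row");
    table.insert(0, "first row");
    table.insert(1, "second row");

    // Iteration is always in ascending key order, so serializing the
    // worksheet needs no sort step and no per-row .get() lookup,
    // unlike a HashMap, whose iteration order is arbitrary.
    let rows: Vec<u32> = table.keys().copied().collect();
    assert_eq!(rows, vec![0, 1, 2]);
}
```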

  • Using the Entry API inside Worksheet::col_to_name had noticeable perf. improvements.

Anything there will have a positive improvement everywhere. +1

  • Changed use of HashSet to .sort_unstable() -> .dedup() inside process_pagebreaks.

This is an example of a change that almost isn't worth it, since set_page_breaks() isn't used very often and isn't on the critical path. However, I am open to better idiomatic changes if they make sense. This should probably be a phase 2 item though.
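The technique itself is simple: for deduplicating a small list of integers, sorting in place and dropping adjacent duplicates avoids the hashing and second collection a HashSet would cost (the values below are made-up page break rows):

```rust
fn main() {
    // User-supplied page break rows, possibly with duplicates.
    let mut breaks = vec![20, 40, 20, 60, 40];

    // sort_unstable() is fine for integers (no equal-element ordering
    // to preserve); dedup() then removes consecutive duplicates.
    breaks.sort_unstable();
    breaks.dedup();

    assert_eq!(breaks, vec![20, 40, 60]);
}
```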

  • Refactor the Worksheet::get_cache_data function. This didn't have perf. implications but it reduced the heavy indentation and the lines of code without sacrificing clarity.

I'll look at the change but it is probably a Phase 2 item.

  • Refactor the Worksheet::write_data_table and Worksheet::calculate_spans. We don't need to index the BTreeMaps because they are ordered already. We can cheaply iterate. This had very noticeable perf. improvements.

+1

  • Change Option<&String> to Option<&str> in the Worksheet::write_row function.

Sounds okay.

  • Removed some unnecessary allocations inside the write_*_cell functions.

+1

  • Bribed the borrow checker to remove some of the .clone().iter() and just directly iterate.

I hate having to use that .clone().iter() syntax but I get caught with interprocedural conflicts from time to time. As far as I remember none of these are on the fast/critical path. I'm open to looking at a proposal for one case first to see if it is worth while.

  • Changed the CellType enum. Now it uses Box<str> instead of String. Before: 80 bytes, 8 alignment. After: 56 bytes, 8 alignment. This improves cache locality and allows the compiler to better optimize its use. This had noticeable perf. improvements and needed almost no other changes to the rest of the codebase.

That sounds like the type of performance improvement I like/understand. :-) +1
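The size difference comes from String carrying a (pointer, capacity, length) triple while Box<str> is just (pointer, length), since the boxed string can no longer grow. Shrinking the largest variant shrinks every cell stored in the table:

```rust
fn main() {
    let word = std::mem::size_of::<usize>();

    // String = ptr + capacity + len; Box<str> = ptr + len.
    assert_eq!(std::mem::size_of::<String>(), 3 * word);
    assert_eq!(std::mem::size_of::<Box<str>>(), 2 * word);

    // Conversion is a one-liner at the point of storage.
    let s: Box<str> = "hello".to_string().into_boxed_str();
    assert_eq!(&*s, "hello");
}
```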

format.rs

  • Changed the output type of XlsxColor::chart_scheme from (String, u32, u32) to (&'static str, u32, u32)

Probably not worth it.

filter.rs

  • Changed the output type of FilterCriteria::operator from String to &'static str

Probably not worth it.

chart.rs

  • Added an as_str method that outputs an &'static str to ChartMarkerType, ChartAxisPosition, ChartGrouping, ChartLegendPosition, ChartLineDashType, ChartPatternFillType. Then I refactored the body of the ToString::to_string implementations to self.as_str().to_string(). I think it would be preferable to delete the ToString::to_string implementations but technically those are part of the public API.

I'd need to see this change but it's probably not worth it.
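For reference, the proposed pattern might look like this on a pared-down enum (the variant list is abbreviated, and the Display impl stands in for the existing ToString implementations, which can delegate to it):

```rust
enum ChartMarkerType {
    Square,
    Diamond,
}

impl ChartMarkerType {
    // Returns a static string: call sites that only need a &str
    // (e.g. when assembling XML attributes) allocate nothing.
    fn as_str(&self) -> &'static str {
        match self {
            ChartMarkerType::Square => "square",
            ChartMarkerType::Diamond => "diamond",
        }
    }
}

// The public to_string() behavior is preserved by delegating to as_str().
impl std::fmt::Display for ChartMarkerType {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(self.as_str())
    }
}

fn main() {
    assert_eq!(ChartMarkerType::Square.as_str(), "square");
    assert_eq!(ChartMarkerType::Diamond.to_string(), "diamond");
}
```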

Other ideas

Using the deflate-zlib feature flag has huge performance benefits at the cost of not being pure Rust.

+2 especially if it can be done as a feature in rust_xlsxwriter.

adriandelgado commented 1 year ago

I use flamegraph to profile. There are more advanced tools but maybe for later. I'm following the advice of the perf book.

jmcnamara commented 1 year ago
  • Changed the parameter type of attributes from &Vec<(&str, String)> to &[(&str, &str)]. This implied a LOT of changes throughout the codebase and a bit of borrow-checker fighting, but it removed a lot of unnecessary allocations in the form of .to_string(), vec! and attributes.push().

I'm going to tackle the &Vec<(&str, String)> to &[(&str, &str)] change myself since, as you say, it will cause a lot of churn. I presume you also intended to change any of the non-mutable let attributes = vec![..] to let attributes = [...] as well.

adriandelgado commented 1 year ago

I presume you also intended to change any of the non-mutable let attributes = vec![..] to let attributes = [...] as well.

Yep. That's how it already is in my laptop's fork of the repo. I warn you it's going to be a lot of changes, but no problem if you want to do it.

jmcnamara commented 1 year ago

I warn you it's going to be a lot of changes, but no problem if you want to do it.

I'm most of the way through it. I see what you meant about the ToString implementations. They aren't very useful when I need to turn them back into &strs again. So that will require some refactoring as well. Anyway.

jmcnamara commented 1 year ago

@adriandelgado

I'm starting to remember why I moved away from &[(&str, &str)] in the first place. :-) I keep needing to create a temp String variable in order to use as_str() on it. Is this really worth it from a performance point of view?

adriandelgado commented 1 year ago

Is this really worth it from a performance point of view?

Not in the current benchmark; the improvements happen in other functions. There were lots of places where I saw several .to_string() calls. I'm sure there are fewer .as_str() calls added than .to_string() calls removed. Most of the attributes are static strs.

Maybe this is a change we can postpone for when the crate has feature parity with the python one. I understand if you don't want to implement this change while the code is still in flux.

jmcnamara commented 1 year ago

I understand if you don't want to implement this change while the code is still in flux.

If I am going to do it, at all, then I will have to do it now. It will get harder with time.

There were lots of places where I saw several .to_string() calls. I'm sure there are fewer .as_str() calls added

The problem is that something like this:

    fn write_page_margins(&mut self) {
        let attributes = vec![
            ("b", "0.75".to_string()),
            ("l", "0.7".to_string()),
            ("r", "0.7".to_string()),
            ("t", "0.75".to_string()),
            ("header", "0.3".to_string()),
            ("footer", "0.3".to_string()),
        ];

        self.writer.xml_empty_tag_attr("c:pageMargins", &attributes);
    }

Looks wasteful compared to:

    fn write_page_margins(&mut self) {
        let attributes = vec![
            ("b", "0.75"),
            ("l", "0.7"),
            ("r", "0.7"),
            ("t", "0.75"),
            ("header", "0.3"),
            ("footer", "0.3"),
        ];

        self.writer.xml_empty_tag_attr("c:pageMargins", &attributes);
    }

However, as the API fills out these variables need to be dynamic and the first one becomes something like:

    fn write_page_margins(&mut self) {
        let attributes = vec![
            ("left", self.margin_left.to_string()),
            ("right", self.margin_right.to_string()),
            ("top", self.margin_top.to_string()),
            ("bottom", self.margin_bottom.to_string()),
            ("header", self.margin_header.to_string()),
            ("footer", self.margin_footer.to_string()),
        ];

        self.writer.xml_empty_tag_attr("pageMargins", &attributes);
    }

Whereas the second one would need to become something like this:

    fn write_page_margins(&mut self) {
        let margin_left = self.margin_left.to_string();
        let margin_right = self.margin_right.to_string();
        let margin_top = self.margin_top.to_string();
        let margin_bottom = self.margin_bottom.to_string();
        let margin_header = self.margin_header.to_string();
        let margin_footer = self.margin_footer.to_string();

        let attributes = vec![
            ("left", margin_left.as_str()),
            ("right", margin_right.as_str()),
            ("top", margin_top.as_str()),
            ("bottom", margin_bottom.as_str()),
            ("header", margin_header.as_str()),
            ("footer", margin_footer.as_str()),
        ];

        self.writer.xml_empty_tag_attr("pageMargins", &attributes);
    }

Which now has the same number of allocations and is more cumbersome to write.

There will always be a certain number of "static" attribute values, but I'm starting to think (again) that it is a lesser evil to allocate them for the benefit of easier development. Am I missing something obvious? Would you care to share this part of your PR for comparison?

adriandelgado commented 1 year ago

However, as the API fills out these variables need to be dynamic

That is something I didn't consider. I had the impression the change was worth it because most values were static. You've convinced me to leave it as &[(&str, String)]. It would be nice to leave a comment explaining the reasoning for anyone else trying to optimize that part of the code.

However... those vec! can be turned into arrays in a lot of places.

jmcnamara commented 1 year ago

However... those vec! can be turned into arrays in a lot of places.

Agreed. I'll definitely do that.
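The change is mechanical: a fixed set of attributes doesn't need a heap-allocated Vec, because a stack array coerces to the same slice type at the call site (the helper below is a stand-in for the xmlwriter functions):

```rust
// Stand-in for a writer function that only needs a slice of attributes.
fn count_attrs(attrs: &[(&str, &str)]) -> usize {
    attrs.len()
}

fn main() {
    // Before: let attributes = vec![...]; allocates on the heap.
    // After: a plain array lives on the stack and auto-derefs to &[_].
    let attributes = [("b", "0.75"), ("l", "0.7"), ("r", "0.7")];

    assert_eq!(count_attrs(&attributes), 3);
}
```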

adriandelgado commented 1 year ago

I also have some ideas for changes that reduce allocations while improving the dev experience. I'll propose those in a month, maybe. That String will turn into &str eventually, haha.

I'm using this library at work and we are thankful for your good work!

jmcnamara commented 1 year ago

That String will turn into &str eventually haha.

Lol. It is the way.

I'm using this library at work and we are thankful for your good work!

That's great. Thank you for the optimizations and the more idiomatic code.

jmcnamara commented 1 year ago

Using the deflate-zlib feature flag has huge performance benefits at the cost of not being pure Rust.

Wow. I just tried this and it is a huge boost:

Benchmark 1: ./app_perf_test_prev
  Time (mean ± σ):     232.9 ms ±   1.8 ms    [User: 216.3 ms, System: 14.5 ms]
  Range (min … max):   231.3 ms … 236.6 ms    12 runs

Benchmark 2: target/release/examples/app_perf_test
  Time (mean ± σ):     144.4 ms ±   2.7 ms    [User: 127.9 ms, System: 14.2 ms]
  Range (min … max):   141.0 ms … 153.7 ms    19 runs

Summary
  'target/release/examples/app_perf_test' ran
    1.61 ± 0.03 times faster than './app_perf_test_prev'

Which makes it faster than the C version (which also uses zlib):

Benchmark 1: ../perf/c_perf_test
  Time (mean ± σ):     196.0 ms ±   1.3 ms    [User: 163.4 ms, System: 26.9 ms]
  Range (min … max):   194.4 ms … 198.0 ms    15 runs

Benchmark 2: target/release/examples/app_perf_test
  Time (mean ± σ):     143.9 ms ±   3.0 ms    [User: 127.6 ms, System: 13.8 ms]
  Range (min … max):   140.7 ms … 155.3 ms    20 runs

Summary
  'target/release/examples/app_perf_test' ran
    1.36 ± 0.03 times faster than '../perf/c_perf_test'

Wow.

adriandelgado commented 1 year ago

That means this crate is now officially the fastest way to write xlsx files! 🚀🚀🚀

jmcnamara commented 1 year ago

That means this crate is now officially the fastest way to write xlsx files!

Lol. It probably means I should be optimizing the C library more. ;-)

I added a cargo feature to rust_xlsxwriter that optionally enables the zip crate's "deflate-zlib" feature under a flag of the same name. For example, here is a default compilation and a "deflate-zlib" compilation using the current main:


$ cargo build --release --example app_perf_test
$ mv target/release/examples/app_perf_test app_perf_test_no_zlib

$ cargo build --release --example app_perf_test --features deflate-zlib
$ mv target/release/examples/app_perf_test app_perf_test_with_zlib

$ hyperfine ./app_perf_test_with_zlib ./app_perf_test_no_zlib  --warmup 3

Benchmark 1: ./app_perf_test_with_zlib
  Time (mean ± σ):     141.5 ms ±   1.8 ms    [User: 127.0 ms, System: 12.8 ms]
  Range (min … max):   138.4 ms … 145.8 ms    20 runs

Benchmark 2: ./app_perf_test_no_zlib
  Time (mean ± σ):     230.3 ms ±   1.7 ms    [User: 215.4 ms, System: 12.7 ms]
  Range (min … max):   227.4 ms … 232.5 ms    12 runs

Summary
  './app_perf_test_with_zlib' ran
    1.63 ± 0.02 times faster than './app_perf_test_no_zlib'

I might rename the rust_xlsxwriter feature to faster-zip or something like that to make it more obvious to the end users.

adriandelgado commented 1 year ago

I might rename the rust_xlsxwriter feature to faster-zip or something like that

The idiomatic way would be to name it "zlib" or something similar. That's why a bunch of crates have a "serde" feature or a "chrono" or "rayon". The user of the crate would know which extra dependency is added.

Then, in the main doc page and in the readme you can have a section called "Features", explaining each of its purposes.

jmcnamara commented 1 year ago

One interesting thing to note is that the deflate-zlib speedup is almost negligible for small files:

$ hyperfine ./app_chart_column_with_zlib ./app_chart_column_no_zlib  --warmup 3
Benchmark 1: ./app_chart_column_with_zlib
  Time (mean ± σ):       8.4 ms ±   0.5 ms    [User: 4.2 ms, System: 3.2 ms]
  Range (min … max):     7.6 ms …  10.5 ms    244 runs

Benchmark 2: ./app_chart_column_no_zlib
  Time (mean ± σ):       8.6 ms ±   0.7 ms    [User: 4.3 ms, System: 3.3 ms]
  Range (min … max):     7.6 ms …  12.3 ms    235 runs

Summary
  './app_chart_column_with_zlib' ran
    1.03 ± 0.10 times faster than './app_chart_column_no_zlib'

jmcnamara commented 1 year ago

I'm also having some ideas for some changes that reduce allocations while also improving the dev. experience.

One straightforward workaround would be to have a separate set of functions (there aren't many) that accept &[(&str, &str)]. That way I can start development with stack-based string literals and then move to Strings as needed. Or, performance allowing, we could use an Into-style trait for the attributes parameter that takes either &[(&str, &str)] or &[(&str, String)], and that way I can move seamlessly from &str to String.
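One way to sketch the "either borrowed or owned" idea is with Cow<str> as the attribute value (the function name echoes the thread; its exact signature here is an assumption, not the crate's API):

```rust
use std::borrow::Cow;

// Hypothetical writer: accepts attribute values that are either
// static literals (no allocation) or computed Strings (one allocation).
fn xml_empty_tag_attr(tag: &str, attributes: &[(&str, Cow<'_, str>)]) -> String {
    let mut out = format!("<{tag}");
    for (name, value) in attributes {
        out.push_str(&format!(" {name}=\"{value}\""));
    }
    out.push_str("/>");
    out
}

fn main() {
    let margin = 0.75_f64;
    let attributes = [
        // Static literal: borrowed, no heap allocation.
        ("b", Cow::Borrowed("0.7")),
        // Dynamic value: allocated exactly once, owned by the slice.
        ("t", Cow::Owned(margin.to_string())),
    ];
    assert_eq!(
        xml_empty_tag_attr("pageMargins", &attributes),
        r#"<pageMargins b="0.7" t="0.75"/>"#
    );
}
```

A From/Into conversion on a small wrapper type would hide the Cow variants from call sites entirely, at the cost of a little generic plumbing.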

jmcnamara commented 1 year ago

Closing the overall issue since the majority of the features are in v0.34.0 on crates.io.

jmcnamara commented 1 year ago

@adriandelgado

Or, performance allowing, we could use an Into-style trait for the attributes parameter that takes either &[(&str, &str)] or &[(&str, String)], and that way I can move seamlessly from &str to String.

I went ahead and implemented this and the performance is the same as previously. I added a performance file that doesn't write any data but has a fair amount of XML generation, to try to exercise xmlwriter.rs a bit more.

I then converted all the non-dynamic "str".to_string() allocations to a static string literal. That gave the same performance but it is cleaner and easier on the code-reviewing eye.

I reinstated some of the write_all() vs write!() changes you proposed. The performance was still the same but on balance it is probably better. At least it suggests we/you thought about it. :-)

Finally I removed the reserve(approx_len) calls since they didn't seem to affect performance one way or the other. The underlying "file" is a Cursor with a 2k preallocation so it probably wasn't needed.

adriandelgado commented 1 year ago

Makes sense. That's why they say intuitions around performance are always wrong haha.

Thank you for taking time to deeply examine those proposals.

jmcnamara commented 1 year ago

Note to self. Tried the following with no significant improvement:

diff --git a/src/worksheet.rs b/src/worksheet.rs
index 3124f0b..313fa70 100644
--- a/src/worksheet.rs
+++ b/src/worksheet.rs
@@ -46,6 +46,7 @@ const MAX_PARAMETER_LEN: usize = 255;
 const DEFAULT_COL_WIDTH: f64 = 8.43;
 const DEFAULT_ROW_HEIGHT: f64 = 15.0;
 pub(crate) const NUM_IMAGE_FORMATS: usize = 5;
+const COLUMN_LETTERS: &str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

 /// The Worksheet struct represents an Excel worksheet. It handles operations
 /// such as writing data to cells or formatting the worksheet layout.
@@ -7700,9 +7701,13 @@ impl Worksheet {

     // Cached/faster version of utility.col_to_name() to use in the inner loop.
     fn col_to_name(col_names: &mut HashMap<u16, String>, col_num: ColNum) -> &str {
-        col_names
-            .entry(col_num)
-            .or_insert_with(|| utility::col_to_name(col_num))
+        if col_num < 26 {
+            &COLUMN_LETTERS[col_num as usize..(col_num + 1) as usize]
+        } else {
+            col_names
+                .entry(col_num)
+                .or_insert_with(|| utility::col_to_name(col_num))
+        }
     }

     // Store local copies of unique formats passed to the write methods. These