jmcnamara / rust_xlsxwriter

A Rust library for creating Excel XLSX files.
https://crates.io/crates/rust_xlsxwriter
Apache License 2.0
330 stars, 26 forks

feature request: Can use multithreading? #49

Closed han1548772930 closed 1 year ago

han1548772930 commented 1 year ago

Feature Request

First of all, I would like to thank the author for providing a very useful library. Is it possible to speed up the export by using multithreading?

han1548772930 commented 1 year ago

I tried using multithreading to handle this, but I found it to be slower than single threading.


use std::{
    sync::{Arc, Mutex},
    thread,
    time::{SystemTime, UNIX_EPOCH},
};

use rust_xlsxwriter::*;

fn main() {
    let workbook = Workbook::new();
    let workbook_arc = Arc::new(Mutex::new(workbook));
    workbook_arc.lock().unwrap().add_worksheet();

    let mut time = timestamp1();
    println!("start {:?}", time);
    let mut handles = vec![];
    for i in 1..6u32 {
        let workbook_clone = workbook_arc.clone();
        let handle = thread::spawn(move || {
            // NOTE: the lock is held for the entire closure, so the
            // threads actually run one after the other.
            let mut workbook = workbook_clone.lock().unwrap();
            let sheet = workbook.worksheet_from_index(0).unwrap();
            for j in (i - 1) * 209715..i * 209715 {
                for col in 1..=12u16 {
                    sheet.write_string(j, col, "Hello, World!").unwrap();
                }
            }
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let mut workbook = workbook_arc.lock().unwrap();
    workbook.save("demo.xlsx").unwrap();
    time = timestamp1();
    println!("end {:?}", time);
}

fn timestamp1() -> i64 {
    // Current time in milliseconds since the Unix epoch.
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("Time went backwards")
        .as_millis() as i64
}
jmcnamara commented 1 year ago

I'll add multi-threading to the back end in the next release or two.

The library is probably IO bound rather than CPU bound, so multi-threading may not give a linear benefit. Nonetheless I'll implement it to get whatever benefit is possible.

@adriandelgado Any suggestions to the OP on multi-threading in the front end/user app?

adriandelgado commented 1 year ago

Multithreading is only useful for massive worksheets.

I also recommend not using a Mutex. You can generate each Worksheet on a separate thread and then join them together using push_worksheet.

han1548772930 commented 1 year ago

I tried both of these methods and still got timings similar to single threading.

fn main() {
    let workbook = Workbook::new();
    let workbook_arc = Arc::new(Mutex::new(workbook));

    let mut time = timestamp1();
    println!("start {:?}", time);
    let mut handles = vec![];
    for _ in 0..4 {
        let workbook_clone = workbook_arc.clone();
        let handle = thread::spawn(move || {
            // NOTE: the lock is taken before the worksheet is built, so
            // the threads still run serially. Locking only around the
            // final push_worksheet() call would let the builds overlap.
            let mut workbook = workbook_clone.lock().unwrap();
            let mut sheet = Worksheet::new();
            for j in 0..1048576u32 {
                for col in 1..=12u16 {
                    sheet.write_string(j, col, "Hello, World!").unwrap();
                }
            }
            workbook.push_worksheet(sheet);
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let mut workbook = workbook_arc.lock().unwrap();
    workbook.save("demo.xlsx").unwrap();
    time = timestamp1();
    println!("end {:?}", time);
}

// timestamp1() is the same helper as in the previous example.
fn main() {
    task::block_on(async {
        let mut time = timestamp1();
        println!("start: {:?}", time);
        let mut workbook = Workbook::new();

        let res = async_main().await;
        workbook.push_worksheet(res.0);
        workbook.push_worksheet(res.1);
        workbook.push_worksheet(res.2);
        workbook.push_worksheet(res.3);
        workbook.save("demo.xlsx").unwrap();
        time = timestamp1();
        println!("end: {:?}", time);
    });
}

async fn async_main() -> (Worksheet, Worksheet, Worksheet, Worksheet) {
    // NOTE: write_data() contains no .await points, so join! polls each
    // future to completion in turn. This runs sequentially on a single
    // thread, not in parallel.
    let f1 = write_data();
    let f2 = write_data();
    let f3 = write_data();
    let f4 = write_data();
    futures::join!(f1, f2, f3, f4)
}

// timestamp1() is the same helper as in the first example.

async fn write_data() -> Worksheet {
    let mut sheet = Worksheet::new();
    for j in 1..1048576u32 {
        for col in 0..12u16 {
            sheet.write_string(j, col, "Hello, World!").unwrap();
        }
    }
    sheet
}
han1548772930 commented 1 year ago

After some testing, I found that writing the data with write_string is very fast, but the save takes a long time. Is it possible to make save_internal asynchronous?

jmcnamara commented 1 year ago

Is it possible to make save_internal asynchronous?

That is the plan.

I think the highest value bottleneck for parallelism would be the worksheet writing loop in packager.rs:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/packager.rs#L104-L110

        let mut string_table = SharedStringsTable::new();
        for (index, worksheet) in workbook.worksheets.iter_mut().enumerate() {
            self.write_worksheet_file(worksheet, index + 1, &mut string_table)?;
            if worksheet.has_relationships() {
                self.write_worksheet_rels_file(worksheet, index + 1)?;
            }
        }

The tricky part would be synchronizing (via a mutex or some other scheme) the updates to the shared string table, which maps strings to index values using Excel's shared string scheme.

The self.write_worksheet_rels_file() part could probably move to a non-threaded loop.

@adriandelgado pointed out in #29 that there could be a lot of value in parallelising the zip writing. I don't know if that will be possible using the current zip crate.

jmcnamara commented 1 year ago

I've made a first pass at introducing threading into the back end of rust_xlsxwriter. The preliminary work is on the threaded1 branch. Some notes on this:

On the threaded1 branch there are 3 test cases:

  1. examples/app_perf_test: Single worksheet with mixed string and number values.
  2. examples/app_perf_test2: 4 worksheets with string data only.
  3. examples/app_perf_test3: 4 worksheets with number data only.

From this I get mixed results:


$ hyperfine target/release/examples/app_perf_test_threaded target/release/examples/app_perf_test_unthreaded --warmup 3
Benchmark 1: target/release/examples/app_perf_test_threaded
  Time (mean ± σ):     244.8 ms ±  12.4 ms    [User: 221.3 ms, System: 16.9 ms]
  Range (min … max):   238.2 ms … 280.0 ms    12 runs

Benchmark 2: target/release/examples/app_perf_test_unthreaded
  Time (mean ± σ):     237.3 ms ±   1.1 ms    [User: 218.9 ms, System: 16.8 ms]
  Range (min … max):   235.6 ms … 239.5 ms    12 runs

Summary
  'target/release/examples/app_perf_test_unthreaded' ran
    1.03 ± 0.05 times faster than 'target/release/examples/app_perf_test_threaded'

$ hyperfine target/release/examples/app_perf_test2_threaded target/release/examples/app_perf_test2_unthreaded --warmup 3
Benchmark 1: target/release/examples/app_perf_test2_threaded
  Time (mean ± σ):      1.261 s ±  0.011 s    [User: 1.184 s, System: 0.905 s]
  Range (min … max):    1.247 s …  1.283 s    10 runs

Benchmark 2: target/release/examples/app_perf_test2_unthreaded
  Time (mean ± σ):     986.1 ms ±   6.9 ms    [User: 916.6 ms, System: 66.1 ms]
  Range (min … max):   977.7 ms … 997.0 ms    10 runs

Summary
  'target/release/examples/app_perf_test2_unthreaded' ran
    1.28 ± 0.01 times faster than 'target/release/examples/app_perf_test2_threaded'

$ hyperfine target/release/examples/app_perf_test3_threaded target/release/examples/app_perf_test3_unthreaded --warmup 3
Benchmark 1: target/release/examples/app_perf_test3_threaded
  Time (mean ± σ):     778.6 ms ±  20.2 ms    [User: 837.8 ms, System: 54.2 ms]
  Range (min … max):   766.5 ms … 832.6 ms    10 runs

Benchmark 2: target/release/examples/app_perf_test3_unthreaded
  Time (mean ± σ):     889.2 ms ±   4.1 ms    [User: 834.7 ms, System: 52.0 ms]
  Range (min … max):   884.7 ms … 895.8 ms    10 runs

Summary
  'target/release/examples/app_perf_test3_threaded' ran
    1.14 ± 0.03 times faster than 'target/release/examples/app_perf_test3_unthreaded'

Some observations from this: the string-only case (test2), which contends on the shared string table mutex, is actually slower threaded, while the number-only case (test3), which never touches the SST, gets a speedup.

There are some options to remove the mutex lock and contention:

  1. Do a separate non-threaded pass of all the worksheet string data to build up the SST table.
  2. Ignore the mutex and do non-atomic updates to the SST. This could lead to duplicates in the SST table, but that isn't an error in Excel and would probably only happen in a very small number of cases anyway. It is poor engineering, though.
  3. Move to an RwLock and do initial read-locked lookups to see if the string already exists in the SST, taking the write lock only if it doesn't.

I'll look into some of these options in the next few days and I'll post some updates as I go.

jmcnamara commented 1 year ago

There are some options to remove the mutex lock and contention:

So for now I've gone with Option 1, "Do a separate non-threaded pass of all the worksheet string data to build up the SST table." I've added a second prototype for this on the threaded2 branch.

Overall the results are good:

$ hyperfine target/release/examples/app_perf_test target/release/examples/app_perf_test_unthreaded --warmup 3
Benchmark 1: target/release/examples/app_perf_test
  Time (mean ± σ):     238.3 ms ±   2.5 ms    [User: 221.8 ms, System: 15.3 ms]
  Range (min … max):   234.8 ms … 244.2 ms    12 runs

Benchmark 2: target/release/examples/app_perf_test_unthreaded
  Time (mean ± σ):     236.4 ms ±   2.5 ms    [User: 220.0 ms, System: 15.0 ms]
  Range (min … max):   233.3 ms … 241.2 ms    12 runs

Summary
  'target/release/examples/app_perf_test_unthreaded' ran
    1.01 ± 0.02 times faster than 'target/release/examples/app_perf_test'

$ hyperfine target/release/examples/app_perf_test2 target/release/examples/app_perf_test2_unthreaded --warmup 3
Benchmark 1: target/release/examples/app_perf_test2
  Time (mean ± σ):     919.2 ms ±  14.0 ms    [User: 924.3 ms, System: 63.5 ms]
  Range (min … max):   901.7 ms … 949.2 ms    10 runs

Benchmark 2: target/release/examples/app_perf_test2_unthreaded
  Time (mean ± σ):     980.1 ms ±  11.3 ms    [User: 915.7 ms, System: 61.1 ms]
  Range (min … max):   964.3 ms … 1000.4 ms    10 runs

Summary
  'target/release/examples/app_perf_test2' ran
    1.07 ± 0.02 times faster than 'target/release/examples/app_perf_test2_unthreaded'

$ hyperfine target/release/examples/app_perf_test3 target/release/examples/app_perf_test3_unthreaded --warmup 3
Benchmark 1: target/release/examples/app_perf_test3
  Time (mean ± σ):     794.1 ms ±  14.5 ms    [User: 856.9 ms, System: 50.8 ms]
  Range (min … max):   781.7 ms … 832.8 ms    10 runs

Benchmark 2: target/release/examples/app_perf_test3_unthreaded
  Time (mean ± σ):     887.7 ms ±   5.7 ms    [User: 837.8 ms, System: 46.9 ms]
  Range (min … max):   876.2 ms … 898.0 ms    10 runs

Summary
  'target/release/examples/app_perf_test3' ran
    1.12 ± 0.02 times faster than 'target/release/examples/app_perf_test3_unthreaded'

Summary:

Not amazing, but I'll take a ~10% improvement for the amount of work involved. If anyone can try the threaded2 branch against real code I'd be interested to see the results.
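For anyone who wants to test it, Cargo can point a project at the branch directly via a git dependency:

```toml
[dependencies]
rust_xlsxwriter = { git = "https://github.com/jmcnamara/rust_xlsxwriter", branch = "threaded2" }
```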

I'll move on to see what can be done with the zip writer parts.

han1548772930 commented 1 year ago

Wow, that's great!

jmcnamara commented 1 year ago

I'm going to merge the second prototype, threaded2, onto main. I think it is the best I can do for now. There are still potential gains to be had from parallelizing the zipping, but after an initial look I'm going to leave that for another time/person.

jmcnamara commented 1 year ago

I've pushed these changes to crates.io in v0.44.0. It is the best I can do for now. Hopefully it will inspire some other analysis/contributions.

Closing.