tafia / calamine

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
MIT License
1.6k stars 155 forks source link

Multi-threaded .xlsx row loading #346

Open eliasjonsson023 opened 11 months ago

eliasjonsson023 commented 11 months ago

In a recent application of mine, 50,000 rows are read from an .xlsx-file. The part of reading the 50,000 rows is the bottle neck. Wouldn't it be nice to have multi-threaded reading of an .xlsx file? We do live in the multi-core CPU era after all.

jqnatividad commented 8 months ago

+1 on making loading faster. In an application I'm working on, I have a benchmark for loading 1m rows and then writing it to a file. https://qsv.dathere.com/benchmarks It takes ~12 seconds to load 1m rows and then write it to a CSV. The loading part took about 10 seconds, and writing to CSV takes less than 2 seconds.

RoloEdits commented 8 months ago

From the piz library, I found this comment:

/// Reads the given file from the ZIP archive.
///
/// Since each file in a ZIP archive is compressed independently,
/// multiple files can be read in parallel.

Which is what is needed as the structure of an .xlsx file is

 ./
├── [Content_Types].xml
├── _rels/
│  └── .rels
├── docProps/
│  ├── app.xml
│  └── core.xml
└── xl/
   ├── _rels/
   │  └── workbook.xml.rels
   ├── sharedStrings.xml
   ├── styles.xml
   ├── theme/
   │  ├── theme.xml
   │  └── theme1.xml
   ├── workbook.xml
   └── worksheets/
      └── sheet1.xml

Generally when you get a file handle/descriptor you have to deal with the OS changing its internal index of where its at when reading, which is why rust needs a &mut as needing an exclusive reference is generally the correct way to represent this kind of reading operation, but in doing so may provide some roughness when trying to work with threads.

The biggest gains would be from reading sharedStrings.xml along with each worksheet in parallel. This would really help in a worst case scenario where there are many sheets and each one only has strings and each string is unique. But when a sheet only has numbers, and its just one sheet, it wouldn't have any performance gains, and instead you have to deal with overhead of threads being spawned for no gain, though small. And really this isn't true multithreadedness as parsing an xml file. Its just multiple files being read at once.

Perhaps it might be possible to instead store the &Path in the workbook and then get separate handles to the file in each thread so they don't need to be passed around, needing to worry about the &mut reference.

Having a separate streams will also help with this issue:

When the BufReader is dropped, the contents of its buffer will be discarded. Creating multiple instances of a BufReader on the same stream can cause data loss. Reading from the underlying reader after unwrapping the BufReader with BufReader::into_inner can also cause data loss.

RoloEdits commented 8 months ago

https://github.com/tafia/calamine/blob/09c25bab35e2f958431b861f1e3581bb9ff74a95/src/xlsx.rs#L778-L798

The reading functions can probably be changed to return the data instead, and then populate the struct with that data, rather than relying on mutable changes to update it. Inside those functions they can spawn a thread and then return the join handle and block the main thread until the last one is finished.

RoloEdits commented 8 months ago

I ran some benchmarks on different rust xml parsers to see if there is a faster implementation and the two that seem to be the likely candidates were quick-xml and xmlparser. Currently calamine uses quick-xml. In the docs for roxmltree is has a benchmark for parsers:

test huge_xmlparser      ... bench:   1,672,879 ns/iter (+/- 20,140)
test huge_quick_xml      ... bench:   2,396,037 ns/iter (+/- 39,752)
test huge_xmlrs          ... bench:  36,258,312 ns/iter (+/- 180,438)

test large_xmlparser     ... bench:     730,787 ns/iter (+/- 22,924)
test large_quick_xml     ... bench:   1,250,053 ns/iter (+/- 21,943)
test large_xmlrs         ... bench:  11,239,516 ns/iter (+/- 76,937)

test medium_quick_xml    ... bench:     206,232 ns/iter (+/- 2,157)
test medium_xmlparser    ... bench:     240,737 ns/iter (+/- 4,531)
test medium_xmlrs        ... bench:   3,975,916 ns/iter (+/- 44,967)

test tiny_xmlparser      ... bench:       1,078 ns/iter (+/- 17)
test tiny_quick_xml      ... bench:       2,233 ns/iter (+/- 70)
test tiny_xmlrs          ... bench:      17,155 ns/iter (+/- 429)

Looking at this I decided to try out my own benchmark using the data from the qsv benchmark. I saved it as an xslx file and unzipped it to get the worksheet to use as the bench case, which turned out to be 1.3GB raw. I took the examples of both libraries and slightly modified them so as to take out the io overhead when printing to stdout.

use quick_xml::events::Event;
use quick_xml::reader::Reader;

// include file in binary to prevent IO mucking with bench
static XML: &str = include_str!(r"..\test\sheet1.xml");

fn main() {
    let which = std::env::args().nth(1).expect("must provide parser to run");

    match which.as_str() {
        "xmlparser" => xmlparser(),
        "quickxml" => quick_xml(),
        _ => {
            panic!("didnt provide either `xmlparser` or `quickxml` but instead provided `{which}`")
        }
    }
}

fn xmlparser() {
    let mut count = 0;
    for _token in xmlparser::Tokenizer::from(XML) {
        count += 1;
    }
    println!("tokens = {count}");
}

fn quick_xml() {
    let mut reader = Reader::from_str(XML);
    reader.trim_text(true);

    let mut texts = 0;
    let mut starts = 0;
    let mut buf = Vec::with_capacity(1024 );

    // The `Reader` does not implement `Iterator` because it outputs borrowed data (`Cow`s)
    loop {
        // NOTE: this is the generic case when we don't know about the input BufRead.
        // when the input is a &str or a &[u8], we don't actually need to use another
        // buffer, we could directly call `reader.read_event()`
        match reader.read_event_into(&mut buf) {
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            // exits the loop when reaching end of file
            Ok(Event::Eof) => break,
            Ok(Event::Start(_e)) => starts += 1,
            Ok(Event::Text(_e)) => texts += 1,
            // There are several other `Event`s we do not consider here
            _ => (),
        }
        // if we don't keep a borrow elsewhere, we can clear the buffer to keep memory usage low
        buf.clear();
    }
    println!("Starts: {starts}\nTexts:{texts}")
}

This is on a AMD RYZEN 9 5900X@4.0 GHz running Windows 11. These were the results I got:

hyperfine --warmup 3 'target\release\xml_parser_compare.exe xmlparser' 'target\release\xml_parser_compare.exe quickxml'
Benchmark 1: target\release\xml_parser_compare.exe xmlparser
  Time (mean ± σ):      6.407 s ±  0.022 s    [User: 6.033 s, System: 0.372 s]
  Range (min … max):    6.376 s …  6.439 s    10 runs

Benchmark 2: target\release\xml_parser_compare.exe quickxml
  Time (mean ± σ):      5.854 s ±  0.017 s    [User: 5.525 s, System: 0.339 s]
  Range (min … max):    5.817 s …  5.871 s    10 runs

Summary
  target\release\xml_parser_compare.exe quickxml ran
    1.09 ± 0.00 times faster than target\release\xml_parser_compare.exe xmlparser

So contrary to roxmltree benchmarks, quick-xml was actually 9% faster.

Memory usage for both was very low so I don't think that is a consideration to benchmark.

Considering that jqnatividad's test took 10 seconds to read, there is about 4 seconds of overhead that is being done outside of parsing of the worksheet file, which is really the only area that would be actionable here. Some time is going into the sharedStrings file surely, and that is another area that can be a time sink for parsing, so could even be the bulk of the remaining 4 seconds.

In the most general of use cases, the sheet is lazily read, as in it doesn't read the other worksheets, only the one specified. This is forcing a linear read of the sharedStrings file and then the sheet. It gets worse if you were to try to loop over the sheets, as it could be another 6 second read per iteration.

To make the reads, and therefore the parsing, concurrent, it might be best to read all files at the start so that the time is only of the time it takes to read the longest file. This comes at the cost of memory, however, as all the data would be in memory. This could prove an issue for very large datasets and causing OOM errors. The in-memory representation of the parsed data wont have the same overhead as the xml file itself, but if someone has 50 sheets with a million rows and any decent number of columns, then it could become a very real concern.

As far as parallel parsing of a single xml file, I found this excellent dissertation on a technique to do such, mainly by breaking the file into chunks and then parsing those on other threads. The lib is written in C++, PXML, but I can't find any source code for it. And implementing this is rust would be outside the scope of this library. But perhaps there could be a way to hack together something with xmlparser, as that works off an iterator, and something like rayon. But this has yet to be explored by me. Could also try, for both libraries, to spawn a bunch of tasks in the loops and then wait on them to be completed and save some metadata to make sure cell data is kept in order.

dimastbk commented 8 months ago

This comes at the cost of memory, however, as all the data would be in memory. This could prove an issue for very large datasets and causing OOM errors.

Yes, and it's problem now - https://github.com/tafia/calamine/issues/27.

I think, lazy loading is also important to reduce memory and increase performance in some cases (reading a few rows from large file, because now in this case calamine reads slower pure python library - openpyxl).

[75.00%] ··· io.excel.ReadExcel.time_read_excel                         1/10 failed
[75.00%] ··· ========== ============ ============ ============ ========== =========
             --                                     ext                            
             ---------- -----------------------------------------------------------
               engine       ods          xlsx         xlsm        xlsb       xls   
             ========== ============ ============ ============ ========== =========
              calamine   1.60±0.01s   1.51±0.01s   1.50±0.01s   954±20ms   930±4ms 
              default      failed     5.31±0.01s   5.35±0.01s   3.63±0s    1.32±0s 
             ========== ============ ============ ============ ========== =========
             For parameters: 'default', 'ods'

             asv: benchmark timed out (timeout 60.0s)

[100.00%] ··· io.excel.ReadExcelNRows.time_read_excel                   1/10 failed
[100.00%] ··· ========== ========= ============ ============ ============ =========
              --                                    ext                            
              ---------- ----------------------------------------------------------
                engine      ods        xlsx         xlsm         xlsb        xls   
              ========== ========= ============ ============ ============ =========
               calamine   795±2ms    638±1ms      639±2ms     98.4±0.3ms   109±1ms 
               default     failed   14.4±0.1ms   14.3±0.4ms   62.8±0.4ms   810±4ms 
              ========== ========= ============ ============ ============ =========
              For parameters: 'default', 'ods'

              asv: benchmark timed out (timeout 60.0s)

Reading using pandas 2.2 from https://github.com/MarkPflug/Benchmarks/tree/main/source/Benchmarks/Data.

RoloEdits commented 8 months ago

Using that data's sheet1, a 34.4MB large file, on my machine:

hyperfine --warmup 3 'target\release\xml_parser_compare.exe xmlparser' 'target\release\xml_parser_compare.exe quickxml'
Benchmark 1: target\release\xml_parser_compare.exe xmlparser
  Time (mean ± σ):     172.0 ms ±   0.8 ms    [User: 155.9 ms, System: 18.3 ms]
  Range (min … max):   171.2 ms … 174.0 ms    17 runs

Benchmark 2: target\release\xml_parser_compare.exe quickxml
  Time (mean ± σ):     178.6 ms ±   1.2 ms    [User: 167.7 ms, System: 15.7 ms]
  Range (min … max):   177.1 ms … 181.0 ms    16 runs

Summary
  target\release\xml_parser_compare.exe xmlparser ran
    1.04 ± 0.01 times faster than target\release\xml_parser_compare.exe quickxml

The other files are only a few KB each:

Those shouldn't take that long to parse, at least compared to the 34.4MB file. So for it to take 1.5 seconds in the python bench is certainly interesting. I will try to go through some other libraries implementation for the lazy reading, and generally see if we cant try to put together something to increase performance.

From the other benchmark I did, it was doing ~220MBs of throughput. From the vts-xml project I found this:

The world's fastest XML parser: On a Core2 2.5Ghz Desktop, VTD-XML outperforms DOM parsers by 5x~12x, delivering 150~250 MB/sec per core sustained throughput.

From the pugixml webpage on benchmarks, I found this: image

From the parse time chart, you can see that 250MB/s isn't really that fast. And my benchmark was done on much newer hardware. And the throughput I was getting was only ~220MB/s. This is all very rough of course, but not painting a great picture.

pugixml and RapidXml perform much faster. The downside is that they are DOM parsers, not SAX, which is not ideal for our use case.