tafia / calamine

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
MIT License

feat(docs): add benchmarks and plots in readme #367

Closed RoloEdits closed 8 months ago

RoloEdits commented 8 months ago

I went through and benchmarked some other libraries to see where calamine stands compared to other ecosystems, and decided to add the results to the docs. After seeing the results, I also decided to file an issue for excelize.

I wanted to add umya-spreadsheet, but it didn't seem to have any way to directly iterate over the rows? At least, I couldn't tell from the wording in the docs or from the function signatures. If you manage to figure out a way to do that and want another Rust comparison, I don't mind adding it.

The Git history is a bit messy with fixes, so squashing might be best.

RoloEdits commented 8 months ago

I got some pointers from a maintainer of excelize, so I need to update the data. I'll try to get to it as soon as I can.

dimastbk commented 8 months ago

calamine vs openpyxl (read_only mode), python3.11 on my PC:

Benchmark 1: calamine
  Time (mean ± σ):     21.299 s ±  0.093 s    [User: 20.361 s, System: 0.931 s]
  Range (min … max):   21.193 s … 21.512 s    10 runs

Benchmark 1: openpyxl
  Time (mean ± σ):     134.424 s ±  0.582 s    [User: 133.749 s, System: 0.654 s]
  Range (min … max):   133.057 s … 135.192 s    10 runs

Code:

```python3
from openpyxl import load_workbook

wb = load_workbook(filename='NYC_311_SR_2010-2020-sample-1M.xlsx', read_only=True)
ws = wb['NYC_311_SR_2010-2020-sample-1M']

for row in ws.rows:
    _ = row

# Close the workbook after reading
wb.close()
```
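To put the two means above side by side, here is a minimal sketch that computes the relative speedup. The variable and function names are made up for illustration, and the numbers are simply the mean times copied from the benchmark output in this comment.

```python
# Mean wall-clock times in seconds, copied from the benchmark output above.
calamine_mean_s = 21.299
openpyxl_mean_s = 134.424


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Hypothetical helper name; this only restates the measured means as a ratio.
ratio = speedup(openpyxl_mean_s, calamine_mean_s)
print(f"calamine is roughly {ratio:.1f}x faster than openpyxl (read_only) on this file")
```

This only compares mean run times on one machine and one file, so the ratio should be read as a rough indication rather than a general result.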
dimastbk commented 8 months ago

> I wanted to add umya-spreadsheet, but it didn't seem to have any way to directly iterate over the rows?

I couldn't find one either. With the code below, the application allocated over 10 GB of memory, so I killed it.

    let path = std::path::Path::new("NYC_311_SR_2010-2020-sample-1M.xlsx");
    let book = umya_spreadsheet::reader::xlsx::read(path).unwrap();
    let sheet = book.get_sheet_by_name("NYC_311_SR_2010-2020-sample-1M").unwrap();
    let _ = sheet.get_collection_to_hashmap();

    // OR
    let path = std::path::Path::new("NYC_311_SR_2010-2020-sample-1M.xlsx");
    let book = umya_spreadsheet::reader::xlsx::lazy_read(path).unwrap();
    let _ = book.get_lazy_read_sheet_cells(&0).unwrap();
dimastbk commented 8 months ago

What version of python did you use?

python3.11 138.470 s
python3.10 158.893 s
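As a rough illustration of the gap between the two interpreter versions, the sketch below turns the two timings above into a percentage. It assumes both figures are timings of the same script on the same machine, which the comment does not state explicitly; the variable names are made up.

```python
# Timings in seconds, copied from the two lines above.
# Assumption: both runs used the same script, file, and hardware.
py311_s = 138.470
py310_s = 158.893

# How much longer the Python 3.10 run took, relative to the 3.11 run.
slowdown_pct = (py310_s / py311_s - 1) * 100
print(f"python3.10 took about {slowdown_pct:.1f}% longer than python3.11")
```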
RoloEdits commented 8 months ago

@dimastbk Python 3.11.5. What kind of hardware are you using?

dimastbk commented 8 months ago

Thanks. I was just surprised by such a big difference between python3.10 and 3.11. Intel® Core™ i7-9700, KDE Neon 5.27

RoloEdits commented 8 months ago

I'm also interested in how much slower mine is compared to yours: 100 seconds. I'm not even sure what could account for that much of a difference.

tafia commented 8 months ago

Thanks! Very informative