jmcnamara / XlsxWriter

A Python module for creating Excel XLSX files.
https://xlsxwriter.readthedocs.io
BSD 2-Clause "Simplified" License
3.66k stars 632 forks source link

question: `constant_memory `actually much faster for very large data #1091

Closed ghylander closed 2 months ago

ghylander commented 2 months ago

Question

So not exactly a question, more like an statement.

In the docs it says that the performance should be approximately the same as normal mode., but in all of my tests, using constant_memory actually results in much faster data writing and file closing than not using it (half the time for my use case).

My use case is writing a workbook of 20 sheets, each sheet 200k rows x 32 columns in size, all numbers. As additional context, I have a list of 20 lists. Each of the 20 lists is a list of 31 lists. I write each of the 20 list in a different worksheet, and I write them in pairs (so 20 sheets written in 10 iterations). This is why you'll see Time to generate pair of worksheets below.

All I/O operations take place on a PCIe WD Black SN770, for reference.

I'm using time.perf_counter() to measure execution times.

The speed difference comes from: 1) Overall faster writing speed. Using constant_memory, each pair of sheets takes about 45 seconds, totalling 450 seconds. When not using constant_memory, it takes around 20 seconds per pair of sheets until memory runs out, then it writes all data, which takes anywhere from 150 to 300 seconds, totalling from 700 to 800 seconds.

2) Closing the file is significantly faster too. Using constant_memory, it takes about 120 seconds. When not using constant_memory, closing the workbook takes much longer, around 400 seconds.

Here are the outputs of my latest comparison, but the results have always been consistent:

Using constant_memory:

Time to generate pair of worksheets using xlsxwriter with constant_memory: 46.19610169995576
Total time elapsed so far using xlsxwriter with constant_memory: 54.478417800040916
Time to generate pair of worksheets using xlsxwriter with constant_memory: 46.302024799981155
Total time elapsed so far using xlsxwriter with constant_memory: 100.78118239995092
Time to generate pair of worksheets using xlsxwriter with constant_memory: 47.73539729998447
Total time elapsed so far using xlsxwriter with constant_memory: 148.53308600001037
Time to generate pair of worksheets using xlsxwriter with constant_memory: 46.95760850002989
Total time elapsed so far using xlsxwriter with constant_memory: 195.4907984000165
Time to generate pair of worksheets using xlsxwriter with constant_memory: 46.81849370000418
Total time elapsed so far using xlsxwriter with constant_memory: 242.30938790005166
Time to generate pair of worksheets using xlsxwriter with constant_memory: 46.63647949998267
Total time elapsed so far using xlsxwriter with constant_memory: 288.9459394000005
Time to generate pair of worksheets using xlsxwriter with constant_memory: 45.86998269998003
Total time elapsed so far using xlsxwriter with constant_memory: 334.816036800039
Time to generate pair of worksheets using xlsxwriter with constant_memory: 45.63154129998293
Total time elapsed so far using xlsxwriter with constant_memory: 380.44767759996466
Time to generate pair of worksheets using xlsxwriter with constant_memory: 45.9676662000129
Total time elapsed so far using xlsxwriter with constant_memory: 426.41541400004644
Time to generate pair of worksheets using xlsxwriter with constant_memory: 45.946267000050284
Total time elapsed so far using xlsxwriter with constant_memory: 472.36175100004766
Time to generate workbook using xlsxwriter with constant_memory: 464.0805752999149

Time to save workbook using xlsxwriter with constant_memory: 152.00226300000213
Total time to generate xlsx file with xlsxwriter with constant_memory: 624.3990084000397

Not using constant_memory:

Time to generate pair of worksheets using xlsxwriter: 19.09593490010593
Total time elapsed so far using xlsxwriter: 19.096536799916066
Time to generate pair of worksheets using xlsxwriter: 19.046457499964163
Total time elapsed so far using xlsxwriter: 38.143069399986416
Time to generate pair of worksheets using xlsxwriter: 161.56772080005612
Total time elapsed so far using xlsxwriter: 199.7157771999482
Time to generate pair of worksheets using xlsxwriter: 23.56007010000758
Total time elapsed so far using xlsxwriter: 223.28116349992342
Time to generate pair of worksheets using xlsxwriter: 20.014438300044276
Total time elapsed so far using xlsxwriter: 243.2957026999211
Time to generate pair of worksheets using xlsxwriter: 190.62783269991633
Total time elapsed so far using xlsxwriter: 433.931525999913
Time to generate pair of worksheets using xlsxwriter: 19.953834800049663
Total time elapsed so far using xlsxwriter: 453.89062199997716
Time to generate pair of worksheets using xlsxwriter: 20.207211900036782
Total time elapsed so far using xlsxwriter: 474.1042058999883
Time to generate pair of worksheets using xlsxwriter: 20.525868900003843
Total time elapsed so far using xlsxwriter: 494.63034699996933
Time to generate pair of worksheets using xlsxwriter: 282.40980679995846
Total time elapsed so far using xlsxwriter: 777.0470418999903
Time to generate workbook using xlsxwriter: 777.0544768999098

Time to save workbook using xlsxwriter: 396.0096784000052
Total time to generate xlsx file with xlsxwriter: 1173.404752500006
jmcnamara commented 2 months ago

Thanks for the information. A couple of observations:

  1. The close() time, or better still a total xlsxwriter runtime is probably a better metric, since creating worksheets faster may not help if the overall xlsx/workbook creation is the real bottleneck.
  2. Performance testing in general is hard since there are a lot of variables that can have an effect. From memory the performance was I/O bound so a fast disk/storage will improve the results. Also, there are memory limits that will eventually be hit (like your case?) where the generally linear performance profile of xlsxwriter will plateau.
  3. The statement "the performance should be approximately the same as normal mode" was mainly true at the time of writing, on the machine I was writing it on and with this test case: https://github.com/jmcnamara/XlsxWriter/blob/main/dev/performance/perf_pyx.py

Rerunning it now for 100,000 rows by 50 columns gives these results:

Which is approximately a 10% difference. Your 400s vs 120s difference is much more impressive (if it is measuring all the time taken by XlsxWriter).

Either way thanks for the feedback. It is always good to get some real world metrics.

ghylander commented 2 months ago

I was testing multiple libraries and approaches at creating the xlsx files, that's why I have so many timers.

This is how I timed these processes:

workbookXLSX = xlsxwriter.Workbook('test_xlsxwriter_constant_memory.xlsx', {'constant_memory': True})

xlsxwriterTimeStartWorkbook = time.perf_counter()
for pair_of_lists in superlist:
    xlsxwriterTimeStartWorksheetPair = time.perf_counter()

    worksheet = workbookXLSX.add_worksheet(list1_ttitle)
    for rowid, row in enumerate(list1):
        worksheet.write_row(rowId, 0, row)

    worksheet = workbookXLSX.add_worksheet(list2_ttitle)
    for rowid, row in enumerate(list2):
        worksheet.write_row(rowId, 0, row)

    print(f'Time to generate pair of worksheets using xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartWorksheetPair}')
    print(f'Total time elapsed so far using xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartTotal}')

print(f'Time to generate workbook using xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartWorkbook}')

xlsxwriterTimeStartWorkbookSave = time.perf_counter()
workbookXLSX.close()
print(f'Time to save workbook using xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartWorkbookSave}')
print(f'Total time to generate xlsx file with xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartTotal}')

The whole process, from before the first worksheet is written until after the file has been closed, takes about half the time

Of course, ideally one would be able to fit all data in memory.

jmcnamara commented 2 months ago

Thanks once more. That is good to know.

jmcnamara commented 2 months ago

If you can use Rust I'd be interested to know how your workload compares on rust_xlsxwriter: https://github.com/jmcnamara/rust_xlsxwriter

ghylander commented 2 months ago

If you can use Rust I'd be interested to know how your workload compares on rust_xlsxwriter: https://github.com/jmcnamara/rust_xlsxwriter

I'll try to give it a shot, I haven't used rust in quite long