Closed · ghylander closed this 2 months ago
Thanks for the information. A couple of observations:
- The `close()` time, or better still the total XlsxWriter runtime, is probably a better metric, since creating worksheets faster may not help if the overall xlsx/workbook creation is the real bottleneck.
- XlsxWriter will plateau.

Rerunning it now for 100,000 rows by 50 columns gives these results:
That is approximately a 10% difference. Your 400s vs. 120s difference is much more impressive (if it is measuring all the time taken by XlsxWriter).
Either way, thanks for the feedback. It is always good to get some real world metrics.
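The measurement suggested above — timing the whole run, including `close()` — could be sketched like this; the data sizes and file names are arbitrary, and `xlsxwriter` must be installed:

```python
import time

import xlsxwriter


def build_workbook(filename, constant_memory, n_rows=1_000, n_cols=50):
    """Write one worksheet of numeric data and return the total runtime."""
    start = time.perf_counter()
    workbook = xlsxwriter.Workbook(filename,
                                   {'constant_memory': constant_memory})
    worksheet = workbook.add_worksheet()
    row_data = list(range(n_cols))
    for row in range(n_rows):
        worksheet.write_row(row, 0, row_data)
    workbook.close()  # include close() in the measurement
    return time.perf_counter() - start


normal = build_workbook('test_normal.xlsx', constant_memory=False)
constant = build_workbook('test_constant.xlsx', constant_memory=True)
print(f'normal: {normal:.2f}s, constant_memory: {constant:.2f}s')
```

Timing each mode end to end like this avoids the trap of comparing only the worksheet-writing loops while one mode defers most of its work to `close()`.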
I was testing multiple libraries and approaches at creating the xlsx files, that's why I have so many timers.
This is how I timed these processes:
```python
import time

import xlsxwriter

xlsxwriterTimeStartTotal = time.perf_counter()
workbookXLSX = xlsxwriter.Workbook('test_xlsxwriter_constant_memory.xlsx',
                                   {'constant_memory': True})
xlsxwriterTimeStartWorkbook = time.perf_counter()
for pair_of_lists in superlist:
    xlsxwriterTimeStartWorksheetPair = time.perf_counter()
    # Each pair holds two (title, rows) sheets to be written together.
    (list1_title, list1), (list2_title, list2) = pair_of_lists
    worksheet = workbookXLSX.add_worksheet(list1_title)
    for rowid, row in enumerate(list1):
        worksheet.write_row(rowid, 0, row)
    worksheet = workbookXLSX.add_worksheet(list2_title)
    for rowid, row in enumerate(list2):
        worksheet.write_row(rowid, 0, row)
    print(f'Time to generate pair of worksheets using xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartWorksheetPair}')
    print(f'Total time elapsed so far using xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartTotal}')
print(f'Time to generate workbook using xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartWorkbook}')
xlsxwriterTimeStartWorkbookSave = time.perf_counter()
workbookXLSX.close()
print(f'Time to save workbook using xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartWorkbookSave}')
print(f'Total time to generate xlsx file with xlsxwriter with constant_memory: {time.perf_counter() - xlsxwriterTimeStartTotal}')
```
The whole process, from before the first worksheet is written until after the file has been closed, takes about half the time with `constant_memory`.
Of course, ideally one would be able to fit all data in memory.
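One way to see the memory side of this trade-off, independent of XlsxWriter itself, is the stdlib `tracemalloc` module. This is a minimal sketch (with made-up row counts) comparing building all rows up front against generating them one at a time, analogous to normal mode vs. `constant_memory`:

```python
import tracemalloc


def peak_memory(func):
    """Return (result, peak traced bytes) for a single call to func."""
    tracemalloc.start()
    result = func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


def all_in_memory():
    # Build every row before "writing": all rows are alive at once.
    rows = [[float(c) for c in range(32)] for _ in range(10_000)]
    return sum(len(r) for r in rows)


def streamed():
    # Generate one row at a time: only one row is alive at any moment.
    total = 0
    for _ in range(10_000):
        row = [float(c) for c in range(32)]
        total += len(row)
    return total


_, peak_all = peak_memory(all_in_memory)
_, peak_stream = peak_memory(streamed)
print(peak_all, peak_stream)  # the streaming version peaks far lower
```

Once the all-in-memory variant outgrows RAM and the OS starts swapping, wall-clock time degrades too, which is consistent with the slowdown reported above.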
Thanks once more. That is good to know.
If you can use Rust I'd be interested to know how your workload compares on rust_xlsxwriter: https://github.com/jmcnamara/rust_xlsxwriter
I'll try to give it a shot; I haven't used Rust in quite a long time.
Question
So not exactly a question, more like a statement.
In the docs it says that "the performance should be approximately the same as normal mode", but in all of my tests, using `constant_memory` actually results in much faster data writing and file closing than not using it (half the time for my use case).

My use case is writing a workbook of 20 sheets, each sheet 200k rows x 32 columns in size, all numbers. As additional context, I have a list of 20 lists. Each of the 20 lists is a list of 31 lists. I write each of the 20 lists in a different worksheet, and I write them in pairs (so 20 sheets written in 10 iterations). This is why you'll see "Time to generate pair of worksheets" below.

All I/O operations take place on a PCIe WD Black SN770, for reference.
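For context, the pairing described above can be reproduced with a small sketch; the `sheets` and `superlist` names and the flat (title, rows) layout are assumptions, since the actual data structure isn't shown:

```python
# Hypothetical layout: 20 (title, rows) tuples, one per worksheet.
# Tiny 4-row sheets stand in for the real 200k-row data.
sheets = [(f'Sheet{n}', [[n] * 32 for _ in range(4)]) for n in range(20)]

# Group consecutive sheets into pairs, so 20 sheets are written in 10 iterations.
superlist = [(sheets[i], sheets[i + 1]) for i in range(0, len(sheets), 2)]

print(len(superlist))  # 10 pairs
for (title1, rows1), (title2, rows2) in superlist:
    pass  # write both worksheets here, timing the pair together
```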
I'm using `time.perf_counter()` to measure execution times.

The speed difference comes from:

1) Overall faster writing speed. Using `constant_memory`, each pair of sheets takes about 45 seconds, totalling 450 seconds. When not using `constant_memory`, it takes around 20 seconds per pair of sheets until memory runs out, then it writes all data, which takes anywhere from 150 to 300 seconds, totalling from 700 to 800 seconds.

2) Closing the file is significantly faster too. Using `constant_memory`, it takes about 120 seconds. When not using `constant_memory`, closing the workbook takes much longer, around 400 seconds.

Here are the outputs of my latest comparison, but the results have always been consistent:
Using `constant_memory`:

Not using `constant_memory`: