Very slow (and increasingly slow) to read cells from a read-only worksheet

Phrogz commented 3 days ago

Looping through 1000 rows and finding the first cell in each row in a read-only workbook takes 13s and calls openpyxl.worksheet._reader.WorkSheetParser.parse_cell() over 7 million times.

I have a ~large workbook with some ~large sheets. The .xlsx is 583kB, there are 9 worksheets:

sheet='alpha'   1015x18 = 18270 cells
sheet='beta'    786x18  = 14148 cells
sheet='gamma'   778x18  = 14004 cells
sheet='delta'   778x18  = 14004 cells
sheet='epsilon' 90x78   = 7020 cells
sheet='zeta'    104x112 = 11648 cells
sheet='eta'     198x18  = 3564 cells
sheet='theta'   55x18   = 990 cells
sheet='iota'    35x18   = 630 cells

import openpyxl
import time
from openpyxl.worksheet._reader import WorkSheetParser

book = openpyxl.load_workbook("test.xlsx", read_only=True, data_only=True)
sheet = book['alpha']
maxr = sheet.max_row
rows = range(1, maxr + 1)

print(maxr)                        #=> 1015
print(WorkSheetParser.parse_calls) #=> 0
t0 = time.perf_counter()
[sheet.cell(r, 1) for r in rows]   # I'm not even asking for the value!
print(time.perf_counter() - t0)    #=> 13.412
print(WorkSheetParser.parse_calls) #=> 7291413

The only change I made to the library was this change in _reader.py:

    parse_calls = 0
    def parse_cell(self, element):
        WorkSheetParser.parse_calls += 1

Phrogz commented 3 days ago

The first row takes ~0.4ms. The last row takes ~31ms. The slowdown is linear across the number of rows.

Phrogz commented 3 days ago

Using iter_rows() is significantly faster:

print(sheet.max_row)               #=> 1015
print(WorkSheetParser.parse_calls) #=> 0
t0 = time.perf_counter()
[row[0] for row in sheet.iter_rows()]
print(time.perf_counter() - t0)    #=> 0.0368
print(WorkSheetParser.parse_calls) #=> 14535

Interestingly:

If the workbook is loaded with read_only=False this test gets 6x faster.
If the workbook is loaded with read_only=True, Worksheet.iter_cols() does not exist.

theorchard / openpyxl

Very slow (and increasingly slow) to read cells from a read-only worksheet #66