jazzband / tablib

Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.
https://tablib.readthedocs.io/
MIT License
4.61k stars 593 forks source link

Loading xlsx in read_only mode yields giant sheet #482

Closed dmosberger closed 3 years ago

dmosberger commented 3 years ago

Loading spreadsheet tst.xlsx with read_only=True results in tablib trying to construct a sheet with 1048576 rows, which can take a long time and/or exhaust available memory:

from openpyxl import load_workbook
wb=load_workbook(filename='tst.xlsx', read_only=True)
wb.active.max_row
1048576

with read_only=False, openpyxl returns the expected number of rows:

from openpyxl import load_workbook
wb=load_workbook(filename='tst.xlsx', read_only=False)
wb.active.max_row
4

Openpyxl has this comment about read-only mode:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly.

so it seems to me it'd be safer if tablib were to use read_only=False or are there other issues with doing that? If the latter is the case, would it be possible to provide a means to force a spreadsheet to be loaded with read-only turned off? Unfortunately, we don't have any control over the tool that's generating the "incorrect" files as they come from a third-party web site.

fix.diff.txt

claudep commented 3 years ago

I'd be reluctant to renounce on the read_only optimization due to bad third-party behavior. However, I think we could expose a read_only parameter for import_set and import_book methods. Feel free to prepare a patch for this.

dmosberger commented 3 years ago

Thanks for getting this resolved so quickly. Much appreciated!