scraperwiki / spreadsheet-download-tool

A ScraperWiki plugin for downloading data from a box as a CSV or Excel spreadsheet
BSD 2-Clause "Simplified" License

CSV and XLS extraction fails on tables with more than 65536 rows #49

Closed zarino closed 10 years ago

zarino commented 10 years ago

Extracting a dataset with more than 65536 rows results in this exception:

Traceback (most recent call last):
  File "tool/create_downloads.py", line 565, in <module>
    main()
  File "tool/create_downloads.py", line 279, in main
    generate_for_box(box_url)
  File "tool/create_downloads.py", line 308, in generate_for_box
    dump_tables(excel_output, tables, paged_rows)
  File "tool/create_downloads.py", line 348, in dump_tables
    write_excel_csv(excel_output, table['name'], filename, rows)
  File "tool/create_downloads.py", line 336, in write_excel_csv
    write_excel_row(row)
  File "tool/create_downloads.py", line 263, in write_row
    sheet.write(j, i, content)
  File "/usr/local/lib/python2.7/dist-packages/xlwt/Worksheet.py", line 1030, in write
    self.row(r).write(c, label, style)
  File "/usr/local/lib/python2.7/dist-packages/xlwt/Worksheet.py", line 1078, in row
    self.__rows[indx] = self.Row(indx, self)
  File "/usr/local/lib/python2.7/dist-packages/xlwt/Row.py", line 42, in __init__
    raise ValueError("row index (%r) not an int in range(65536)" % rowx)
ValueError: row index (65536) not an int in range(65536)

When this happens, it isn't just the current Excel worksheet that gets discarded: the entire Excel file is thrown away, and all the CSV files with it.

There are a few options here:

  1. Check the length of each table before attempting to write it, and if it exceeds 65536 rows, skip the Excel worksheet for that table (see the sketch after this list).
  2. Just catch the exception (the "ask forgiveness" approach) and carry on.
  3. Create XLSX files instead of XLS files, which avoids the 65536-row limit entirely and has the added advantage of replacing the inefficient xlwt library.
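
A minimal sketch of option 1, assuming Python 2 with xlwt as in the traceback, and assuming each table's rows are a materialised list; the dump_tables_to_xls name and the tables structure are hypothetical, not the tool's actual code:

import xlwt

XLS_MAX_ROWS = 65536  # hard row limit of the legacy XLS (BIFF8) format

def dump_tables_to_xls(path, tables):
    # Hypothetical helper loosely modelled on the traceback above, not
    # the tool's actual code. Tables too big for XLS are skipped, so
    # the remaining worksheets still get written.
    book = xlwt.Workbook()
    skipped = []
    wrote_any = False
    for table in tables:
        rows = table['rows']
        if len(rows) > XLS_MAX_ROWS:
            skipped.append(table['name'])
            continue
        sheet = book.add_sheet(table['name'])
        for j, row in enumerate(rows):
            for i, content in enumerate(row):
                sheet.write(j, i, content)
        wrote_any = True
    if wrote_any:  # xlwt cannot save a workbook with no worksheets
        book.save(path)
    return skipped

Returning the skipped table names would also give the UI something concrete to report instead of failing silently.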

Incidentally, the exception is not communicated to the user, who just sees an infinite spinner. That's covered by this issue: https://github.com/scraperwiki/spreadsheet-download-tool/issues/48

zarino commented 10 years ago

An error is now presented when the limit is hit, but this is still a long way from actually fixing the problem:

[Screenshot: the error message shown to the user when the row limit is hit, 2013-11-07]

zarino commented 10 years ago

The more far-sighted solution would be to replace xlwt with something like PyExcelerate or OpenPyXL.

See https://github.com/scraperwiki/spreadsheet-download-tool/issues/41 for more discussion of alternative libraries.

pwaller commented 10 years ago

IIRC OpenPyXL was used before and was moved away from due to poor performance. xlsxwriter looks like a plausible replacement.
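
For reference, a rough sketch of the same dump using xlsxwriter (the function name and tables structure are assumptions, not the tool's code): its constant_memory mode streams each row to disk as it is written, which is the performance property that mattered with OpenPyXL, and the XLSX format raises the row ceiling to 1,048,576.

import xlsxwriter

def dump_tables_to_xlsx(path, tables):
    # constant_memory mode flushes rows to disk as they are written,
    # avoiding the in-memory growth that makes xlwt slow on big tables.
    # It requires rows to be written in order, as they are here.
    book = xlsxwriter.Workbook(path, {'constant_memory': True})
    for table in tables:
        sheet = book.add_worksheet(table['name'])
        for j, row in enumerate(table['rows']):
            for i, content in enumerate(row):
                sheet.write(j, i, content)
    book.close()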

fawkesley commented 10 years ago

I've had a support email about this. It's a rather serious regression for our paying customers.

It was a deliberate feature that the CSVs would always be generated regardless of whether the data was too big for the XLS file format.
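
One way to restore that guarantee, as a sketch only (the generate_downloads name, the tables structure, and the dump_tables_to_xls helper are assumptions, not the tool's actual code; Python 2 file modes as per the traceback): write every CSV first, unconditionally, and treat any Excel failure as non-fatal.

import csv
import os

def generate_downloads(tables, csv_dir, xls_path):
    # CSVs are written first and unconditionally, so a failure in the
    # Excel step can never take them down with it.
    for table in tables:
        with open(os.path.join(csv_dir, table['name'] + '.csv'), 'wb') as f:
            csv.writer(f).writerows(table['rows'])
    try:
        dump_tables_to_xls(xls_path, tables)  # e.g. the helper sketched above
    except ValueError:
        pass  # XLS row limit hit; the CSVs are already safe on disk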

pwaller commented 10 years ago

@paulfurley Apologies, I did quite a bit of refactoring; I thought I had preserved the old behaviour, but evidently not. Is this a regression? If so, it wasn't clear that always generating the CSVs was a requirement.