scraperwiki / spreadsheet-download-tool

A ScraperWiki plugin for downloading data from a box as a CSV or Excel spreadsheet
BSD 2-Clause "Simplified" License

Pyexcelerate #55

Closed: morty closed this 10 years ago

morty commented 10 years ago

This uses PyExcelerate to generate an XLSX file. I have checked in a modified copy of the library as a temporary measure.

We still need to do many of the things in the other pull request, as the code still doesn't correctly handle errors that occur while the output files are being created.

I don't think we should deploy until Monday. If people want to test it out, it has been deployed in the "Pyexcelerate Test" tool in the ScraperWiki Test datahub @pwaller @paulfurley @drj11.

morty commented 10 years ago

Memory usage might be a problem. I've just tested with a million-row spreadsheet (with small rows) and it got up to 9.1% of the 8 GB on Premium, which is roughly 0.7 GB. As @paulfurley points out, this could be a problem at midnight, when lots of datasets update, hit the status endpoint and cause the spreadsheets to be regenerated.

pwaller commented 10 years ago

This is no worse than it currently is (65k rows -> 265.03 MB for xlwt, 264.0 MB for PyExcelerate), but I believe the road is paved for streaming the output using PyExcelerate.

My approach would be to add a new stream_save() which accepts a function that returns a generator of sheets, each of which in turn provides a generator of rows. Anything that needs to know how many sheets there are would have to be written last. Merged cells are slightly annoying, since they are represented separately from the rows. Assuming merged cells are relatively rare, it's probably reasonable to just keep them in memory and write them out once the rows are done.
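
To make the shape of that proposal concrete, here is a minimal sketch in Python. It is not PyExcelerate code: stream_save() is only the name suggested above, and the signature, the example_sheets() helper and the simplified XML parts are all assumptions made for illustration. The output is a zip of toy XML files, not a valid XLSX. It only shows the ordering: rows are streamed one at a time, the (assumed small) merged-cell list is held in memory and written after the rows, and the index that needs the full sheet list is written last.

```python
import io
import zipfile
from xml.sax.saxutils import escape, quoteattr


def stream_save(fileobj, get_sheets):
    """Hypothetical streaming writer (illustrative layout, NOT real XLSX).

    get_sheets() must return an iterable of (name, rows, merges) tuples,
    where rows is any iterable of row tuples (typically a generator) and
    merges is a small in-memory list of ranges such as "A1:B1".
    """
    sheet_names = []
    with zipfile.ZipFile(fileobj, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, (name, rows, merges) in enumerate(get_sheets(), start=1):
            sheet_names.append(name)
            # ZipFile.open(..., "w") writes one member incrementally
            # (Python 3.6+), so rows never have to be held in memory.
            with archive.open("sheets/sheet%d.xml" % index, "w") as part:
                part.write(b"<sheet><rows>")
                for row in rows:
                    cells = "".join("<c>%s</c>" % escape(str(v)) for v in row)
                    part.write(("<r>%s</r>" % cells).encode("utf-8"))
                part.write(b"</rows><merges>")
                # Merged ranges were kept in memory; emit them after the rows.
                for ref in merges:
                    part.write(("<m ref=%s/>" % quoteattr(ref)).encode("utf-8"))
                part.write(b"</merges></sheet>")
        # The index depends on the complete sheet list, so it goes last.
        entries = "".join(
            '<sheet name=%s part="sheets/sheet%d.xml"/>' % (quoteattr(n), i)
            for i, n in enumerate(sheet_names, start=1)
        )
        archive.writestr("workbook.xml", "<workbook>%s</workbook>" % entries)
    return sheet_names


def example_sheets():
    """Hypothetical data source: each sheet's rows are produced lazily."""
    yield "First", ((i, "row %d" % i) for i in range(3)), ["A1:B1"]
    yield "Second", iter([("x", "<y>")]), []
```

Calling stream_save(io.BytesIO(), example_sheets) writes both sheets and then the index. Peak memory in this sketch depends on the size of one row plus the merged-cell lists rather than on the number of rows, which is the property the streaming proposal is after.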