scraperwiki / spreadsheet-download-tool

A ScraperWiki plugin for downloading data from a box as a CSV or Excel spreadsheet
BSD 2-Clause "Simplified" License

1 DAY: Stress test (and fix) new Spreadsheet Download tool #41

Closed: zarino closed this 10 years ago

zarino commented 10 years ago

Users are reporting erratic behaviour when downloading large Twitter datasets.

One user:

When I try to download as a file, a bunch of errors flash for a second (I can't take a screenshot because it goes away too fast) and the process is stuck

Another:

Eventually it gives a screen full of errors, but it clears fast and repeats. […] Downloads are working on small sets of tweets. I'm generating a new set of followers for the same person now and can test in a few hours. […] A new set of > 61K rows fails, as does the old one, but 300 rows works.

drj11 commented 10 years ago

My #1 thing to investigate would be whether they are running into the 500MB RAM limit on the free plan. The symptom of this is that a process grows in memory usage until it hits the limit and is then apparently killed (by the kernel) with the equivalent of kill -9. Its exit status will be 137.
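If it's that, the quickest check is the exit status: a process killed with SIGKILL exits with 128 + 9 = 137 in a shell, and Python's subprocess reports it as a negative signal number. A minimal sketch, assuming create_downloads.py (the script in the traceback further down) is launched as a child process:

```python
# Sketch: check whether the download script died with SIGKILL (what the
# kernel's OOM killer sends) rather than exiting normally.
import signal
import subprocess

status = subprocess.call(["python", "tool/create_downloads.py"])
if status == -signal.SIGKILL:  # subprocess reports signal deaths as negative numbers
    print("Killed by SIGKILL (exit status 137 in a shell); likely out of memory")
elif status != 0:
    print("Exited with status {}".format(status))
```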

pwaller commented 10 years ago

(was a mail I sent to the developers@ mailing list)

Have these crossed anyone's radar already?

http://openpyxl.readthedocs.org/en/latest/
http://xlsxwriter.readthedocs.org/en/latest/
http://xlsxwriter.readthedocs.org/en/latest/working_with_memory.html

Maybe we can solve the memory issues of the Excel sheet generation, if they aren't already solved.

And another one:

https://github.com/kz26/PyExcelerate
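
Of those, the working_with_memory page above describes XlsxWriter's constant_memory mode, which writes rows out as it goes instead of holding the whole workbook in memory. A rough sketch of what using it could look like (note it produces .xlsx rather than the .xls we currently generate with xlwt; the filename, sheet name and row source here are made up):

```python
# Sketch of XlsxWriter's constant_memory mode: rows are flushed to a temp file
# as you go, so memory stays roughly constant regardless of table size.
# Constraint: rows must be written in order and can't be revisited.
import xlsxwriter

workbook = xlsxwriter.Workbook("all_tables.xlsx", {"constant_memory": True})
worksheet = workbook.add_worksheet("tweets")              # illustrative sheet name
for row_number, row in enumerate(rows_from_sqlite()):     # hypothetical row source
    worksheet.write_row(row_number, 0, row)
workbook.close()
```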

zarino commented 10 years ago

@drj11: the 500MB RAM limit might be a problem, but at least one of the users in the original post was on a premium server, so it must be something else.

I've noticed a shed load of exec requests get sent out when the tool runs from a clean slate. So many, in fact, that we start getting 429 Too Many Requests responses from the exec endpoint. It looks like a bug somewhere in my logic (or maybe a JavaScript setInterval() that's being set more than once).

zarino commented 10 years ago

create_downloads.py is sporadically encountering a SQLite error:

Traceback (most recent call last):
  File "tool/create_downloads.py", line 277, in <module>
    main()
  File "tool/create_downloads.py", line 123, in main
    save_state("{}.csv".format(make_filename(table['name'])), 'table', table['name'], 'generated')
  File "tool/create_downloads.py", line 243, in save_state
    }, '_state')
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sqlite.py", line 31, in save
    dt.create_table(data, table_name = table_name, error_if_exists = False)
  File "/usr/local/lib/python2.7/dist-packages/dumptruck/dumptruck.py", line 232, in create_table
    self.__check_and_add_columns(table_name, row)
  File "/usr/local/lib/python2.7/dist-packages/dumptruck/dumptruck.py", line 182, in __check_and_add_columns
    self.execute(sql, commit = True)
  File "/usr/local/lib/python2.7/dist-packages/dumptruck/dumptruck.py", line 136, in execute
    self.cursor.execute(sql, *args)
sqlite3.OperationalError: disk I/O error

I think reset_everything.sh is being executed too often, causing the scraperwiki.sqlite file to disappear from under dumptruck's feet, hence the I/O error.

@frabcus and I are looking into why reset_everything.sh is getting invoked, and whether we can replace it with something less drastic.
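
If it turns out the reset only needs to clear the tool's bookkeeping, one less drastic option (just a sketch, not a decided fix) would be to drop the _state table in place rather than deleting scraperwiki.sqlite, so dumptruck never has the file removed from under an open connection:

```python
# Sketch: clear the download state without deleting scraperwiki.sqlite.
# _state is the table written by save_state() in the traceback above.
import sqlite3

conn = sqlite3.connect("scraperwiki.sqlite")
conn.execute("DROP TABLE IF EXISTS _state")
conn.commit()
conn.close()
```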

zarino commented 10 years ago

Related commits:

https://github.com/scraperwiki/spreadsheet-download-tool/commit/aac4a94ddde70627c1f471225829956bb864caa3
https://github.com/scraperwiki/spreadsheet-download-tool/commit/d743795731431567bda26642e1dd495f8941e856

zarino commented 10 years ago

There have been a bunch of fixes to the UI, which solve the users' immediate problems.

The efficiency of xlwt (the library we use to create Excel files) remains an issue. In my own highly unscientific testing, create_downloads.py took roughly 500MB of RAM to turn a single 60,000-row table into a CSV (~18MB on disk) and an XLS file (~20MB on disk). When there were two identical tables in the database, each of 60,000 rows, it took about 650MB of RAM to make two CSVs (~18MB each) and an XLS (~40MB).
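
For context on why that scales so badly: xlwt builds the entire workbook in memory and only writes to disk when save() is called, so peak memory grows with the total number of cells. A minimal sketch of the usual xlwt write pattern (not necessarily our exact code):

```python
# Minimal xlwt pattern: every write() keeps the cell in memory; nothing
# reaches disk until save(), so peak memory scales with total cell count.
import xlwt

workbook = xlwt.Workbook()
worksheet = workbook.add_sheet("table1")        # illustrative sheet name
for row_number, row in enumerate(rows):         # `rows` is a hypothetical iterable of tuples
    for col_number, value in enumerate(row):
        worksheet.write(row_number, col_number, value)
workbook.save("all_tables.xls")                 # whole workbook serialised here
```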

pwaller commented 10 years ago

Btw, if you want to see the high water mark:

$ /usr/bin/time -v echo |& grep Maximum
    Maximum resident set size (kbytes): 2432
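
A rough in-process equivalent, if you'd rather log the peak from Python itself (ru_maxrss is reported in kilobytes on Linux, bytes on macOS):

```python
# Sketch: report this process's own peak resident set size.
import resource

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Maximum resident set size (kbytes): {}".format(peak))
```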
pwaller commented 10 years ago

There are some significant improvements in #45 on streaming the input. Fixing the Excel output still remains, though.

zarino commented 10 years ago

See also https://github.com/scraperwiki/spreadsheet-download-tool/issues/34

frabcus commented 10 years ago

Not sure this broad issue is useful.