zarino closed this issue 10 years ago.
My #1 thing to investigate would be whether they are running out of the 500MB RAM limit on free servers. The symptom of this is that a process grows in memory usage until it hits the limit and is then apparently killed (by the kernel) with the equivalent of kill -9. Its exit status will be 137.
(This was a mail I sent to the developers@ mailing list.)
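If it helps, that failure mode is easy to check for from a wrapper script. A small sketch follows; the command line is only an example, not how the platform actually runs the tool:

```python
# Sketch: spotting an out-of-memory kill (SIGKILL) around a subprocess call.
# A shell reports this as exit status 128 + 9 = 137; Python's subprocess module
# reports a signal death as a negative return code (-9). Command is an example.
import subprocess

status = subprocess.call(['python', 'tool/create_downloads.py'])
if status in (137, -9):
    print "Process appears to have been killed, probably by the out-of-memory killer"
```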
Have these crossed anyone's radar already?
http://openpyxl.readthedocs.org/en/latest/
http://xlsxwriter.readthedocs.org/en/latest/
http://xlsxwriter.readthedocs.org/en/latest/working_with_memory.html
Maybe we could use them to solve the memory issues in the Excel sheet generation, if those aren't already solved.
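For reference, xlsxwriter's constant_memory mode streams each row out to a temporary file instead of holding the whole worksheet in RAM. A minimal sketch, in which the output file name, sheet name and row data are illustrative rather than taken from our tool:

```python
# Minimal sketch of xlsxwriter's constant_memory (streaming) mode.
# File name, sheet name and row data below are illustrative only.
import xlsxwriter

rows = [
    ['screen_name', 'text'],
    ['example_user', 'an example tweet'],
]

workbook = xlsxwriter.Workbook('all_tables.xlsx', {'constant_memory': True})
worksheet = workbook.add_worksheet('tweets')

for row_number, row in enumerate(rows):
    for col_number, value in enumerate(row):
        worksheet.write(row_number, col_number, value)

workbook.close()
```

The trade-off is that constant_memory requires rows to be written in order, which suits dumping a table straight out of SQLite.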
And another one:
@drj11: the 500MB RAM limit might be a problem, but at least one of the users in the original post was on a premium server, so it must be something else.
I've noticed a shedload of exec requests get sent out when the tool runs from a clean slate. So many, in fact, that we start getting 429 Too Many Requests responses from the exec endpoint. It looks like a bug somewhere in my logic (or maybe a JavaScript setInterval() that's being set more than once).
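While the underlying bug is tracked down, a crude back-off around those requests would at least stop us hammering the endpoint. A rough sketch only, with a placeholder URL and payload:

```python
# Sketch: retry a request with exponential back-off when the endpoint
# answers 429 Too Many Requests. URL and payload are placeholders.
import time
import requests

def post_with_backoff(url, payload, max_attempts=5):
    delay = 1.0
    response = None
    for attempt in range(max_attempts):
        response = requests.post(url, data=payload)
        if response.status_code != 429:
            return response
        # Honour Retry-After if the server sends one, otherwise back off exponentially.
        retry_after = response.headers.get('Retry-After')
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    return response
```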
create_downloads.py is sporadically encountering a SQLite error:
Traceback (most recent call last):
  File "tool/create_downloads.py", line 277, in <module>
    main()
  File "tool/create_downloads.py", line 123, in main
    save_state("{}.csv".format(make_filename(table['name'])), 'table', table['name'], 'generated')
  File "tool/create_downloads.py", line 243, in save_state
    }, '_state')
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sqlite.py", line 31, in save
    dt.create_table(data, table_name = table_name, error_if_exists = False)
  File "/usr/local/lib/python2.7/dist-packages/dumptruck/dumptruck.py", line 232, in create_table
    self.__check_and_add_columns(table_name, row)
  File "/usr/local/lib/python2.7/dist-packages/dumptruck/dumptruck.py", line 182, in __check_and_add_columns
    self.execute(sql, commit = True)
  File "/usr/local/lib/python2.7/dist-packages/dumptruck/dumptruck.py", line 136, in execute
    self.cursor.execute(sql, *args)
sqlite3.OperationalError: disk I/O error
I think reset_everything.sh is being executed too often, causing the scraperwiki.sqlite file to disappear from under dumptruck's feet, hence the disk I/O error.

@frabcus and I are looking into why reset_everything.sh is getting invoked, and whether we can replace it with something less drastic.
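In the meantime, a small retry around the save would at least keep a run alive if the database file briefly disappears. This is only a sketch of a possible workaround; the helper name and retry policy are made up:

```python
# Sketch: retry a scraperwiki save when SQLite reports a disk I/O error
# (e.g. because reset_everything.sh has just removed scraperwiki.sqlite).
# Helper name and retry policy are made up for illustration.
import time
import sqlite3
import scraperwiki

def save_state_with_retry(data, table_name='_state', attempts=3):
    for attempt in range(attempts):
        try:
            scraperwiki.sqlite.save([], data, table_name)
            return
        except sqlite3.OperationalError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # give the file a moment to reappear
```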
There have been a bunch of fixes to the UI, which solve the users' immediate problems.
The efficiency of xlwt (the library we use to create Excel files) remains an issue. In my own highly unscientific testing, create_downloads.py took roughly 500MB of RAM to turn a single 60,000-row table into a CSV (~18MB on disk) and an XLS file (~20MB on disk). With two identical 60,000-row tables in the database, it took about 650MB of RAM to make two CSVs (~18MB each) and an XLS (~40MB).
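For the CSV half at least, streaming rows straight from SQLite to disk keeps memory roughly flat regardless of table size. A rough illustration, in which the database path, table name and output file name are placeholders:

```python
# Rough illustration: stream one table from scraperwiki.sqlite to a CSV one
# row at a time. Database path, table name and output file are placeholders.
import csv
import sqlite3

def dump_table_to_csv(db_path, table_name, csv_path):
    connection = sqlite3.connect(db_path)
    cursor = connection.execute('SELECT * FROM "{0}"'.format(table_name))
    column_names = [description[0] for description in cursor.description]
    with open(csv_path, 'wb') as output:  # 'wb' because this is Python 2
        writer = csv.writer(output)
        writer.writerow(column_names)
        for row in cursor:  # the cursor fetches rows lazily, not all at once
            writer.writerow([value.encode('utf-8') if isinstance(value, unicode) else value
                             for value in row])
    connection.close()

dump_table_to_csv('scraperwiki.sqlite', 'tweets', 'tweets.csv')
```

Something similar should be possible on the Excel side with the xlsxwriter constant_memory mode mentioned above, whereas xlwt builds the whole workbook in memory before writing it out.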
Btw, if you want to see the memory high-water mark of a command (echo here is just an example):
$ /usr/bin/time -v echo |& grep Maximum
Maximum resident set size (kbytes): 2432
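The same figure is available from inside a Python 2 script, which might be handy for logging it per run; a small sketch:

```python
# Sketch: report this process's peak resident set size from within Python 2.
# On Linux ru_maxrss is in kilobytes, matching the /usr/bin/time output above.
import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print "Maximum resident set size (kbytes): {0}".format(peak_kb)
```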
There have been some significant improvements in #45 around streaming the input. Fixing the Excel output still remains, though.
Not sure this broad issue is useful.
Users are reporting erratic behaviour when downloading large Twitter datasets.
One user:
Another: