frabcus opened 10 years ago
Profiling just one batch of 500, using the Python profiler (http://docs.python.org/2/library/profile.html), sorted by `tottime`.
```
Mon Dec 16 15:25:07 2013    profile.out

         1788605 function calls (1782295 primitive calls) in 7.359 seconds

   Ordered by: internal time
   List reduced from 2061 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   396882    0.692    0.000    0.748    0.000 {isinstance}
    80016    0.619    0.000    1.482    0.000 ./tool/create_downloads.py:182(insert)
      106    0.618    0.006    0.618    0.006 {built-in method read}
    80016    0.501    0.000    0.872    0.000 ./tool/pyexcelerate/DataTypes.py:27(get_type)
   160032    0.490    0.000    0.753    0.000 ./tool/create_downloads.py:78(get_cell_span_content)
    85377    0.477    0.000    0.694    0.000 {method 'extend' of 'list' objects}
     5001    0.307    0.000    1.828    0.000 ./tool/create_downloads.py:318(write_row)
     5001    0.305    0.000    2.914    0.001 ./tool/create_downloads.py:173(write_row)
    80040    0.280    0.000    0.280    0.000 /usr/lib/python2.7/collections.py:54(__setitem__)
    80016    0.277    0.000    1.150    0.000 ./tool/pyexcelerate/Worksheet.py:84(set_cell_value)
248532/247939    0.264    0.000    0.265    0.000 {len}
    80016    0.251    0.000    0.457    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
   160032    0.201    0.000    0.201    0.000 ./tool/create_downloads.py:194(<genexpr>)
     5041    0.187    0.000    0.566    0.000 /usr/lib/python2.7/_abcoll.py:483(update)
     5002    0.163    0.000    1.581    0.000 ./tool/create_downloads.py:404(make_table)
        2    0.148    0.074    0.809    0.405 /usr/lib/python2.7/json/decoder.py:371(raw_decode)
     5001    0.143    0.000    0.607    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
46455/46446    0.132    0.000    0.140    0.000 {method 'encode' of 'unicode' objects}
    81849    0.105    0.000    0.105    0.000 {method 'get' of 'dict' objects}
     5001    0.085    0.000    0.085    0.000 {method 'writerow' of '_csv.writer' objects}
```
And the same, with Excel generation disabled:
```
Mon Dec 16 15:37:54 2013    profile.out2

         1257021 function calls (1250711 primitive calls) in 4.259 seconds

   Ordered by: internal time
   List reduced from 2054 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      106    0.578    0.005    0.578    0.005 {built-in method read}
    80016    0.413    0.000    1.037    0.000 ./tool/create_downloads.py:182(insert)
    85377    0.328    0.000    0.491    0.000 {method 'extend' of 'list' objects}
     5001    0.224    0.000    2.050    0.000 ./tool/create_downloads.py:173(write_row)
    80040    0.207    0.000    0.207    0.000 /usr/lib/python2.7/collections.py:54(__setitem__)
   156785    0.181    0.000    0.226    0.000 {isinstance}
    80016    0.179    0.000    0.277    0.000 ./tool/create_downloads.py:78(get_cell_span_content)
    80016    0.178    0.000    0.313    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
202099/201506    0.173    0.000    0.174    0.000 {len}
   160032    0.150    0.000    0.150    0.000 ./tool/create_downloads.py:194(<genexpr>)
     5041    0.140    0.000    0.424    0.000 /usr/lib/python2.7/_abcoll.py:483(update)
        2    0.110    0.055    0.604    0.302 /usr/lib/python2.7/json/decoder.py:371(raw_decode)
     5001    0.100    0.000    0.419    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
     5002    0.095    0.000    1.255    0.000 ./tool/create_downloads.py:404(make_table)
    81849    0.084    0.000    0.084    0.000 {method 'get' of 'dict' objects}
46455/46446    0.083    0.000    0.093    0.000 {method 'encode' of 'unicode' objects}
     5001    0.058    0.000    0.058    0.000 {method 'writerow' of '_csv.writer' objects}
       12    0.051    0.004    0.051    0.004 {method 'commit' of 'sqlite3.Connection' objects}
     5023    0.050    0.000    0.497    0.000 /usr/lib/python2.7/collections.py:38(__init__)
        3    0.034    0.011    0.034    0.011 {_ssl.sslwrap}
```
Notice the wall-clock time is about 3 seconds quicker. I've filed https://github.com/scraperwiki/spreadsheet-download-tool/issues/64 in response to that.
With Excel still disabled, I then removed the colspan/rowspan code, replacing most of the end of `write_row` (from `i = 0`) with:

```python
output_row = []
for cell in row:
    (rowspan, colspan), content = get_cell_span_content(cell)
    output_row.append(content)
```
It's still a bit slow in `write_row`, but not in `read`. Overall, the colspan version is about a second slower: even when I run it lots of times so the cache is full, it takes at least 3.8 seconds.
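The replacement above boils down to keeping only each cell's content and discarding the spans. A runnable sketch of the same idea as a one-liner (the `get_cell_span_content` here is a stand-in that, like the real one in `create_downloads.py`, returns `((rowspan, colspan), content)`):

```python
def get_cell_span_content(cell):
    # Stand-in for ./tool/create_downloads.py:78 -- the real function
    # computes spans; for cells with no spans it is effectively this.
    return (1, 1), cell

def write_row_simplified(row):
    # Keep only the content of each cell; ignore rowspan/colspan entirely.
    return [get_cell_span_content(cell)[1] for cell in row]

print(write_row_simplified(["a", "b", "c"]))  # -> ['a', 'b', 'c']
```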
```
Mon Dec 16 19:47:38 2013    profile.out2

         841935 function calls (835625 primitive calls) in 2.861 seconds

   Ordered by: internal time
   List reduced from 2051 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      106    0.575    0.005    0.575    0.005 {built-in method read}
    80040    0.197    0.000    0.197    0.000 /usr/lib/python2.7/collections.py:54(__setitem__)
     5001    0.171    0.000    0.898    0.000 ./tool/create_downloads.py:173(write_row)
    80016    0.161    0.000    0.274    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
   156785    0.150    0.000    0.192    0.000 {isinstance}
    80016    0.149    0.000    0.226    0.000 ./tool/create_downloads.py:78(get_cell_span_content)
     5041    0.135    0.000    0.406    0.000 /usr/lib/python2.7/_abcoll.py:483(update)
        2    0.105    0.052    0.579    0.289 /usr/lib/python2.7/json/decoder.py:371(raw_decode)
     5001    0.087    0.000    0.365    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
     5002    0.083    0.000    1.197    0.000 ./tool/create_downloads.py:410(make_table)
    95951    0.077    0.000    0.077    0.000 {method 'append' of 'list' objects}
    81849    0.073    0.000    0.073    0.000 {method 'get' of 'dict' objects}
46455/46446    0.066    0.000    0.076    0.000 {method 'encode' of 'unicode' objects}
     5023    0.047    0.000    0.476    0.000 /usr/lib/python2.7/collections.py:38(__init__)
       12    0.045    0.004    0.045    0.004 {method 'commit' of 'sqlite3.Connection' objects}
     5001    0.042    0.000    0.042    0.000 {method 'writerow' of '_csv.writer' objects}
42067/41474    0.034    0.000    0.035    0.000 {len}
        3    0.026    0.009    0.026    0.009 {_ssl.sslwrap}
  580/144    0.024    0.000    0.096    0.001 /usr/lib/python2.7/sre_parse.py:380(_parse)
     5095    0.024    0.000    0.042    0.000 /usr/lib/python2.7/abc.py:128(__instancecheck__)
```
I've filed this bug for the colspan speed issue: https://github.com/scraperwiki/spreadsheet-download-tool/issues/66
BTW, to run the profiling:

```
python -m cProfile -o profile.out ./tool/create_downloads.py
```
My profile output program:

```python
import pstats
p = pstats.Stats('profile.out2')
p.sort_stats('tottime').print_stats(20)
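The same profiling can also be done entirely in-process, which is handy for quick experiments. A self-contained sketch (`profile.demo` and `hot` are made-up names for the demo, not part of the tool):

```python
import cProfile
import pstats

def hot():
    # A deliberately busy function so the profile has something to show.
    return sum(i * i for i in range(100000))

# Equivalent of `python -m cProfile -o profile.out script.py`, done in-process.
pr = cProfile.Profile()
pr.enable()
hot()
pr.disable()
pr.dump_stats('profile.demo')

p = pstats.Stats('profile.demo')
p.sort_stats('cumtime').print_stats(5)  # sort by cumulative rather than internal time
p.print_callers('genexpr')              # show what calls the hot generator expression
```

`sort_stats('cumtime')` is often the better view for finding which high-level function to attack, while `tottime` (as used above) pinpoints where the CPU is actually burned.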
I've noticed a few search tool instances which are regenerating every hour.
One thing to note: even when something is erroring (e.g. a Twitter rate limit), it will still trigger its own update once an hour. So a factor-of-two or so win might come from only regenerating on success, not on failure.
I'm not sure whether that would have unexpected behaviour for anyone, though.
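A minimal sketch of the "stop rescheduling after a failure" idea (hypothetical function names; the real scheduler hooks in the tool will differ):

```python
import time

def regenerate_until_failure(regenerate, interval=3600, sleep=time.sleep):
    """Run `regenerate` every `interval` seconds, but stop rescheduling as
    soon as one run fails, rather than retrying a known-broken dataset
    (e.g. one stuck on a rate limit) every hour.

    Returns the number of successful runs before the failure."""
    successes = 0
    while True:
        try:
            regenerate()
            successes += 1
        except Exception:
            # Failure: give up instead of burning CPU on an hourly retry.
            return successes
        sleep(interval)
```

The `sleep` parameter is injected so the loop can be tested without waiting; a real deployment would presumably want some recovery path (manual retrigger, exponential backoff) rather than giving up forever.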
I think something is making it use excess CPU, and that it's low-hanging fruit.