scraperwiki / spreadsheet-download-tool

A ScraperWiki plugin for downloading data from a box as a CSV or Excel spreadsheet
BSD 2-Clause "Simplified" License
2 stars 1 forks source link

Profile it and identify obvious bottlenecks #65

Open frabcus opened 10 years ago

frabcus commented 10 years ago

I think something is making it use excess CPU that is low hanging.

frabcus commented 10 years ago

Profiling doing just one batch of 500. Using Python profiler (http://docs.python.org/2/library/profile.html) sorted by tottime.

Mon Dec 16 15:25:07 2013    profile.out

         1788605 function calls (1782295 primitive calls) in 7.359 seconds

   Ordered by: internal time
   List reduced from 2061 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   396882    0.692    0.000    0.748    0.000 {isinstance}
    80016    0.619    0.000    1.482    0.000 ./tool/create_downloads.py:182(insert)
      106    0.618    0.006    0.618    0.006 {built-in method read}
    80016    0.501    0.000    0.872    0.000 ./tool/pyexcelerate/DataTypes.py:27(get_type)
   160032    0.490    0.000    0.753    0.000 ./tool/create_downloads.py:78(get_cell_span_content)
    85377    0.477    0.000    0.694    0.000 {method 'extend' of 'list' objects}
     5001    0.307    0.000    1.828    0.000 ./tool/create_downloads.py:318(write_row)
     5001    0.305    0.000    2.914    0.001 ./tool/create_downloads.py:173(write_row)
    80040    0.280    0.000    0.280    0.000 /usr/lib/python2.7/collections.py:54(__setitem__)
    80016    0.277    0.000    1.150    0.000 ./tool/pyexcelerate/Worksheet.py:84(set_cell_value)
248532/247939    0.264    0.000    0.265    0.000 {len}
    80016    0.251    0.000    0.457    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
   160032    0.201    0.000    0.201    0.000 ./tool/create_downloads.py:194(<genexpr>)
     5041    0.187    0.000    0.566    0.000 /usr/lib/python2.7/_abcoll.py:483(update)
     5002    0.163    0.000    1.581    0.000 ./tool/create_downloads.py:404(make_table)
        2    0.148    0.074    0.809    0.405 /usr/lib/python2.7/json/decoder.py:371(raw_decode)
     5001    0.143    0.000    0.607    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
46455/46446    0.132    0.000    0.140    0.000 {method 'encode' of 'unicode' objects}
    81849    0.105    0.000    0.105    0.000 {method 'get' of 'dict' objects}
     5001    0.085    0.000    0.085    0.000 {method 'writerow' of '_csv.writer' objects}
frabcus commented 10 years ago

And the same with Excel generation disabled

Mon Dec 16 15:37:54 2013    profile.out2

         1257021 function calls (1250711 primitive calls) in 4.259 seconds

   Ordered by: internal time
   List reduced from 2054 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      106    0.578    0.005    0.578    0.005 {built-in method read}
    80016    0.413    0.000    1.037    0.000 ./tool/create_downloads.py:182(insert)
    85377    0.328    0.000    0.491    0.000 {method 'extend' of 'list' objects}
     5001    0.224    0.000    2.050    0.000 ./tool/create_downloads.py:173(write_row)
    80040    0.207    0.000    0.207    0.000 /usr/lib/python2.7/collections.py:54(__setitem__)
   156785    0.181    0.000    0.226    0.000 {isinstance}
    80016    0.179    0.000    0.277    0.000 ./tool/create_downloads.py:78(get_cell_span_content)
    80016    0.178    0.000    0.313    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
202099/201506    0.173    0.000    0.174    0.000 {len}
   160032    0.150    0.000    0.150    0.000 ./tool/create_downloads.py:194(<genexpr>)
     5041    0.140    0.000    0.424    0.000 /usr/lib/python2.7/_abcoll.py:483(update)
        2    0.110    0.055    0.604    0.302 /usr/lib/python2.7/json/decoder.py:371(raw_decode)
     5001    0.100    0.000    0.419    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
     5002    0.095    0.000    1.255    0.000 ./tool/create_downloads.py:404(make_table)
    81849    0.084    0.000    0.084    0.000 {method 'get' of 'dict' objects}
46455/46446    0.083    0.000    0.093    0.000 {method 'encode' of 'unicode' objects}
     5001    0.058    0.000    0.058    0.000 {method 'writerow' of '_csv.writer' objects}
       12    0.051    0.004    0.051    0.004 {method 'commit' of 'sqlite3.Connection' objects}
     5023    0.050    0.000    0.497    0.000 /usr/lib/python2.7/collections.py:38(__init__)
        3    0.034    0.011    0.034    0.011 {_ssl.sslwrap}

Notice the wall clock time is 3 seconds quicker. I've filed https://github.com/scraperwiki/spreadsheet-download-tool/issues/64 in response to that.

frabcus commented 10 years ago

With Excel still disabled... I then removed the colspan/rowspan code. Replacing most of the end of write_row (from i = 0) with

        output_row = []
        for cell in row:
            (rowspan, colspan), content = get_cell_span_content(cell)
            output_row.append(content)

It's still a bit slow in write_row, but not in read. And overall it is about a second longer - even when I run the version with the colspan lots of times so the cache is full it takes least 3.8 seconds.

Mon Dec 16 19:47:38 2013    profile.out2

         841935 function calls (835625 primitive calls) in 2.861 seconds

   Ordered by: internal time
   List reduced from 2051 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      106    0.575    0.005    0.575    0.005 {built-in method read}
    80040    0.197    0.000    0.197    0.000 /usr/lib/python2.7/collections.py:54(__setitem__)
     5001    0.171    0.000    0.898    0.000 ./tool/create_downloads.py:173(write_row)
    80016    0.161    0.000    0.274    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
   156785    0.150    0.000    0.192    0.000 {isinstance}
    80016    0.149    0.000    0.226    0.000 ./tool/create_downloads.py:78(get_cell_span_content)
     5041    0.135    0.000    0.406    0.000 /usr/lib/python2.7/_abcoll.py:483(update)
        2    0.105    0.052    0.579    0.289 /usr/lib/python2.7/json/decoder.py:371(raw_decode)
     5001    0.087    0.000    0.365    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
     5002    0.083    0.000    1.197    0.000 ./tool/create_downloads.py:410(make_table)
    95951    0.077    0.000    0.077    0.000 {method 'append' of 'list' objects}
    81849    0.073    0.000    0.073    0.000 {method 'get' of 'dict' objects}
46455/46446    0.066    0.000    0.076    0.000 {method 'encode' of 'unicode' objects}
     5023    0.047    0.000    0.476    0.000 /usr/lib/python2.7/collections.py:38(__init__)
       12    0.045    0.004    0.045    0.004 {method 'commit' of 'sqlite3.Connection' objects}
     5001    0.042    0.000    0.042    0.000 {method 'writerow' of '_csv.writer' objects}
42067/41474    0.034    0.000    0.035    0.000 {len}
        3    0.026    0.009    0.026    0.009 {_ssl.sslwrap}
  580/144    0.024    0.000    0.096    0.001 /usr/lib/python2.7/sre_parse.py:380(_parse)
     5095    0.024    0.000    0.042    0.000 /usr/lib/python2.7/abc.py:128(__instancecheck__)
frabcus commented 10 years ago

I've filed this bug for the colspan speed hting https://github.com/scraperwiki/spreadsheet-download-tool/issues/66

frabcus commented 10 years ago

BTW, to run profiling

python -m cProfile -o profile.out ./tool/create_downloads.py 

my profile output program:

import pstats
p = pstats.Stats('profile.out2')
p.sort_stats('tottime').print_stats(20)
frabcus commented 10 years ago

I've noticed a few search tool instances which are regenerating every hour.

One thing to note is that even when something is erroring (e.g. Twitter rate limit) it will still cause its own update once an hour. So a factor of maybe 2 or so win might be to only regenerate on success, not failure.

Not sure if that would have unexpected behaviour for anyone though.