Slow API call to asset list when limit set to 200

pmusaraj commented 7 years ago

API calls to assets/?q=asset_type%3Asurvey&limit=200 in an account with 200 forms take ~15 secs on a test server.

Introduced after the temporary switch to 200 records per API call to fix #1380.

jnm commented 6 years ago

This became an urgent issue when it started hampering OCHA efforts in Syria. On an account with 138 surveys, https://kobo.humanitarianresponse.info/assets/?q=asset_type%3Asurvey&limit=200 consistently takes around 30 seconds.

Just listing the assets for the account is very taxing on the backend. All from ./manage.py shell_plus in the KPI container:

First, how much memory is the shell using?

In [1]: import os
   ...: os.system('top -n 1 -b -p {}'.format(os.getpid()))
   ...: 
top - 16:35:43 up 271 days, 18:47,  0 users,  load average: 1.62, 1.88, 2.15
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.9 us,  0.3 sy,  0.0 ni, 92.5 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16431508 total,  7824228 free,  6920400 used,  1686880 buff/cache
KiB Swap:  8388604 total,  7667220 free,   721384 used.  9066152 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29252 root      20   0  435020  91896  15060 S   0.0  0.6   0:01.47 python
Out[1]: 0

Let's simulate the asset list request:

In [2]: from django.test import RequestFactory
   ...: from django.core.urlresolvers import resolve, reverse
   ...: request_factory = RequestFactory(SERVER_NAME='kobo.humanitarianresponse.info')
   ...: req = request_factory.get(reverse('asset-list'))
   ...: req.user = User.objects.get(username='<snip>')
   ...: req.GET = {'q': 'asset_type:survey', 'limit': '200'}
   ...: %time resp = resolve(req.path).func(req)
   ...: 
CPU times: user 17.7 s, sys: 1.6 s, total: 19.3 s
Wall time: 25.8 s

How much memory are we using now?

In [3]: os.system('top -n 1 -b -p {}'.format(os.getpid()))
top - 16:38:31 up 271 days, 18:50,  0 users,  load average: 0.41, 1.26, 1.87
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.9 us,  0.3 sy,  0.0 ni, 92.5 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16431508 total,  5470644 free,  9268368 used,  1692496 buff/cache
KiB Swap:  8388604 total,  7667220 free,   721384 used.  6718076 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29252 root      20   0 3285076 2.731g  15296 S   0.0 17.4   0:21.45 python
Out[3]: 0

Ouch, that's 17.4% of the memory on a server with 16G of RAM and 8G of swap. Also, the vast majority of the 25.8 elapsed seconds were used actually executing Python code, not waiting for the database.

jnm commented 6 years ago

Here's a quick and dirty way to get the user's assets into my local development environment. In production server's Django shell:

u = User.objects.get(username='<snip>')
stuff = {}
stuff['user'] = u
stuff['assets'] = list(Asset.objects.filter(owner=u))
f = open('/tmp/briefcase', 'w')
f.write(pickle.dumps(stuff)) # writes 186M for this user with 139 assets
f.close()

In local Django shell (with a database I didn't mind stomping all over):

f = open('briefcase')
stuff = pickle.loads(f.read())
f.close()
stuff['user'].save()
for a in stuff['assets']: print a, a.save()
# it's nice to be able to log in as the user
a.owner.set_password('something easy')
a.owner.save()

jnm commented 6 years ago

Profiling with https://github.com/jazzband/django-debug-toolbar:

Assuming the development environment uses Docker, get the IP of the NGINX container: docker-compose exec nginx ip a s (ignore the lo interface, get the IP of the other one);
- mine is 172.17.0.10;
Set INTERNAL_IPS = ('172.17.0.10',) (change the IP to match the previous step) in settings.py;
Install https://github.com/23andMe/djdt-flamegraph (pip install djdt-flamegraph) (yikes, don't steal my DNA);

In settings.py, add the flame-graph panel (and include all the default panels... yes, there's probably a better way; look at global_settings):

DEBUG_TOOLBAR_PANELS = [
    'djdt_flamegraph.FlamegraphPanel',
    'debug_toolbar.panels.versions.VersionsPanel',
    'debug_toolbar.panels.timer.TimerPanel',
    'debug_toolbar.panels.settings.SettingsPanel',
    'debug_toolbar.panels.headers.HeadersPanel',
    'debug_toolbar.panels.request.RequestPanel',
    'debug_toolbar.panels.sql.SQLPanel',
    'debug_toolbar.panels.staticfiles.StaticFilesPanel',
    'debug_toolbar.panels.templates.TemplatesPanel',
    'debug_toolbar.panels.cache.CachePanel',
    'debug_toolbar.panels.signals.SignalsPanel',
    'debug_toolbar.panels.logging.LoggingPanel',
    'debug_toolbar.panels.redirects.RedirectsPanel',
]

To see the flame graph, it's annoyingly necessary to ./manage.py runserver --nothreading --noreload. For other stuff (like SQL profiling), regular ./manage.py runserver is fine (and auto reloading sure is nice).

jnm commented 6 years ago

At this point, the flame graph isn't terribly enlightening to my eyes: screenshot_2018-07-18_18-29-04

...but all the duplicated SQL queries give a place to start: screenshot_2018-07-18_18-30-30

Seeing "kpi_asset"."content" being retrieved from the database for a list view is a red flag:

jnm commented 6 years ago

It's helpful to omit unneeded fields from a list serializer, but that doesn't mean they won't be retrieved from the database (and deserialized in the case of JSONFields, which are stored as plain text). Some convoluted SQL shows that the widest columns are content, at ~877K per asset, followed by report_styles at ~53K, _deployment_data at ~2K, and summary at ~1K:

root@postgres:/# su -c psql postgres
psql (9.4.17)
Type "help" for help.

postgres=# \c kobotoolbox
You are now connected to database "kobotoolbox" as user "postgres".
kobotoolbox=# \x
Expanded display is on.
kobotoolbox=# select                                                                          
avg(length(cast(kpi_asset.id as text))) as id,                                            
avg(length(cast(name as text))) as name,                                        
avg(length(cast(date_created as text))) as date_created,                        
avg(length(cast(date_modified as text))) as date_modified,                      
avg(length(cast(content as text))) as content,                                  
avg(length(cast(asset_type as text))) as asset_type,                            
avg(length(cast(editors_can_change_permissions as text))) as editors_can_change_permissions,
avg(length(cast(uid as text))) as uid,                                          
avg(length(cast(owner_id as text))) as owner_id,                                
avg(length(cast(parent_id as text))) as parent_id,                              
avg(length(cast(summary as text))) as summary,                                  
avg(length(cast(_deployment_data as text))) as _deployment_data,                
avg(length(cast(report_styles as text))) as report_styles,                      
avg(length(cast(settings as text))) as settings,                                
avg(length(cast(report_custom as text))) as report_custom,                      
avg(length(cast(map_custom as text))) as map_custom,                            
avg(length(cast(map_styles as text))) as map_styles                             
from kpi_asset inner join auth_user on kpi_asset.owner_id = auth_user.id where auth_user.username = '<snip>';
-[ RECORD 1 ]------------------+-----------------------
id                             | 6.5804195804195804
name                           | 44.9440559440559441
date_created                   | 28.8601398601398601
date_modified                  | 28.9160839160839161
content                        | 872571.902097902098
asset_type                     | 5.9790209790209790
editors_can_change_permissions | 4.0000000000000000
uid                            | 22.0000000000000000
owner_id                       | 5.0000000000000000
parent_id                      | 1.00000000000000000000
summary                        | 747.9230769230769231
_deployment_data               | 1769.1048951048951049
report_styles                  | 53081.622377622378
settings                       | 3.9860139860139860
report_custom                  | 2.0000000000000000
map_custom                     | 2.0000000000000000
map_styles                     | 2.0000000000000000

Since AssetListSerializer doesn't ask for content or report_styles, we should make sure they're defer()red from the ORM query.

jnm commented 6 years ago

Pitfall: after using defer(), test_collection_can_have_asset is failing because ~~_count_children_by_kind() is looking for asset, but defer() has changed the type to asset_deferred_content_report_styles~~ Asset.kind is flawed and returns asset_deferred_content_report_styles after using defer().

kobotoolbox / kpi

Slow API call to asset list when limit set to 200 #1422