pmusaraj closed this issue 6 years ago
This became an urgent issue when it started hampering OCHA efforts in Syria. On an account with 138 surveys, https://kobo.humanitarianresponse.info/assets/?q=asset_type%3Asurvey&limit=200 consistently takes around 30 seconds.
Just listing the assets for the account is very taxing on the backend. All of the following is from `./manage.py shell_plus` in the KPI container.
First, how much memory is the shell using?
```
In [1]: import os
   ...: os.system('top -n 1 -b -p {}'.format(os.getpid()))
   ...:
top - 16:35:43 up 271 days, 18:47,  0 users,  load average: 1.62, 1.88, 2.15
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.9 us,  0.3 sy,  0.0 ni, 92.5 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16431508 total,  7824228 free,  6920400 used,  1686880 buff/cache
KiB Swap:  8388604 total,  7667220 free,   721384 used.  9066152 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29252 root      20   0  435020  91896  15060 S   0.0  0.6   0:01.47 python

Out[1]: 0
```
Let's simulate the asset list request:
```
In [2]: from django.test import RequestFactory
   ...: from django.core.urlresolvers import resolve, reverse
   ...: request_factory = RequestFactory(SERVER_NAME='kobo.humanitarianresponse.info')
   ...: req = request_factory.get(reverse('asset-list'))
   ...: req.user = User.objects.get(username='<snip>')
   ...: req.GET = {'q': 'asset_type:survey', 'limit': '200'}
   ...: %time resp = resolve(req.path).func(req)
   ...:
CPU times: user 17.7 s, sys: 1.6 s, total: 19.3 s
Wall time: 25.8 s
```
How much memory are we using now?
```
In [3]: os.system('top -n 1 -b -p {}'.format(os.getpid()))
top - 16:38:31 up 271 days, 18:50,  0 users,  load average: 0.41, 1.26, 1.87
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.9 us,  0.3 sy,  0.0 ni, 92.5 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16431508 total,  5470644 free,  9268368 used,  1692496 buff/cache
KiB Swap:  8388604 total,  7667220 free,   721384 used.  6718076 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29252 root      20   0 3285076 2.731g  15296 S   0.0 17.4   0:21.45 python

Out[3]: 0
```
Ouch: that's 17.4% of the memory on a server with 16G of RAM and 8G of swap, consumed by a single request. Also, the vast majority of the 25.8 elapsed seconds were spent executing Python code, not waiting for the database.
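As an aside, the same before/after measurement can be made in-process with the stdlib `resource` module instead of shelling out to `top`. A minimal sketch with a synthetic allocation standing in for the view's response (the 50 MiB figure is illustrative, not from this issue):

```python
import resource
import sys

def peak_rss_mib():
    # ru_maxrss is reported in KiB on Linux, in bytes on macOS
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        rss //= 1024
    return rss / 1024.0

before = peak_rss_mib()
blob = b'\x00' * (50 * 1024 * 1024)  # stand-in for a large in-memory payload
after = peak_rss_mib()
print('peak RSS grew by ~{:.0f} MiB'.format(after - before))
```

Note that `ru_maxrss` is a high-water mark, so it only moves when an allocation pushes the process past its previous peak.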
Here's a quick and dirty way to get the user's assets into my local development environment. In the production server's Django shell:

```python
import pickle

u = User.objects.get(username='<snip>')
stuff = {}
stuff['user'] = u
stuff['assets'] = list(Asset.objects.filter(owner=u))
f = open('/tmp/briefcase', 'wb')
f.write(pickle.dumps(stuff))  # writes 186M for this user with 139 assets
f.close()
```
In a local Django shell (with a database I didn't mind stomping all over):

```python
import pickle

f = open('briefcase')
stuff = pickle.loads(f.read())
f.close()
stuff['user'].save()
for a in stuff['assets']: print a, a.save()
# it's nice to be able to log in as the user
a.owner.set_password('something easy')
a.owner.save()
```
Profiling with https://github.com/jazzband/django-debug-toolbar:

- Get the container's IP with `docker-compose exec nginx ip a s` (ignore the `lo` interface, get the IP of the other one); in my case, `172.17.0.10`.
- Add `INTERNAL_IPS = ('172.17.0.10',)` to `settings.py` (change the IP to match the previous step).
- `pip install djdt-flamegraph` (yikes, don't steal my DNA).
- In `settings.py`, add the flame-graph panel (and include all the default panels... yes, there's probably a better way; look at `global_settings`):
```python
DEBUG_TOOLBAR_PANELS = [
    'djdt_flamegraph.FlamegraphPanel',
    'debug_toolbar.panels.versions.VersionsPanel',
    'debug_toolbar.panels.timer.TimerPanel',
    'debug_toolbar.panels.settings.SettingsPanel',
    'debug_toolbar.panels.headers.HeadersPanel',
    'debug_toolbar.panels.request.RequestPanel',
    'debug_toolbar.panels.sql.SQLPanel',
    'debug_toolbar.panels.staticfiles.StaticFilesPanel',
    'debug_toolbar.panels.templates.TemplatesPanel',
    'debug_toolbar.panels.cache.CachePanel',
    'debug_toolbar.panels.signals.SignalsPanel',
    'debug_toolbar.panels.logging.LoggingPanel',
    'debug_toolbar.panels.redirects.RedirectsPanel',
]
```
- Run with `./manage.py runserver --nothreading --noreload` for the flame graph. For other stuff (like SQL profiling), regular `./manage.py runserver` is fine (and auto reloading sure is nice).

At this point, the flame graph isn't terribly enlightening to my eyes:
...but all the duplicated SQL queries give a place to start. Seeing `"kpi_asset"."content"` being retrieved from the database for a list view is a red flag.
It's helpful to omit unneeded fields from a list serializer, but that doesn't mean they won't be retrieved from the database (and deserialized, in the case of `JSONField`s, which are stored as plain text). Some convoluted SQL shows that the widest column is `content`, at ~877K per asset, followed by `report_styles` at ~53K, `_deployment_data` at ~2K, and `summary` at ~1K:
```
root@postgres:/# su -c psql postgres
psql (9.4.17)
Type "help" for help.

postgres=# \c kobotoolbox
You are now connected to database "kobotoolbox" as user "postgres".
kobotoolbox=# \x
Expanded display is on.
kobotoolbox=# select
    avg(length(cast(kpi_asset.id as text))) as id,
    avg(length(cast(name as text))) as name,
    avg(length(cast(date_created as text))) as date_created,
    avg(length(cast(date_modified as text))) as date_modified,
    avg(length(cast(content as text))) as content,
    avg(length(cast(asset_type as text))) as asset_type,
    avg(length(cast(editors_can_change_permissions as text))) as editors_can_change_permissions,
    avg(length(cast(uid as text))) as uid,
    avg(length(cast(owner_id as text))) as owner_id,
    avg(length(cast(parent_id as text))) as parent_id,
    avg(length(cast(summary as text))) as summary,
    avg(length(cast(_deployment_data as text))) as _deployment_data,
    avg(length(cast(report_styles as text))) as report_styles,
    avg(length(cast(settings as text))) as settings,
    avg(length(cast(report_custom as text))) as report_custom,
    avg(length(cast(map_custom as text))) as map_custom,
    avg(length(cast(map_styles as text))) as map_styles
from kpi_asset inner join auth_user on kpi_asset.owner_id = auth_user.id where auth_user.username = '<snip>';
-[ RECORD 1 ]------------------+-----------------------
id                             | 6.5804195804195804
name                           | 44.9440559440559441
date_created                   | 28.8601398601398601
date_modified                  | 28.9160839160839161
content                        | 872571.902097902098
asset_type                     | 5.9790209790209790
editors_can_change_permissions | 4.0000000000000000
uid                            | 22.0000000000000000
owner_id                       | 5.0000000000000000
parent_id                      | 1.00000000000000000000
summary                        | 747.9230769230769231
_deployment_data               | 1769.1048951048951049
report_styles                  | 53081.622377622378
settings                       | 3.9860139860139860
report_custom                  | 2.0000000000000000
map_custom                     | 2.0000000000000000
map_styles                     | 2.0000000000000000
```
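The width of `content` matters beyond the transfer from Postgres: each row's JSON text is parsed in Python when the model instance is loaded. The cost is easy to approximate outside Django with synthetic data (the survey shape below is made up; only the ~870 KiB-per-row and 139-asset figures come from the measurements above):

```python
import json
import time

# Build one synthetic row of roughly the measured average width of
# kpi_asset.content (~870 KiB of JSON text).
row = json.dumps({'survey': [{'type': 'text', 'name': 'q{}'.format(i),
                              'label': 'x' * 100} for i in range(6000)]})
print('{:.0f} KiB per row'.format(len(row) / 1024.0))

start = time.perf_counter()
for _ in range(139):  # one parse per asset in the account
    json.loads(row)
elapsed = time.perf_counter() - start
print('parsed 139 rows in {:.2f} s'.format(elapsed))
```

That's CPU time burned before the serializer even runs, for fields the list view never renders.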
Since `AssetListSerializer` doesn't ask for `content` or `report_styles`, we should make sure they're `defer()`red from the ORM query.
Pitfall: after using `defer()`, `test_collection_can_have_asset` fails because `_count_children_by_kind()` looks for `asset`, but `defer()` has changed the type name to `asset_deferred_content_report_styles`. In other words, `Asset.kind` is flawed: it returns `asset_deferred_content_report_styles` once `defer()` is in play.
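The pitfall can be reproduced without Django: `defer()` swaps in a dynamically created subclass, so any `kind` property derived from the runtime class name breaks. A simplified sketch (the class names mimic Django's deferred-model naming; in real Django code, one option might be `self._meta.concrete_model.__name__.lower()`, which ignores the deferred subclass):

```python
class Asset(object):
    @property
    def kind(self):
        # Fragile: derives the kind from whatever class the instance has.
        return type(self).__name__.lower()

# defer() replaces the model class with a dynamically created subclass,
# roughly like this (heavily simplified):
DeferredAsset = type('Asset_Deferred_content_report_styles', (Asset,), {})

print(Asset().kind)          # -> 'asset'
print(DeferredAsset().kind)  # -> 'asset_deferred_content_report_styles'

class FixedAsset(Asset):
    @property
    def kind(self):
        # Robust: name the kind explicitly instead of reflecting on the class.
        return 'asset'

FixedDeferred = type('Asset_Deferred_content_report_styles', (FixedAsset,), {})
print(FixedDeferred().kind)  # -> 'asset'
```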
API calls to `assets/?q=asset_type%3Asurvey&limit=200` in an account with 200 forms take ~15 secs on a test server. Introduced after the temporary switch to 200 records per API call to fix #1380.