pombreda / amcat

Automatically exported from code.google.com/p/amcat

Performance of extracting a large number of jobs #408

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
If I extract a large number of jobs (in this case, all jobs in project 29), it
takes quite a while, even with a small number of actual rows (<1000s). On the
'runserver' I put a trace in the 'getrows' function; this function alone took
around 2 minutes to complete.

The whole operation in the nginx -dev environment takes 72 seconds:

[pid: 19069|app: 0|req: 4/4] 77.126.99.91 () {50 vars in 1323 bytes} [Tue Apr 23 11:39:48 2013] POST /navigator/project/29/codingjob/export-options?export_level=0&use_session=1 => generated 85848 bytes in 72710 msecs (HTTP/1.1 201) 3 headers in 222 bytes (2 switches on core 0)

I thought it might be the metadata, since we/you haven't optimized that yet, but
deselecting all metadata columns doesn't really help:

[pid: 19069|app: 0|req: 5/5] 77.126.99.91 () {50 vars in 1323 bytes} [Tue Apr 23 11:47:37 2013] POST /navigator/project/29/codingjob/export-options?export_level=0&use_session=1 => generated 56928 bytes in 69991 msecs (HTTP/1.1 201) 3 headers in 222 bytes (2 switches on core 0)
[pid: 19069|app: 0|req: 6/6] 77.126.99.91 () {50 vars in 1323 bytes} [Tue Apr 23 11:49:58 2013] POST /navigator/project/29/codingjob/export-options?export_level=0&use_session=1 => generated 46096 bytes in 70199 msecs (HTTP/1.1 201) 3 headers in 222 bytes (2 switches on core 0)

Original issue reported on code.google.com by vanatteveldt@gmail.com on 23 Apr 2013 at 9:53

GoogleCodeExporter commented 9 years ago
Increasing the priority, as I get a timeout if I extract these jobs with
parents=2 for all codebook fields.

Original comment by vanatteveldt@gmail.com on 23 Apr 2013 at 5:08

GoogleCodeExporter commented 9 years ago
(The odd thing is that it completely maxes out the CPU on the web server; you
would expect that if there is a problem, it would be in the database...)

Original comment by vanatteveldt@gmail.com on 23 Apr 2013 at 5:11

GoogleCodeExporter commented 9 years ago
OK, there is both a CPU problem and a (smaller) database problem.

I added trace prints that include the number of queries for the preceding job.
Jobs without codings produce 6 queries. Jobs with codings (and not even very
many: job 409, for example, has 7 (!) coded articles) produce hundreds of queries:

wva@amcat3:~$ python extract.py > /tmp/data2.csv
[2013-04-23 20:58:05 extract.py:39 INFO] Job 0/103: 407, nqueries=2
[2013-04-23 20:58:05 extract.py:39 INFO] Job 1/103: 408, nqueries=6
[2013-04-23 20:58:06 extract.py:39 INFO] Job 2/103: 409, nqueries=248
[2013-04-23 20:58:07 extract.py:39 INFO] Job 3/103: 411, nqueries=414
[2013-04-23 20:58:07 extract.py:39 INFO] Job 4/103: 412, nqueries=6
[2013-04-23 20:58:07 extract.py:39 INFO] Job 5/103: 413, nqueries=6
[2013-04-23 20:58:07 extract.py:39 INFO] Job 6/103: 414, nqueries=6
[2013-04-23 20:58:09 extract.py:39 INFO] Job 7/103: 415, nqueries=399
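
A per-job query count like the one in the log above can be obtained by hooking
statement execution. The following is a minimal sketch, not the code from
extract.py: it assumes Python 3 and uses the standard-library sqlite3 module
(set_trace_callback) as a stand-in for the real database connection.
count_queries is a hypothetical helper name.

```python
import sqlite3


def count_queries(conn, work):
    """Run work(conn) and return (result, number of SQL statements traced).

    Hypothetical helper for illustration; `conn` is a sqlite3 connection.
    """
    statements = []
    # The callback receives the SQL text of each statement that is executed.
    # Note: when writing, sqlite3 may also report implicit BEGIN/COMMIT
    # statements, so counts are most meaningful for read-only work.
    conn.set_trace_callback(statements.append)
    try:
        result = work(conn)
    finally:
        conn.set_trace_callback(None)  # stop tracing again
    return result, len(statements)
```

Wrapping the extraction of a single job in such a helper and logging the
returned count gives the kind of per-job figure shown in the trace.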

There is also a CPU problem: every time I killed the process, it was in
get_parents. I am going to work around this for now with some memoization; it
still needs to be fixed properly at some point.

File "/home/wva/amcat/amcat/models/coding/codebook.py", line 267, in _get_parent
    for child, parent in hierarchy.iteritems():
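
The memoization workaround mentioned above could look roughly like this. It is
a sketch under assumptions, not the actual change to codebook.py: it is written
for Python 3, ParentLookup is a hypothetical class, and the hierarchy is assumed
to be available as (child, parent) pairs that are scanned linearly, as the
traceback suggests.

```python
class ParentLookup:
    """Hypothetical helper that memoizes parent lookups."""

    def __init__(self, pairs):
        self._pairs = list(pairs)  # (child, parent) pairs
        self._cache = {}           # code -> parent, filled on first lookup
        self.scans = 0             # number of full scans, for illustration only

    def get_parent(self, code):
        if code not in self._cache:
            # Only scan the hierarchy the first time a code is requested.
            self.scans += 1
            self._cache[code] = next(
                (parent for child, parent in self._pairs if child == code),
                None,  # roots and unknown codes have no parent
            )
        return self._cache[code]
```

Repeated lookups of the same code then cost a dictionary access instead of a
scan. The cache has to be discarded whenever the hierarchy changes.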

Original comment by vanatteveldt@gmail.com on 23 Apr 2013 at 7:47

GoogleCodeExporter commented 9 years ago
Scaling back to medium priority.

CPU issue was an infinite loop in getting the parent with a cyclical hierarchy, 
resolved by adding a loop detector. 
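
A loop detector of this kind can be sketched as follows. This is an
illustration in Python 3, not the code that was committed: get_ancestors is a
hypothetical function, and the hierarchy is assumed to be a dict that maps each
child to its parent (None or absent for a root).

```python
def get_ancestors(hierarchy, code):
    """Yield the ancestors of `code`, nearest first.

    Raises ValueError instead of looping forever on a cyclic hierarchy.
    """
    seen = {code}  # every code visited so far on the way up
    parent = hierarchy.get(code)
    while parent is not None:
        if parent in seen:
            raise ValueError("cycle in hierarchy at %r" % (parent,))
        seen.add(parent)
        yield parent
        parent = hierarchy.get(parent)
```

For example, list(get_ancestors({"c": "b", "b": "a"}, "c")) walks up from "c"
to the root, while a hierarchy such as {"a": "b", "b": "a"} raises ValueError
as soon as the walk returns to a code it has already visited.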

Still too many queries, but performance on the set we're using (~100 jobs, ~500 
coded articles) is acceptable.
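
Hundreds of queries for a handful of coded articles usually point at one query
per object. The sketch below shows the general remedy, fetching everything in
one query and grouping in Python. It is an assumption-laden illustration, not
AmCAT code: it uses Python 3 with the standard-library sqlite3 module, and the
coding table, its columns, and both function names are hypothetical.

```python
import sqlite3


def codings_per_article_naive(conn, article_ids):
    """One query per article: the query count grows with the number of articles."""
    return {
        aid: [row[0] for row in conn.execute(
            "SELECT value FROM coding WHERE article_id = ? ORDER BY id", (aid,))]
        for aid in article_ids
    }


def codings_per_article_batched(conn, article_ids):
    """One query for all articles; rows are grouped by article afterwards."""
    article_ids = list(article_ids)
    result = {aid: [] for aid in article_ids}
    if not article_ids:
        return result  # avoid an empty IN () clause
    placeholders = ",".join("?" for _ in article_ids)
    rows = conn.execute(
        "SELECT article_id, value FROM coding "
        "WHERE article_id IN (%s) ORDER BY id" % placeholders,
        article_ids,
    )
    for article_id, value in rows:
        result[article_id].append(value)
    return result
```

Both functions return the same mapping, but the batched version issues a single
query regardless of how many articles a job contains. Very long ID lists may
need to be split into chunks, since databases limit the number of bound
parameters. If the export goes through the Django ORM, select_related and
prefetch_related on the queryset address the same pattern.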

Original comment by vanatteveldt@gmail.com on 23 Apr 2013 at 8:21

GoogleCodeExporter commented 9 years ago
For when you are back in NL: is that a database you can make a dump of? Then I
can (easily) take a look at it.

Original comment by Martijn....@gmail.com on 25 Apr 2013 at 8:50

GoogleCodeExporter commented 9 years ago

Original comment by Martijn....@gmail.com on 25 Apr 2013 at 9:13