spender-sandbox / cuckoo-modified

Modified edition of cuckoo

webui analysis page lockups when cuckoo has ~20,000 tasks? #82

Open mallorybobalice opened 8 years ago

mallorybobalice commented 8 years ago

Hi,

I guess the really short version is: we're seeing extremely degraded web UI performance with large numbers of tasks in the db. What can we do to further diagnose it (other than python cuckoo.py --clean)?

This affects the django UI search (say 2-3 mins) and individual web UI task result pages (30s-5mins for properly processed files). The api seems fine for json reports of the same tasks. Reporting is mongo, the cuckoo db is mysql. Plenty of CPUs and memory on the server. Latest source: cuckoo-modified, community-modified, apt-get updates, pip lib updates.

Retention is configured but the performance for search and analysis loading is very bad.

bit more info is below

mallorybobalice commented 8 years ago

Server info: large cuckoo db with about 17000 files in it. Ubuntu 12.04.5 LTS server, latest OS packages, latest python libs via pip, and code as of the 31st of April (post Brad's commits today). Reporting is html + json + mongo; the cuckoo db is mysql. The server is a VM but has 8 cores allocated and 24gb ram, and the physical host is dedicated to the cuckoo server + VMs.

The same behaviour is seen with a mid-Feb build. It works ok after python cuckoo.py --clean, but long term we'd like to rely on reporting.py (which we have configured for about a month of retention).

A bit more on the behaviour: the web UI /analysis page (recent) loads, pending loads, and search loads. Search results load but are very slow (about 2-3 mins for filename searches). Individual module/AV result pages load if accessed directly. Extracting the task report via the api works and is quick, e.g. http://xxxx:8080/api/tasks/get/report/17699/

Failed analysis pages open almost instantly, and some completed files seem to open quickly as well, e.g. where the exe is corrupted: Error: Analysis failed: The package "modules.packages.exe" start function raised an error: Unable to execute the initial process, analysis aborted. It's not type specific (pdf, exe, xls, etc).

BUT

for most completed and reported analyses, the page hangs on opening, e.g. http://xx:8080/analysis/16588/ hangs for several minutes.

top looks a bit like this when this happens.

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2063 cuckoo    20   0  936m 119m 7784 S   93  0.5   0:52.32 /usr/bin/python manage.py runserver xxx:8080
 1822 mysql     20   0 1185m  88m 7776 S   16  0.4   1:36.95 /usr/sbin/mysqld
 1562 mongodb   20   0  137g 4.4g 4.3g S    5 18.5   1:00.91 /usr/bin/mongod --config /etc/mongodb.conf
 1912 cuckoo    20   0  603m 121m  11m S    3  0.5   0:18.40 python cuckoo.py --debug
 2091 root      20   0     0    0    0 S    1  0.0   0:00.14 [kworker/u16:2]
    1 root      20   0 24460 2432 1336 S    0  0.0   0:03.04 /sbin/init

Most of the time it does eventually load the analysis page. For the entries that hang there's nothing interesting in chrome: it sends the request and hangs waiting.

I can't see anything interesting in /var/log/upstart/cuckoomain.log or /var/log/upstart/cuckooresproc.log.

/var/log/upstart/cuckooweb.log has a bunch of RemovedInDjango110Warning messages, but otherwise just:

April 01, 2016 - 20:29:01
Django version 1.9.4, using settings 'web.settings'
Starting development server at http://xx:8080/
Quit the server with CONTROL-C.

(previously was using 1.9.1 - same issue)

cuckooweb is:
chdir xxx/web
exec python manage.py runserver xx:8080

I hoped to get a bit more debug info with ./manage.py runserver --verbosity 3 but no luck

It seems to only log requests after they stop hanging? <-- is that not the right way of getting debug info out of it?
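(One thing I might try, to at least see where each request's time goes: a tiny timing middleware. A rough sketch on my side; the module path, logger name and settings entry are my own guesses, not something shipped with cuckoo, but Django 1.9 still uses the old-style MIDDLEWARE_CLASSES hooks:)

# web/web/middleware.py (path is my choice) - log how long each request takes
import time
import logging

logger = logging.getLogger("request_timer")

class RequestTimerMiddleware(object):
    def process_request(self, request):
        request._timer_start = time.time()

    def process_response(self, request, response):
        start = getattr(request, "_timer_start", None)
        if start is not None:
            logger.warning("%s took %.1fs", request.path, time.time() - start)
        return response

# then append "web.middleware.RequestTimerMiddleware" to MIDDLEWARE_CLASSES in web/settings.py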

mallorybobalice commented 8 years ago

retention is working

cat log/process.log.1 | grep retention | wc
12522  176810 2490182

2016-04-01 15:19:36,126 [modules.reporting.retention] DEBUG: Task #17651 deleting /xxxx/storage/analyses/6589/dump_sorted.pcap due to retention quota
.....

settings are

[retention]
enabled = yes
memory = 5
procmemory = 5
pcap = 10
sortedpcap = 21
bsonlogs = 10
dropped = 10
screencaps = 21
reports = 21
mongo = 90
elastic = no
malheur = no    # tho we do have it configured

disk is stable at about 67% of 500gb.

mallorybobalice commented 8 years ago

Anyone? Otherwise I'll probably have to cuckoo.py --clean it, which is a bit of a pity after waiting for so long. I was hoping to get some advice on either figuring out why the issue occurs and whether it's a bug, or some basic tuning anyone's done for mysql / mongo (presumably mysql is the bottleneck here, given the api returns task info nearly instantly)...

spender-sandbox commented 8 years ago

This would probably be a good start: https://docs.mongodb.org/manual/tutorial/manage-the-database-profiler/

mallorybobalice commented 8 years ago

Thanks Brad, appreciate the reply. That's a better link re the profiler than what I found in the other 'issue'.

If you don't mind me asking:

a) any particular output you're looking for, or just read the article, get a feel for it and, I suppose, disable cuckoomain, cuckooresproc and cuckooapi leaving just cuckooweb, toggle the profiling flag, refresh an offending task page, wait for a bit, then post you the output of db.system.profile.find().limit(10).sort( { ts : -1 } ).pretty() (roughly the sketch below)?

b) why do you think it's mongodb and not mysql, given mysql cpu usage is higher and stays that way for the duration of the query + freeze-up? (Though I'm not certain how the mysql db is used and thought the reporting info just went to mongo.) Or is the mysql cpu usage alongside it a red herring? Any point trying to break in and inspect what python is doing while it's consuming the most cpu? Or just relax, post the mongo output and start there :) ?
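(Concretely, I guess the toggle + dump via pymongo would look roughly like the below; the 100ms slowms threshold is just a number I picked:)

from pymongo import MongoClient

# enable profiling on the cuckoo db (level 1 = only log operations slower than slowms),
# then dump the ten most recent profiled operations; 100ms is an arbitrary threshold
db = MongoClient("localhost", 27017)["cuckoo"]
db.command("profile", 1, slowms=100)

for op in db["system.profile"].find().sort("ts", -1).limit(10):
    print(op.get("op"), op.get("ns"), op.get("millis"))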

spender-sandbox commented 8 years ago

Mysql is storing next to no data, I would be surprised if that was the limiting factor. But run some profiling on that as well so we can see what's going on.

-Brad

mallorybobalice commented 8 years ago

Sure, can do both tomorrow. Would you mind sharing a couple of lines on what to do for mysql profiling as well? (If you have something handy-ish; not sure whether you'd have mysql or postgres in your usual use.)

By the way, is there any specific reason why I'd want to consider switching from mysql to postgresql (not really for this, just in general), or no strong reasons you can think of (since it sounds like it hardly stores any data, probably not)? (Ownership by oracle aside, I suppose.)
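(My current guess for mysql would be its built-in slow query log, something roughly like the below; the threshold and log path are illustrative only, and SET GLOBAL needs a privileged account, so putting the same settings in my.cnf may be saner:)

import MySQLdb

# illustrative only: turn on the slow query log at runtime; the same three settings
# can also live under [mysqld] in my.cnf instead
conn = MySQLdb.connect(host="localhost", user="root", passwd="...", db="cuckoo")
cur = conn.cursor()
cur.execute("SET GLOBAL slow_query_log = 'ON'")
cur.execute("SET GLOBAL long_query_time = 1")  # log anything slower than 1 second
cur.execute("SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log'")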

spender-sandbox commented 8 years ago

I don't know anything about mysql profiling, your guess would be as good as mine. I wouldn't recommend switching to postgresql, I think there are some unfixed problems with support for it.

-Brad

mallorybobalice commented 8 years ago

I see, thanks Brad.

I suspect the issue is maybe related to the 'similar' tab list.

What determines the 'similar' tab list? (I thought malheur, but... I don't know; for me it seems to display most pdfs as 'similar', and all docs are similar to the rest of the docs, regardless of malscore, iocs or signatures. That doesn't seem right?)
It seems like most of the reports that take ages to load are the ones where the similar count is high.

Curiously, the more benign files end up in the db, the longer other benign files take to load. In my case most pdfs are similar to most other pdfs (mostly benign, at 14700 files)... most docs are similar to most of the other 5000 docs, regardless of signatures, malscore or indicators... Docs load a bit quicker. A specific exe triggering Hawkeye only has 4 other similar files and loads almost instantly.

The output of mongoexport -d cuckoo -c system.profile --out export is attached as exportPP.txt. Please let me know if you want more info.

spender-sandbox commented 8 years ago

There, just added something that should help with that. With that volume of samples though, Malheur is going to take a long time on each analysis.

-Brad

mallorybobalice commented 8 years ago

Thanks Brad; for loading individual tasks, that definitely helps.

a couple more questions:

a) does the number of similar files (nearly all files per filetype, for low-malscore files) seem like a concern? (For more specific and malicious files it seems lower. Hmm.)

b) For the searching: I had a quick look at the code and basically it seems like any extended search uses regex, so https://docs.mongodb.org/manual/reference/operator/query/regex/ is required reading... When I don't anchor the regexes as per that doc, I get 90+ second filename searches for a specific filename (whereas you can restrict it with ^afilename.doc). IMO that should be part of a footnote on the search page (see the sketch after these questions).

c) > "Malheur is going to take a long time on each analysis" - do you mean for the display of a task, or during the initial reporting when malheur is invoked?

d) for the limit on similar numbers, it takes the last X by date (or task number as a proxy), correct? (Not by 'similarity'.)
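For b), a rough illustration of the difference via pymongo (the "cuckoo" db name matches the mongoexport above, but the collection and field names are my guesses at the report layout, and an index on the field still needs to exist for the anchored form to pay off):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["cuckoo"]

# unanchored regex: mongo has to scan every stored report
slow = db.analysis.find({"target.file.name": {"$regex": "afilename.doc"}})

# anchored, case-sensitive prefix regex: can walk an index on the field
fast = db.analysis.find({"target.file.name": {"$regex": "^afilename\\.doc"}})

# creating that index, if it doesn't exist yet:
# db.analysis.create_index("target.file.name")

for report in fast.limit(5):
    print(report["info"]["id"], report["target"]["file"]["name"])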