spender-sandbox / cuckoo-modified

Modified edition of cuckoo

malheur report module takes too long for more than 10k analyses #335

Open SeanKim777 opened 7 years ago

SeanKim777 commented 7 years ago

Hi, Issue #193 seems to be caused by the malheur.py reporting module (check the source here).

I have found that generating the malheur report takes too long. A malheur dataset of 14,282 reports (== number of analyses) takes more than 2 minutes, and it is expected to take longer for larger datasets.

```
$ ls -l ./test_dataset/ | wc -l
14282
$ time ./malheur -c ~/cuckoo/conf/malheur.conf -o ./test_increment.txt cluster ./test_dataset/

real    2m6.502s
user    6m27.744s
sys     0m1.729s
```

malheur already seems to use multiple CPUs while executing, judging by CPU usage and the user time above, so it may be hard to improve performance by updating malheur. (Tested with malheur downloaded and compiled from the malheur git repo today, 25 Oct 2016.)

So if there are more than 10K analyses, the malheur report module needs to run on a regular basis rather than after each analysis, like the retention module.

Any suggestion?

mallorybobalice commented 7 years ago

https://github.com/rieck/malheur/commits/master

The October commit is not meaningful.

I've been running the commit prior to it on 30k/week rolling file sets for a while, and yes, malheur takes longer than processing and reporting (in fact, most of what my server seems to be doing is freezing on static PDF analysis until timeout, and running malheur).

> the malheur report module needs to run on a regular basis rather than after each analysis, like the retention module.

We need to talk to @spender-sandbox and @rieck

I'm unclear if we run it in incremental mode or not, and if we can. See https://github.com/rieck/malheur/issues/12

I'm unfamiliar with malheur internals, so if it re-parses all 30,000 reports (or md5.txt files) for every file, that probably explains why.

mallorybobalice commented 7 years ago

Also unclear: whether it classifies using the large reports, the signatures, or the summary reports.

rieck commented 7 years ago

Hi all,

it's cool to see that you are looking at Malheur!

I am not sure what you are actually trying to achieve. Clustering 14k behavior reports in about 2 minutes is not that bad. When we implemented the tool in 2010, similar experiments took much longer (see page 22 here: https://www.tu-braunschweig.de/Medien-DB/sec/pubs/2011-jcs.pdf).

The tool has been designed to run in an incremental mode, where you would take the catch of the day (freshly analyzed reports) and pass it to the tool for finding similarities with old samples as well as identifying novel clusters. Depending on the amount of data, this will take some time. In comparison to the actual sandbox analysis time, however, this should be marginal.
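A minimal sketch of that incremental flow, wrapping the malheur CLI from Python. The state directory, config path, and helper name here are hypothetical; the flags match the man page syntax used later in this thread:

```python
# Sketch: feed the "catch of the day" to malheur in incremental mode.
# STATE_DIR and process_batch() are hypothetical; the -c/-s/-o/--reset
# flags follow the malheur man page examples quoted below.
import os
import subprocess

STATE_DIR = "/var/lib/malheur/state"   # persistent clustering state
CONFIG = os.path.expanduser("~/cuckoo/conf/malheur.conf")

def process_batch(report_dir, out_file, first_run=False):
    """Run malheur incrementally on one batch of fresh reports."""
    cmd = ["malheur", "-c", CONFIG, "-s", STATE_DIR, "-o", out_file, "-v"]
    if first_run:
        cmd.append("--reset")          # start a fresh state on the first batch
    cmd += ["increment", report_dir]
    subprocess.check_call(cmd)

# e.g. once per day:
# process_batch("/data/reports/2016-10-25", "/data/out/2016-10-25.txt")
```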

The main problem I see with Malheur is long-term deployment. Over time, the collected clusters will grow and thus the tool will slow down. We have never explored this issue due to a lack of a stable data source (someone running thousands of malware samples through a sandbox on a regular basis).

Regards, Konrad

SeanKim777 commented 7 years ago

Thanks for the information and comments, @mallorybobalice and @rieck. @rieck, the reason I opened this issue on the cuckoo repo instead of malheur is that while 2 minutes of processing time for 14K reports could be regarded as short, if cuckoo needs to process thousands of suspicious samples in a short period of time, the malheur processing step can become a bottleneck for the entire pipeline, and this is what I am currently seeing.

I have limited understanding of malheur, so please correct me if my approach to solving this issue is wrong.

I have not fully understood issue rieck/malheur#12 created by @spender-sandbox, because I got a different result from incremental execution on my datasets. I used the man page example syntax and divided the dataset into 3 chunks:

```
malheur -c cuckoocfg -s /tmp/state -o out1.txt -v --reset increment dataset1.zip
malheur -c cuckoocfg -s /tmp/state -o out2.txt -v increment dataset2.zip
malheur -c cuckoocfg -s /tmp/state -o out3.txt -v increment dataset3.zip
```

I believe I got a result similar to the one created by the cuckoo malheur reporting module (source), and the generated output renders the same on the Django web UI "similar" tab. Total execution time is almost the same between incremental and cluster mode, but since incremental mode takes less time per chunk, I think executing malheur in incremental mode could lower cuckoo's report generation time (i.e., changing malheur.py to execute malheur in incremental mode and merge the output into one result file).
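One possible sketch of that merge step, assuming the output is line-oriented with the report name as the first field (comment lines starting with '#'); the exact field layout is not shown here, so adjust the split if the malheur output format differs:

```python
# Sketch: merge several incremental malheur output files into one combined
# result, keeping the most recent line per report. The format assumption
# (report id as first whitespace-separated field) is hypothetical.
def merge_outputs(out_files, merged_path):
    latest = {}
    for path in out_files:              # process in chronological order
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                report_id = line.split()[0]
                latest[report_id] = line   # later files win
    with open(merged_path, "w") as out:
        for line in latest.values():
            out.write(line + "\n")

# merge_outputs(["out1.txt", "out2.txt", "out3.txt"], "merged.txt")
```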

@rieck, according to your answer above, can I use this approach for every new report? E.g., if cuckoo generates a new analysis report at '1561_dir/1561.mist_file.txt':

```
malheur -c cuckoocfg -s /tmp/state -o output_1561.txt -v increment 1561_dir
```

and then merge output_1561.txt into the previously created and merged output (this is only for rendering the web UI). Is that the same as:

```
malheur -c cuckoocfg -s /tmp/state -o full_dataset_result -v cluster dataset_full.zip
```

Another question, about the '-s' option: should I keep 'prototypes.zfa' and 'rejected.zfa' to get the same result in incremental mode?

rieck commented 7 years ago

Okay, let me comment on these issues:

Hope this helps, Konrad

mallorybobalice commented 7 years ago

Alright, so basically I have the same problem. I run process.py with several workers, and each time an analysis is run, each worker spawns a malheur process, say up to 10 at once. Basically, if I understood @rieck correctly, we can improve performance by batching malheur runs (à la the periodic retention module), trading immediate results for CPU time, say every 20 analyses at a good saving of CPU cost. If that is so, over to you Brad @spender-sandbox and @rieck on what the implications and implementation effort would be.
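A minimal sketch of such a counter-based trigger; the counter file and hook name are hypothetical, and with several process.py workers the counter would also need file locking:

```python
# Sketch: run malheur once every BATCH_SIZE completed analyses instead of
# once per analysis. COUNTER_FILE and on_analysis_finished() are
# hypothetical; multiple workers would require locking around the counter.
BATCH_SIZE = 20
COUNTER_FILE = "/var/lib/cuckoo/malheur.counter"

def on_analysis_finished(run_malheur):
    """Call from the reporting stage after each analysis completes."""
    try:
        with open(COUNTER_FILE) as fh:
            count = int(fh.read())
    except (IOError, ValueError):
        count = 0
    count += 1
    if count >= BATCH_SIZE:
        run_malheur()   # e.g. the incremental process_batch() sketched above
        count = 0
    with open(COUNTER_FILE, "w") as fh:
        fh.write(str(count))
```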

Btw, I know I can open the text files and have a look, but which parts of the report are used for clustering? Could we save effort by running malheur on signatures (cuckoo often has behavior signatures) rather than on large inputs encapsulating the fine-grained behavior analysis, if that is what it currently runs on?

PS: @mallorybobalice (testing if editing works)

mallorybobalice commented 7 years ago

PS: edited, please check back.

rieck commented 7 years ago

I would recommend a larger chunk size, say 100 or even 500. Malheur implements a more or less complex feature extraction process. Please refer to the journal paper for details: https://www.sec.cs.tu-bs.de/pubs/2011-jcs.pdf
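Illustratively, a chunked driver could look like the following sketch; `process_batch` is the hypothetical incremental helper sketched earlier in this thread:

```python
# Sketch: stage pending reports into chunks of a few hundred and run
# malheur incrementally per chunk, per the recommendation above.
# process_batch() is the hypothetical helper sketched earlier.
def chunks(items, size=500):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# for n, batch in enumerate(chunks(pending_report_dirs)):
#     stage the batch's reports into one directory, then:
#     process_batch(staged_dir, "out_%d.txt" % n)
```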

SeanKim777 commented 7 years ago

@mallorybobalice and @rieck, thanks for the comments and kind information. I'm working on updating the malheur report module as advised and will post it after I finish testing, though I'm not sure how long it will take.

mallorybobalice commented 7 years ago

@spender-sandbox final bump :( ?

spender-sandbox commented 7 years ago

We don't bump stuff here -- I see all the messages fine. The problem remains that for many/most people's usage of Malheur, when they submit a sample they'd like to be able to immediately see what similar samples exist. The lock necessitates switching to a mode where that can't happen because it needs to be performed in chunks. I don't know myself how to resolve that, if I did I would comment. Or if you do, comment about that, but please don't just bump.

-Brad

rieck commented 7 years ago

Potentially off-topic: would it make sense to have a tool that simply computes the similarity of one report to a group of previous reports and returns the top most similar reports? This could be done much quicker than clustering: definitely $O(n)$, but likely even faster.
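A toy sketch of such a similarity checker, using q-gram sets and Jaccard similarity purely for illustration (this is not Malheur's actual feature extraction); it shows why a single linear scan is cheaper than full clustering:

```python
# Sketch: compare one new report against previous reports and return the
# top matches. qgrams/Jaccard here stand in for the real feature extraction.
def qgrams(text, q=4):
    """Extract the set of character q-grams from a report."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity of two q-gram sets (0.0 for two empty sets)."""
    return len(a & b) / float(len(a | b)) if a or b else 0.0

def most_similar(new_report, old_reports, top=5):
    """old_reports: dict mapping report id -> report text. O(n) scan."""
    new_feats = qgrams(new_report)
    scores = [(rid, jaccard(new_feats, qgrams(text)))
              for rid, text in old_reports.items()]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top]
```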

spender-sandbox commented 7 years ago

Yes, that's pretty much what we were going for with the "similar" tab -- we ended up cutting off results actually because in some instances there are too many similar samples.

-Brad

rieck commented 7 years ago

Interesting. We have a new research project that deals with the analysis of malware behaviour. We are using neither Cuckoo nor Malheur. Nonetheless, something like a simple "similarity checker" could be done as part of this project.

@chwress: What do you think?

mallorybobalice commented 7 years ago

@spender-sandbox, sorry, let me elaborate. Not a bump for its own sake, but:

> when they submit a sample they'd like to be able to immediately see what similar samples exist.

I get this. Like I said, though, in my case, with say 30k rolling malheur reports and txt files, each analysis is delayed 4-5 minutes by malheur (with significant CPU load). Is there any acceptable design for a batch-mode malheur in cuckoo, for those happy to run malheur reports every X samples or hours? What do you feel the implications are for the UI, API, and other modules (I'm only familiar with the UI "similar" tab)? Are the full implications incomplete Mongo, ES, or syslog reports, without being able to get malheur info in the same report (hence completely incompatible)? Or maybe no immediate result, but e.g. a dashboard indicator of malheur reports done up to task X?

mallorybobalice commented 6 years ago

@rieck, did you mean one of your recent projects on GitHub? I don't necessarily mean for cuckoo-modified, just in principle.

rieck commented 6 years ago

I was talking about a "real" project. See here http://www.vamos-project.org

;)