False positives strategy

gwillem commented 7 years ago

Hi! As noted in #45, your rules produce a lot of false positives for a clean Magento installation. How do you think the php-malware-finder project should handle this? Some options:

Accept it, interpreting fps will take much time
Extend whitelists for known legit apps. It is hard though to make whitelists for small modifications (ie patches).
Reduce the scope of signatures so they are more strict
others?

I am not sure about the best strategy. However I have created some Yara rules that are more strict and identify malware that was specifically found on Magento sites. Would love to collaborate!

Best Willem

jvoisin commented 7 years ago

Thank you for raising an issue and proposing solutions.

For us, 1. is of course unacceptable :)
About 2. we currently ship some (versioned) whitelists that can easily be amended.
For 3. it's really tricky to reduce the number of false-positives without compromising the efficiency of PMF, it's just that some Magento files are looking malicious, we must accept this fact.

Our current approach is to automatically generate whitelists for CMS with simple python scripts, feel free to take a look at them and to submit PR if you think that they could be improved :)

About your rules, most of them (especially frontend.yar)are based on artifacts, while we try to write rules based on (probable) behaviour/generic composition.

For example, this rule of your will catch eval(base64_decode(gzinflate($_GET['p']))), but won't match on eval (base64_decode (gzinflate ($_GET['p']))).

About collaboration, feel free to borrow as much as you want from PMF, it's released under LGPL. We'l be super-happy if you could share with us samples to analyse, or even better, generic rules :)

gwillem commented 7 years ago

Thanks for your elaborate answer! I created my own rules because I use them for automatic quarantining, and therefore need 99.9%+ accuracy.

So I learned that this malware signature business is a big tradeoff between accuracy and coverage. Artifact signatures have high accuracy but low coverage, and vice versa for behavioral signatures. So I guess our rules are complementary and serve two different goals!

That said, I would love to minimize the amount of manual work required to use your rules to find new malware (ie. having to dig through fewer FPs would help). Now I would gladly contribute SHA1 hashes for all Magento releases and extensions ever, but in your current architecture that would explode the execution time.

Perhaps I should do some testing first. I'll keep you posted!

gwillem commented 7 years ago

So I made a list of unique SHA1 hashes of all (public) Magento releases ever (7.2MB, 184K hashes):

https://buq.eu/magento/all-releases.sha1

And added a whitelist feature to mwscan:

# time mwscan -s nbs -w all-releases.sha1 all-magento/magento-2.0.6/
Wed Jan 18 20:36:06 2017 Using NBS rules.
Wed Jan 18 20:36:06 2017 Fetching php.yar
Wed Jan 18 20:36:06 2017 Fetching whitelist.yar
Wed Jan 18 20:36:06 2017 Fetching drupal.yar
Wed Jan 18 20:36:06 2017 Fetching wordpress.yar
Wed Jan 18 20:36:06 2017 Fetching symfony.yar
Wed Jan 18 20:36:06 2017 Fetching phpmyadmin.yar
Wed Jan 18 20:36:06 2017 Fetching magento2.yar
Wed Jan 18 20:36:07 2017 Fetching prestashop.yar
Wed Jan 18 20:36:07 2017 Fetching custom.yar
Wed Jan 18 20:36:07 2017 Fetching common.yar
Wed Jan 18 20:36:07 2017 Loaded 15 yara rules and 184941 whitelist entries
Wed Jan 18 20:36:09 2017 Finished scanning 29876 files: 0 malware and 29876 whitelisted.

real    0m2.570s
user    0m1.568s
sys 0m0.688s

Conclusion: using giant whitelists seems feasible, performance wise.

However, then I ran this all-magento whitelist on a few live sites and there were still tens of FPs, because of (private) extensions and core modifications. So I'm not sure an ultimate whitelist will help much, at least not for Magento where core modifications are (unfortunately) common.

So I guess we're back to options 1 or 3... What do you think?

jvoisin commented 7 years ago

Well, core modifications are bad™ ;) Whitelists for the core is a low-hanging fruit, since those false-positives will be present on every single Magento instances.

You do not need to whitelist every single file for every single Magento version: our Python script runs php-malware-finder on the whole directory, and spits a whitelist of detected files.

About 3, we do have a testsuite, with real samples in it. I do not think that we can reduce much the false-positives :/

For 1. what we do, is that we run pmf on a "clean" state of the website (The client dropping us an already-infected website is a risk that we're willing to take, and since we (quickly) review manually the first run of PMF, we might find the backdoors anyway), create and store a whitelist. So for the next run, we won't get any false-positive (or only a few, since the client might change its website's code).

gwillem commented 7 years ago

Re whitelisting core files: another issue I stumbled upon is that many sites have mangled line endings (\r\n vs \n), probably because of transformative FTP transfers. So whitelisting should probably require normalizing files (stripping whitespace) before hashing.

jvoisin commented 7 years ago

This would multiply by two the number of hashes :/

gwillem commented 7 years ago

... unless you normalize both during whitelist creation and malware checking.

jvoisin commented 7 years ago

I'm quite sure that we do not want to rewrite the client's files ;)

gwillem commented 7 years ago

Why would you need to write? Read => replace whitespace in-mem => hash => profit!

shaddai commented 7 years ago

So you mean that we should read file once and scan if for \r\n sequence, replace it, then make a hash, then scan again the file against the whitelist. Do you know how long it would take for a large site ?

When writing a tool such as PMF, you must keep in mind the scan time. As of today, pmf is already quite slow, so wasting more time (CPU time, but also I/O time) just because some shitty text editor or FTP client replaced \n with \r\n is way out of the scope of our project.

gwillem commented 7 years ago

@shaddai, am happy you're open to new ideas :)

We share a concern for speed, see also #45 for a 40 times speed improvement. The cpu time required to replace whitespaces is negligible in this context: on my laptop it takes 1.1s to remove all whitespace from 1GB of data.

I don't understand your remark about extra I/O: how is this involved if you replace in-mem?

gwillem commented 7 years ago

jvoisin commented 7 years ago

I'm not convinced that this is a relevant issue regarding our userbase: how many people are using broken ftp/text editor?

I also agree with @shaddai, reading every file once before scanning them is a significant waste of time.

Worst case, if this issue is a real concern, once can easily patch php-malware-finder to implement such normalization behaviour.

tl;dr We're not going to implement EOL-normalization upsteam, unless there is a really good reason/motivation to do so.

jvoisin / php-malware-finder

False positives strategy #46