wazuh / wazuh-qa

Wazuh - Quality Assurance
GNU General Public License v2.0
65 stars 32 forks source link

FIM System tests: 0202 - Realtime monitoring #531

Closed spothound closed 4 years ago

spothound commented 4 years ago

Hello team,

This issue is related to #493. It will contain the advances related to the creation of the scenario 0202

Objectives

Implement scenario 202.

Tasks

Regards, Francisco Navarro.

manuasir commented 4 years ago

Thanks @grg121 for taking this. Don't hesitate to ask the team anything that you might want, and please use this ticket for tracking purposes. :+1:

spothound commented 4 years ago

Ey, thanks @manuasir, all right, I'll keep you updated!

Working on this issue.

Branch: https://github.com/wazuh/wazuh-qa

    <directories realtime="yes" check_all="yes" recursion_level="4">/opt/fim_testing</directories>
spothound commented 4 years ago

I'm running the test within the testing environment and I've detected a problem with the default value of the whodata_check var on verify_alerts_json.py and verify_alerts_elasticsearch.py

Fix: https://github.com/wazuh/wazuh-qa/commit/8f7e990f0e9be14054727eac9f772e315d4b26f9

The problem is that "False" and "false" default values are interpreted as True in Python.

spothound commented 4 years ago

Testing

Automatic

Execution 0 (Syscheck frequency of 5 seconds)

{
    "root_folder": "{{ agents_fim_testing_path }}",
    "recursion_level": 3,
    "folder_length": 5,
    "file_length": 5,
    "file_size_specifications":[
        { "size": 10240, "amount": 10000},
        { "size": 524288, "amount": 1000},
        { "size": 1048576, "amount": 100},
        { "size": 10485760, "amount": 1}
    ]
}

SUSCESSFULL

172.16.0.111               : ok=20   changed=11   unreachable=0    failed=0    skipped=22   rescued=0    ignored=0   
172.16.0.131               : ok=15   changed=8    unreachable=0    failed=0    skipped=27   rescued=0    ignored=0   
172.16.0.141               : ok=15   changed=9    unreachable=0    failed=0    skipped=27   rescued=0    ignored=0   
172.16.0.161               : ok=4    changed=0    unreachable=0    failed=0    skipped=38   rescued=0    ignored=0 

All the expected alerts where generated.

Manually testing

Nevertheless, I wanted to know if the alerts where generated actually in realtime so I generated an alert manually by creating the following file:

touch /opt/fim_testing/realtime/$(  date +"%T")

The alert generated was:

{
  "timestamp": "2020-02-25T23:50:35.419+0000",
  "rule": {
    "level": 5,
    "description": "File added to the system.",
    "id": "554",
    "firedtimes": 22204,
    "mail": false,
    "groups": [
      "ossec",
      "syscheck"
    ],
    "pci_dss": [
      "11.5"
    ],
    "gpg13": [
      "4.11"
    ],
    "gdpr": [
      "II_5.1.f"
    ],
    "hipaa": [
      "164.312.c.1",
      "164.312.c.2"
    ],
    "nist_800_53": [
      "SI.7"
    ]
  },
  "agent": {
    "id": "002",
    "name": "Centos",
    "ip": "10.0.2.15"
  },
  "manager": {
    "name": "serv-test-manager-centos-1"
  },
  "id": "1582674635.481796020",
  "full_log": "File '/opt/fim_testing/realtime/23:50:35' added\nMode: real-time\n",
  "syscheck": {
    "path": "/opt/fim_testing/realtime/23:50:35",
    "size_after": "0",
    "perm_after": "rw-r--r--",
    "uid_after": "0",
    "gid_after": "0",
    "md5_after": "d41d8cd98f00b204e9800998ecf8427e",
    "sha1_after": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
    "sha256_after": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "uname_after": "root",
    "gname_after": "root",
    "mtime_after": "2020-02-25T23:50:35",
    "inode_after": 100993634,
    "event": "added"
  },
  "decoder": {
    "name": "syscheck_new_entry"
  },
  "location": "syscheck"
}

Here, mtime_after coincides with the moment I generated the file (23:50:35) and the alert timestamp (timestamp) so it seems that the alert has been generated as soon as the file was added.

Nonetheless, when running the automatic test, the generates alerts have some delay between the moment the file was added/modified and the alert generation.

I decided to try calculating it for each alert generated during the test and plotting it getting the following graph:

Figure_1 *note: just modification alerts

The x-axis represents delay time in seconds between an event and the triggered alert. The y-axis represents the number of alerts with that delay.

spothound commented 4 years ago

I thought maybe the problem was related with syscheck synchronization time, as it was, by default, set to 10 seconds in the agent (notice that realtime is disabled while a syscheck scan is performed). So I increased the time between scans to 5 minutes and launch the second automatic test:

Testing

Automatic

Execution 1 (Syscheck frequency of 5 minutes)

(same config)

{
    "root_folder": "{{ agents_fim_testing_path }}",
    "recursion_level": 3,
    "folder_length": 5,
    "file_length": 5,
    "file_size_specifications":[
        { "size": 10240, "amount": 10000},
        { "size": 524288, "amount": 1000},
        { "size": 1048576, "amount": 100},
        { "size": 10485760, "amount": 1}
    ]
}

This time, Windows modified alers failed with 6582 missing alerts

```json
failed: [172.16.0.111] (item={'uid': 1000, 'woth': False, 'mtime': 1582677236.9333484, 'inode': 34315478, 'isgid': False, 'size': 390463, 'roth': True, 'isuid': False, 'isreg': True, 'pw_name': 'vagrant', 'gid': 100, 'ischr': False, 'wusr': True, 'xoth': False, 'rusr': True, 'nlink': 1, 'issock': False, 'rgrp': True, 'gr_name': 'users', 'path': '/opt/agents_files_output/modified/files_modified.txt-172.16.0.141', 'xusr': False, 'atime': 1582677237.841035, 'isdir': False, 'ctime': 1582677237.842035, 'isblk': False, 'xgrp': False, 'dev': 2049, 'wgrp': False, 'isfifo': False, 'mode': '0644', 'islnk': False}) => {"ansible_loop_var": "item", "changed": true, "item": {"atime": 1582677237.841035, "ctime": 1582677237.842035, "dev": 2049, "gid": 100, "gr_name": "users", "inode": 34315478, "isblk": false, "ischr": false, "isdir": false, "isfifo": false, "isgid": false, "islnk": false, "isreg": true, "issock": false, "isuid": false, "mode": "0644", "mtime": 1582677236.9333484, "nlink": 1, "path": "/opt/agents_files_output/modified/files_modified.txt-172.16.0.141", "pw_name": "vagrant", "rgrp": true, "roth": true, "rusr": true, "size": 390463, "uid": 1000, "wgrp": false, "woth": false, "wusr": true, "xgrp": false, "xoth": false, "xusr": false}, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 172.16.0.111 closed.\r\n", "stderr_lines": ["Shared connection to 172.16.0.111 closed."], "stdout": "2020-02-26 00:34:01,890 [INFO] alerts.json verification started\r\n2020-02-26 00:34:02,683 [INFO] Filelist related alerts aren't growing (0) ...\r\n2020-02-26 00:35:02,738 [INFO] Elapsed time: ~ 60 seconds \r\n\r\n2020-02-26 00:35:03,417 [INFO] Filelist related alerts aren't growing (1) ...\r\n2020-02-26 00:36:03,435 [INFO] Elapsed time: ~ 121 seconds \r\n\r\n2020-02-26 00:36:04,023 [INFO] Filelist related alerts aren't growing (2) ...\r\n2020-02-26 00:37:04,056 [INFO] Elapsed time: ~ 182 seconds \r\n\r\n2020-02-26 00:37:04,641 [INFO] Filelist related alerts aren't growing (3) ...\r\n2020-02-26 00:38:04,686 [INFO] Elapsed time: ~ 242 seconds \r\n\r\n2020-02-26 00:38:05,252 [INFO] Filelist related alerts aren't growing (4) ...\r\n2020-02-26 00:39:05,293 [INFO] Elapsed time: ~ 303 seconds \r\n\r\n2020-02-26 00:39:06,192 [WARNING] Verify alerts test - NOT OK. 6582 alerts are missing.\r\n\r\n2020-02-26 00:39:06,195 [WARNING] 6582 missing alerts.\r\n\r\n", "stdout_lines": ["2020-02-26 00:34:01,890 [INFO] alerts.json verification started", "2020-02-26 00:34:02,683 [INFO] Filelist related alerts aren't growing (0) ...", "2020-02-26 00:35:02,738 [INFO] Elapsed time: ~ 60 seconds ", "", "2020-02-26 00:35:03,417 [INFO] Filelist related alerts aren't growing (1) ...", "2020-02-26 00:36:03,435 [INFO] Elapsed time: ~ 121 seconds ", "", "2020-02-26 00:36:04,023 [INFO] Filelist related alerts aren't growing (2) ...", "2020-02-26 00:37:04,056 [INFO] Elapsed time: ~ 182 seconds ", "", "2020-02-26 00:37:04,641 [INFO] Filelist related alerts aren't growing (3) ...", "2020-02-26 00:38:04,686 [INFO] Elapsed time: ~ 242 seconds ", "", "2020-02-26 00:38:05,252 [INFO] Filelist related alerts aren't growing (4) ...", "2020-02-26 00:39:05,293 [INFO] Elapsed time: ~ 303 seconds ", "", "2020-02-26 00:39:06,192 [WARNING] Verify alerts test - NOT OK. 6582 alerts are missing.", "", "2020-02-26 00:39:06,195 [WARNING] 6582 missing alerts.", ""]}

I generated another graph:

Delay times 2

Execution 2 (Syscheck frequency of 5 minutes)

PLAY RECAP ***********************************************************************************************************
172.16.0.111               : ok=20   changed=11   unreachable=0    failed=0    skipped=22   rescued=0    ignored=0   
172.16.0.131               : ok=15   changed=8    unreachable=0    failed=0    skipped=27   rescued=0    ignored=0   
172.16.0.141               : ok=15   changed=8    unreachable=0    failed=0    skipped=27   rescued=0    ignored=0   
172.16.0.161               : ok=4    changed=0    unreachable=0    failed=0    skipped=38   rescued=0    ignored=0 

All the alerts generated.

Graphics:

Figure_1

Comparing Windows and Linux agent I get to the following conclusions:

spothound commented 4 years ago

Testing

Automatic

Execution 2 (Syscheck frequency of 500 seconds)

(same config)

{
    "root_folder": "{{ agents_fim_testing_path }}",
    "recursion_level": 8,
    "folder_length": 10,
    "file_length": 10,
    "file_size_specifications":[
        { "size": 10240, "amount": 50000},
        { "size": 524288, "amount": 100},
        { "size": 1048576, "amount": 100},
        { "size": 10485760, "amount": 1}
    ]
}

With this config, the test fails by the Verify alerts test - NOT OK. 460 alerts are missing.

Nevertheless, if I check the manager it seems that these alerts are still being generated... so I would like to investigate if the test is ending correctly or it would require more time.

As we've noticed, the way this test is performed, it generates some delay in the alert generation due to the overhead. If the test seems to fail when that overhead is big... maybe we need to ensure there are alerts that are being dropped and is not simply a lack of testing time.

Error log

failed: [172.16.0.111] (item={'uid': 1000, 'woth': False, 'mtime': 1582681957.5618412, 'inode': 67640362, 'isgid': False, 'size': 2797442, 'roth': True, 'isuid': False, 'isreg': True, 'pw_name': 'vagrant', 'gid': 100, 'ischr': False, 'wusr': True, 'xoth': False, 'rusr': True, 'nlink': 1, 'issock': False, 'rgrp': True, 'gr_name': 'users', 'path': '/opt/agents_files_output/added/files_generated.txt-172.16.0.131', 'xusr': False, 'atime': 1582682225.8554828, 'isdir': False, 'ctime': 1582682225.888483, 'isblk': False, 'xgrp': False, 'dev': 2049, 'wgrp': False, 'isfifo': False, 'mode': '0644', 'islnk': False}) => {"ansible_loop_var": "item", "changed": true, "item": {"atime": 1582682225.8554828, "ctime": 1582682225.888483, "dev": 2049, "gid": 100, "gr_name": "users", "inode": 67640362, "isblk": false, "ischr": false, "isdir": false, "isfifo": false, "isgid": false, "islnk": false, "isreg": true, "issock": false, "isuid": false, "mode": "0644", "mtime": 1582681957.5618412, "nlink": 1, "path": "/opt/agents_files_output/added/files_generated.txt-172.16.0.131", "pw_name": "vagrant", "rgrp": true, "roth": true, "rusr": true, "size": 2797442, "uid": 1000, "wgrp": false, "woth": false, "wusr": true, "xgrp": false, "xoth": false, "xusr": false}, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 172.16.0.111 closed.\r\n", "stderr_lines": ["Shared connection to 172.16.0.111 closed."], "stdout": "2020-02-26 01:57:06,987 [INFO] alerts.json verification started\r\n2020-02-26 01:57:07,058 [INFO] Filelist related alerts aren't growing (0) ...\r\n2020-02-26 01:58:07,113 [INFO] Elapsed time: ~ 60 seconds \r\n\r\n2020-02-26 01:58:07,202 [INFO] Filelist related alerts aren't growing (1) ...\r\n2020-02-26 01:59:07,219 [INFO] Elapsed time: ~ 120 seconds \r\n\r\n2020-02-26 01:59:07,310 [INFO] Filelist related alerts aren't growing (2) ...\r\n2020-02-26 02:00:07,340 [INFO] Elapsed time: ~ 180 seconds \r\n\r\n2020-02-26 02:00:07,400 [INFO] Filelist related alerts aren't growing (3) ...\r\n2020-02-26 02:01:07,456 [INFO] Elapsed time: ~ 240 seconds \r\n\r\n2020-02-26 02:02:07,727 [INFO] Elapsed time: ~ 300 seconds \r\n\r\n2020-02-26 02:03:08,875 [INFO] Elapsed time: ~ 361 seconds \r\n\r\n2020-02-26 02:04:11,003 [INFO] Elapsed time: ~ 424 seconds \r\n\r\n2020-02-26 02:05:13,966 [INFO] Elapsed time: ~ 486 seconds \r\n\r\n2020-02-26 02:06:18,214 [INFO] Elapsed time: ~ 551 seconds \r\n\r\n2020-02-26 02:06:22,383 [INFO] Filelist related alerts aren't growing (0) ...\r\n2020-02-26 02:07:22,457 [INFO] Elapsed time: ~ 615 seconds \r\n\r\n2020-02-26 02:07:26,357 [INFO] Filelist related alerts aren't growing (1) ...\r\n2020-02-26 02:08:26,377 [INFO] Elapsed time: ~ 679 seconds \r\n\r\n2020-02-26 02:08:30,602 [INFO] Filelist related alerts aren't growing (2) ...\r\n2020-02-26 02:09:30,668 [INFO] Elapsed time: ~ 743 seconds \r\n\r\n2020-02-26 02:09:35,510 [INFO] Filelist related alerts aren't growing (3) ...\r\n2020-02-26 02:10:35,554 [INFO] Elapsed time: ~ 808 seconds \r\n\r\n2020-02-26 02:10:44,293 [INFO] Filelist related alerts aren't growing (4) ...\r\n2020-02-26 02:11:44,322 [INFO] Elapsed time: ~ 877 seconds \r\n\r\n2020-02-26 02:11:53,649 [WARNING] Verify alerts test - NOT OK. 460 alerts are missing.\r\n\r\n2020-02-26 02:11:53,682 [WARNING] 460 missing alerts.\r\n\r\n", "stdout_lines": ["2020-02-26 01:57:06,987 [INFO] alerts.json verification started", "2020-02-26 01:57:07,058 [INFO] Filelist related alerts aren't growing (0) ...", "2020-02-26 01:58:07,113 [INFO] Elapsed time: ~ 60 seconds ", "", "2020-02-26 01:58:07,202 [INFO] Filelist related alerts aren't growing (1) ...", "2020-02-26 01:59:07,219 [INFO] Elapsed time: ~ 120 seconds ", "", "2020-02-26 01:59:07,310 [INFO] Filelist related alerts aren't growing (2) ...", "2020-02-26 02:00:07,340 [INFO] Elapsed time: ~ 180 seconds ", "", "2020-02-26 02:00:07,400 [INFO] Filelist related alerts aren't growing (3) ...", "2020-02-26 02:01:07,456 [INFO] Elapsed time: ~ 240 seconds ", "", "2020-02-26 02:02:07,727 [INFO] Elapsed time: ~ 300 seconds ", "", "2020-02-26 02:03:08,875 [INFO] Elapsed time: ~ 361 seconds ", "", "2020-02-26 02:04:11,003 [INFO] Elapsed time: ~ 424 seconds ", "", "2020-02-26 02:05:13,966 [INFO] Elapsed time: ~ 486 seconds ", "", "2020-02-26 02:06:18,214 [INFO] Elapsed time: ~ 551 seconds ", "", "2020-02-26 02:06:22,383 [INFO] Filelist related alerts aren't growing (0) ...", "2020-02-26 02:07:22,457 [INFO] Elapsed time: ~ 615 seconds ", "", "2020-02-26 02:07:26,357 [INFO] Filelist related alerts aren't growing (1) ...", "2020-02-26 02:08:26,377 [INFO] Elapsed time: ~ 679 seconds ", "", "2020-02-26 02:08:30,602 [INFO] Filelist related alerts aren't growing (2) ...", "2020-02-26 02:09:30,668 [INFO] Elapsed time: ~ 743 seconds ", "", "2020-02-26 02:09:35,510 [INFO] Filelist related alerts aren't growing (3) ...", "2020-02-26 02:10:35,554 [INFO] Elapsed time: ~ 808 seconds ", "", "2020-02-26 02:10:44,293 [INFO] Filelist related alerts aren't growing (4) ...", "2020-02-26 02:11:44,322 [INFO] Elapsed time: ~ 877 seconds ", "", "2020-02-26 02:11:53,649 [WARNING] Verify alerts test - NOT OK. 460 alerts are missing.", "", "2020-02-26 02:11:53,682 [WARNING] 460 missing alerts.", ""]}

I would like to investigate if these missing alerts are actually missing or will be generated in the following syscheck simple scan (maybe realtime dropped them but FIM will actually detect these changes).

I will continue with that tomorrow.

Best regards,

spothound commented 4 years ago

Hello team,

I've been talking with Alberto Rodriguez and Jose M about this scenario, we've concluded we will need an additional scenario to "Check overlap of scheduled syscheck scan and realtime scan".

The scenario developed for this issue will be only for realtime (with a long scan frequency, so there are no overlaps) and we want to ensure that files start being generated after the first scan finish when the agent is restarted.

Also, I want to make sure that alerts marked as dropped are actually dropped by increasing the timeout of the test and maybe the replays of the alert checking.

So, to finish this issue, today I will:


spothound commented 4 years ago

Hello team, here the script used to generate these graphics of time delay: realtime_delay.zip

spothound commented 4 years ago

Testing

Automatic

Execution 3 (Syscheck frequency of 50000 seconds) retry: 4

(same config)

{
    "root_folder": "{{ agents_fim_testing_path }}",
    "recursion_level": 3,
    "folder_length": 5,
    "file_length": 5,
    "file_size_specifications":[
        { "size": 10240, "amount": 10000},
        { "size": 524288, "amount": 1000},
        { "size": 1048576, "amount": 100},
        { "size": 10485760, "amount": 1}
    ]
}

This test have not passed with the new changes.

3 Failed replays.

Testing

Automatic

Execution 4 (Syscheck frequency of 50000 seconds) retry: 4

A small number of files for realtime

{
    "root_folder": "{{ agents_fim_testing_path }}",
    "recursion_level": 3,
    "folder_length": 5,
    "file_length": 5,
    "file_size_specifications":[
        { "size": 10240, "amount": 1000},
        { "size": 524288, "amount": 100},
        { "size": 1048576, "amount": 10},
        { "size": 10485760, "amount": 1}
    ]
}

Fails on windows checking delete alerts

2020-02-27 00:24:57,767 [WARNING] 1110 missing alerts.

Add and modification seems to be right.

Testing

Automatic

Execution 5 (Syscheck frequency of 50000 seconds) retry: 4

A small number of files for realtime

{
    "root_folder": "{{ agents_fim_testing_path }}",
    "recursion_level": 3,
    "folder_length": 5,
    "file_length": 5,
    "file_size_specifications":[
        { "size": 10240, "amount": 100},
        { "size": 524288, "amount": 100},
        { "size": 1048576, "amount": 10},
        { "size": 10485760, "amount": 1}
    ]
}

:ok:


PLAY RECAP ***********************************************************************************************************
172.16.0.111               : ok=20   changed=11   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
172.16.0.131               : ok=17   changed=9    unreachable=0    failed=0    skipped=29   rescued=0    ignored=0   
172.16.0.141               : ok=17   changed=9    unreachable=0    failed=0    skipped=29   rescued=0    ignored=0   
172.16.0.161               : ok=4    changed=0    unreachable=0    failed=0    skipped=42   rescued=0    ignored=0  

small

With a small number of files everything seems to work (all the alerts are correctly generated). Nevertheless, it seems that the delay in windows is higher than Linux. We'll keep checking that.

spothound commented 4 years ago

Testing

Automatic

Execution 6 (Syscheck frequency of 50000 seconds) retry: 4

A small number of files for realtime

{
    "root_folder": "{{ agents_fim_testing_path }}",
    "recursion_level": 3,
    "folder_length": 5,
    "file_length": 5,
    "file_size_specifications":[
        { "size": 10240, "amount": 10000},
        { "size": 524288, "amount": 1000},
        { "size": 1048576, "amount": 100},
        { "size": 10485760, "amount": 1}
    ]
}
spothound commented 4 years ago

I've detected this test fills the realtime queue, the following message appear in the ossec.log:

WARNING: Real-time inotify kernel queue is full. Some events may be lost. Next scheduled scan will recover lost data.

New aproach: do not use a "infinite" frequency for the scan, instead use 5 mintues, so realtime works for 5 minutes and then a schecheduled scan fit the "lost" alerts.

spothound commented 4 years ago

We have decided the following for this scenario:

spothound commented 4 years ago

After making these changes:

Testing

Automatic

Execution 7 (Syscheck frequency of 50000 seconds) retry: 4

elastic_verification_extra_arguments: "-r 10"
alert_verification_extra_arguments: "-r 10"
create_script_extra_params: "-b 500 -w 5"
delete_script_extra_params: "-b 500 -w 5"
modify_script_extra_params: "-b 500 -w 5"
clean_logs: true
{
    "root_folder": "{{ agents_fim_testing_path }}",
    "recursion_level": 3,
    "folder_length": 5,
    "file_length": 5,
    "file_size_specifications":[
        { "size": 10240, "amount": 10000},
        { "size": 524288, "amount": 1000},
        { "size": 1048576, "amount": 100},
        { "size": 10485760, "amount": 1}
    ]
}
172.16.0.111               : ok=18   changed=9    unreachable=0    failed=1    skipped=26   rescued=0    ignored=0   
172.16.0.131               : ok=17   changed=9    unreachable=0    failed=0    skipped=29   rescued=0    ignored=0   
172.16.0.141               : ok=17   changed=9    unreachable=0    failed=0    skipped=29   rescued=0    ignored=0   
172.16.0.161               : ok=4    changed=0    unreachable=0    failed=0    skipped=42   rescued=0    ignored=0 

Failed only in the delete step... Windows queue overflow!

Add

Figure_1

Modify

Figure_1