Closed mjeffe closed 9 years ago
I tried changing the level and overwriting a different rule to see if that triggered the bug as well, but it did not.
I have been experiencing exactly the same problem and ended up running ossec-analysisd from within gdb. Here's one hint at the possible reason for the crash - and why it was so difficult to replicate:
2014/12/17 11:37:37 ossec-analysisd: DEBUG: FTSInit completed.
2014/12/17 11:37:37 ossec-analysisd: DEBUG: Active response Init completed.
2014/12/17 11:37:37 ossec-analysisd: DEBUG: Startup completed. Waiting for new messages..
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff784d171 in __strlen_sse2 () from /lib64/libc.so.6
(gdb) bt full
#0 0x00007ffff784d171 in __strlen_sse2 () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007ffff7b6788c in GeoIP_open () from /usr/lib64/libGeoIP.so.1
No symbol table info available.
#2 0x00007ffff7fd81eb in ?? ()
No symbol table info available.
#3 0x00007ffff7fd85fe in ?? ()
No symbol table info available.
#4 0x00007ffff7fbff41 in ?? ()
No symbol table info available.
#5 0x00007ffff7fc0955 in main ()
No symbol table info available.
So my suspicion was that a null pointer is passed to GeoIP_open(), which then calls strlen() on it and crashes. Naturally, I checked the GeoIP configuration in OSSEC: while I had the <use_geoip>yes</use_geoip> option in the alerts section, there was no <geoip_db_path>/usr/share/GeoIP/GeoIP.dat</geoip_db_path> in global. After adding it, the crashes seem to have stopped.
The only fix I would suggest in OSSEC itself is reporting an error when use_geoip is enabled but no GeoIP database location is specified.
@kravietz You may have discovered another issue. I don't have any geoip stuff set (or compiled in AFAIK), and can reproduce the crash.
@ddpbsd Quite possible - I would suggest running the crashing daemon from within gdb; it provides quite useful information for finding the bug. Here's how I did it:
# cd /var/ossec/bin/
# gdb ossec-analysisd
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-75.el6)
(gdb) set args -fd
(gdb) run
And then just wait for the SIGSEGV to happen. When it does, run bt full and the results should provide a hint at the bug's location. The -fd arguments make analysisd run in the foreground with debug output.
Thanks, I'll continue to do exactly that.
Everything I've done so far definitely points at https://github.com/ossec/ossec-hids/blob/master/src/analysisd/analysisd.c#L1664 but I don't know enough to be able to figure out what's going wrong.
Like ddpbsd, my experience seems to point to the if(!currently_rule->event_search(... line, not the GeoIP stuff. I've run analysisd under gdb a couple of different ways. Below is the process I used to attach to the running analysisd process, and the output. I also tried to start analysisd with gdb as you described, but I was not confident I got the other ossec daemons started correctly.
[root@ossectst ossec]# ps -ef | grep ossec
root 5018 4903 0 12:39 pts/2 00:00:00 tail -f logs/ossec.log
root 5675 2407 0 12:50 pts/1 00:00:00 grep ossec
[root@ossectst ossec]# service ossec start
Starting OSSEC: [ OK ]
[root@ossectst ossec]# ps -ef | grep ossec
root 5018 4903 0 12:39 pts/2 00:00:00 tail -f logs/ossec.log
ossecm 5711 1 0 12:50 ? 00:00:00 /var/ossec/bin/ossec-maild
root 5715 1 0 12:50 ? 00:00:00 /var/ossec/bin/ossec-execd
ossec 5719 1 0 12:50 ? 00:00:00 /var/ossec/bin/ossec-analysisd
root 5723 1 0 12:50 ? 00:00:00 /var/ossec/bin/ossec-logcollector
ossecr 5728 1 0 12:50 ? 00:00:00 /var/ossec/bin/ossec-remoted
root 5734 1 0 12:50 ? 00:00:00 /var/ossec/bin/ossec-syscheckd
ossec 5737 1 0 12:50 ? 00:00:00 /var/ossec/bin/ossec-monitord
root 5747 2407 0 12:50 pts/1 00:00:00 grep ossec
[root@ossectst ossec]# gdb /var/ossec/bin/ossec-analysisd 5719
GNU gdb (GDB) Amazon Linux (7.6.1-51.24.amzn1)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-amazon-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /var/ossec/bin/ossec-analysisd...done.
Attaching to program: /var/ossec/bin/ossec-analysisd, process 5719
Reading symbols from /lib64/libm.so.6...Reading symbols from /usr/lib/debug/lib64/libm-2.17.so.debug...done.
done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...Reading symbols from /usr/lib/debug/lib64/libc-2.17.so.debug...done.
done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib64/ld-2.17.so.debug...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...Reading symbols from /usr/lib/debug/lib64/libnss_files-2.17.so.debug...done.
done.
Loaded symbols for /lib64/libnss_files.so.2
0x00007f9f2b003b53 in __recvfrom_nocancel () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0 0x00007f9f2b003b53 in __recvfrom_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000430569 in OS_RecvUnix (socket=4, sizet=6144, ret=0x7fffe0f85870 "1:/var/log/maillog") at os_net.c:539
#2 0x0000000000403612 in OS_ReadMSG (m_queue=4) at analysisd.c:754
#3 0x0000000000403262 in main (argc=1, argv=0x7fffe0f87268) at analysisd.c:555
(gdb) cont
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x0000000000404a0e in OS_CheckIfRuleMatch (lf=0x1b0a490, curr_node=0x1addcd0) at analysisd.c:1631
#2 0x0000000000404a53 in OS_CheckIfRuleMatch (lf=0x1b0a490, curr_node=0x1adcac0) at analysisd.c:1654
#3 0x0000000000404a53 in OS_CheckIfRuleMatch (lf=0x1b0a490, curr_node=0x1ad53f0) at analysisd.c:1654
#4 0x0000000000404a53 in OS_CheckIfRuleMatch (lf=0x1b0a490, curr_node=0x199c4e0) at analysisd.c:1654
#5 0x0000000000403a5e in OS_ReadMSG (m_queue=4) at analysisd.c:984
#6 0x0000000000403262 in main (argc=1, argv=0x7fffe0f87268) at analysisd.c:555
(gdb) list
76 #else
77
78 /* This is a "normal" system call stub: if there is an error,
79 it returns -1 and sets errno. */
80
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
82 ret
83 T_PSEUDO_END (SYSCALL_SYMBOL)
84
85 #endif
(gdb) bt full
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#1 0x0000000000404a0e in OS_CheckIfRuleMatch (lf=0x1b0a490, curr_node=0x1addcd0) at analysisd.c:1631
currently_rule = 0x1add7c0
#2 0x0000000000404a53 in OS_CheckIfRuleMatch (lf=0x1b0a490, curr_node=0x1adcac0) at analysisd.c:1654
child_node = 0x1addcd0
child_rule = 0x0
currently_rule = 0x1adc770
#3 0x0000000000404a53 in OS_CheckIfRuleMatch (lf=0x1b0a490, curr_node=0x1ad53f0) at analysisd.c:1654
child_node = 0x1adcac0
child_rule = 0x0
currently_rule = 0x1ad5ff0
#4 0x0000000000404a53 in OS_CheckIfRuleMatch (lf=0x1b0a490, curr_node=0x199c4e0) at analysisd.c:1654
child_node = 0x1ad53f0
child_rule = 0x0
currently_rule = 0x199c210
#5 0x0000000000403a5e in OS_ReadMSG (m_queue=4) at analysisd.c:984
rulenode_pt = 0x199c4e0
i = 765
msg = "1:(itchy) 10.0.1.0->netstat -tan |grep LISTEN |grep -v 127.0.0.1 | sort\000ossec: output: 'netstat -tan |grep LISTEN |grep -v 127.0.0.1 | sort':\ntcp 0 0 0.0.0.0:22", ' ' <repeats 18 times>, "0.0.0.0:*", ' ' <repeats 19 times>...
lf = 0x1b0a490
---Type <return> to continue, or q <return> to quit---
stats_rule = 0x1b08b00
#6 0x0000000000403262 in main (argc=1, argv=0x7fffe0f87268) at analysisd.c:555
c = -2
m_queue = 4
test_config = 0
run_foreground = 0
debug_level = 0
dir = 0x444bc0 "/var/ossec"
user = 0x444bcb "ossec"
group = 0x444bcb "ossec"
uid = 503
gid = 502
cfg = 0x444bd1 "/var/ossec/etc/ossec.conf"
(gdb) cont
Continuing.
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb) quit
[root@ossectst ossec]# tail /var/log/messages
... snip ...
Dec 15 10:45:51 ossectst kernel: [ 2082.001287] ossec-analysisd[3772]: segfault at 0 ip (null) sp 00007fffa4fa8218 error 14 in ossec-analysisd[400000+65000]
This problem was also present in 2.7.1. Not sure how far back I want to go checking on this though.
Commenting out the check_diff option kept it from crashing for me. (not a solution, just troubleshooting)
I have a potential fix in this branch: https://github.com/ddpbsd/ossec-hids/tree/diff_overwrite I'm sure it's missing something, but my keyboard can't take many more hits from my forehead. It'd be great if someone could test it out a bit.
@mjeffe did you by chance get time to test out @ddpbsd's potential fix?
I did not look into it over the Christmas break. I've got a current workaround, so my production system is functional. I still want to help with this issue, however, so I will try to test the fix this week.
For some reason src/header/zlib.h and zconf.h were missing from the diff_overwrite branch, which broke the build:
In file included from os_crypto/shared/keys.c:21:0:
./os_zlib/os_zlib.h:14:18: fatal error: zlib.h: No such file or directory
#include "zlib.h"
^
I grabbed them from my current 2.8.1 src/header dir and was then able to compile and install. Now I'll try to crash it...
Well, it's been running all day with no segfault. I'll let it run over the weekend and then report back.
It ran all weekend with no segfaults. Anything else you guys need me to try?
@mjeffe Thanks for testing. I'll open a pull request.
@mjeffe Accepted into master. Closing ticket.
I recently upgraded OSSEC from version 2.7.1 to 2.8.1. The new version of analysisd kept segfaulting. After a couple of weeks of working with it, I've narrowed it down to a simple rule override in local_rules.xml where I downgrade the level. Below I describe my test scenario, where I can consistently reproduce the segfault with a bare minimum. The current environment is Amazon AWS VPC; all servers are running Amazon Linux 64-bit AMIs (based on RHEL/CentOS).
Here is what I did:
1) I spun up a brand new server (clone of my current ossec server), installed a fresh 2.8.1 ossec server. Modified local_rules.xml to look like this:
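(The rule XML itself appears to have been swallowed by the issue tracker's formatting. Purely as a hedged reconstruction of what such an overwrite typically looks like - the rule id, fields, and level below are illustrative guesses based on the thread's mention of the netstat rule and check_diff, not the reporter's actual file:)

```xml
<group name="local,">
  <!-- Illustrative only: downgrade the stock netstat-change rule.
       Rule 533 is the standard "Listened ports status (netstat)
       changed" rule, which carries check_diff. -->
  <rule id="533" level="1" overwrite="yes">
    <if_sid>530</if_sid>
    <match>ossec: output: 'netstat -tan</match>
    <check_diff />
    <description>Listened ports status (netstat) changed (downgraded).</description>
  </rule>
</group>
```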
I did NOT add any agents, but let that run for about an hour (may not be necessary).
2) I then stopped the ossec server, ran manage_agents, added one agent and restarted the ossec server. NOTE, I did not transfer the agent key, so the agent was not trying to communicate yet. I let that run for about an hour (again, may not be necessary).
3) I transferred the key to the agent - this is an existing agent; I just added the new key, changed the <server-ip> to point to my new test server, and restarted the agent and server. I began to see communication between server and agent. I let it run its initial scans - about 10 minutes.
4) Then I opened a new listening port using nc -l 7777 on the agent to try to trigger the netstat -tan rule.
5) ossec-analysisd segfaulted after about 7 minutes.
Let me know if you want any of the output or if you want me to run any other tests.
Note, I initially posted this to https://groups.google.com/forum/#!topic/ossec-list/WM3v7fmaS6I with the same subject title. dan (ddpbsd) was able to reproduce the segfault using the above information, but there was no resolution, so it was suggested I post here.