wazuh / wazuh-indexer

Wazuh indexer, the Wazuh search engine
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
11 stars 17 forks source link

[BUG] Indexer failed after a while - 4.8.0 Beta2 #167

Closed davidcr01 closed 7 months ago

davidcr01 commented 7 months ago

Describe the bug A clear and concise description of what the bug is.

During https://github.com/wazuh/wazuh/issues/22122, a new bug has been found. The installation is performed using the Wazuh installation assistant, using the 4.8.0 beta2 packages.

The installation is performed correctly, but after some time, the Wazuh indexer service fails, staying on a red status:

[root@ip-172-31-47-43 ec2-user]# systemctl status wazuh-indexer
● wazuh-indexer.service - Wazuh-indexer
   Loaded: loaded (/usr/lib/systemd/system/wazuh-indexer.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Mon 2024-02-26 15:14:54 UTC; 3h 7min ago
     Docs: https://documentation.wazuh.com
  Process: 16640 ExecStart=/usr/share/wazuh-indexer/bin/systemd-entrypoint -p ${PID_DIR}/wazuh-indexer.pid --quiet (code=killed, signal=KILL)
 Main PID: 16640 (code=killed, signal=KILL)

Feb 26 15:02:10 ip-172-31-47-43.ec2.internal systemd-entrypoint[16640]: ...
Feb 26 15:02:10 ip-172-31-47-43.ec2.internal systemd-entrypoint[16640]: ...
Feb 26 15:02:13 ip-172-31-47-43.ec2.internal systemd-entrypoint[16640]: ...
Feb 26 15:02:13 ip-172-31-47-43.ec2.internal systemd-entrypoint[16640]: ...
Feb 26 15:02:13 ip-172-31-47-43.ec2.internal systemd-entrypoint[16640]: ...
Feb 26 15:02:13 ip-172-31-47-43.ec2.internal systemd-entrypoint[16640]: ...
Feb 26 15:02:31 ip-172-31-47-43.ec2.internal systemd[1]: ...
Feb 26 15:14:54 ip-172-31-47-43.ec2.internal systemd[1]: ...
Feb 26 15:14:54 ip-172-31-47-43.ec2.internal systemd[1]: ...
Feb 26 15:14:54 ip-172-31-47-43.ec2.internal systemd[1]: ...
Hint: Some lines were ellipsized, use -l to show in full.
[root@ip-172-31-47-43 ec2-user]# 

The logs:

[2024-02-26T15:04:17,941][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:04:18,044][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:59176}
[2024-02-26T15:04:19,529][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:04:19,552][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:48908}
[2024-02-26T15:04:23,434][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:04:23,442][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:48914}
[2024-02-26T15:04:29,785][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:04:29,799][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:42496}
[2024-02-26T15:04:41,017][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:04:41,034][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:44352}
[2024-02-26T15:05:10,542][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:05:10,554][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:59334}
[2024-02-26T15:06:07,737][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:06:07,745][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:33012}
[2024-02-26T15:06:58,531][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:06:58,541][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:52146}
[2024-02-26T15:07:33,575][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
[2024-02-26T15:07:33,578][WARN ][o.o.h.AbstractHttpServerTransport] [node-1] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:46760}
[2024-02-26T15:07:47,388][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:50466
[2024-02-26T15:07:49,320][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:50466
[2024-02-26T15:07:52,963][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:50466
[2024-02-26T15:08:00,982][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:50466
[2024-02-26T15:08:15,698][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:50466
[2024-02-26T15:08:16,783][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:52240
[2024-02-26T15:08:19,335][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:52092
[2024-02-26T15:08:23,823][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:52096
[2024-02-26T15:08:32,614][WARN ][o.o.s.a.BackendRegistry  ] [node-1] Authentication finally failed for admin from 127.0.0.1:39312
[root@ip-172-31-47-43 ec2-user]# 

To Reproduce Steps to reproduce the behavior:

  1. Download the 4.8.0 beta2 Wazuh installation assistant: curl -sO https://packages-dev.wazuh.com/4.8/wazuh-install.sh && bash wazuh-install.sh -a and perform and AIO installation.
  2. Consult for errors in /var/log/wazuh-indexer/wazuh-cluster.log

Expected behavior No errors generated.

Plugins None.

Host/Environment (please complete the following information):

Extra information This was not reproduced in the previous stage: https://github.com/wazuh/wazuh/issues/21799

f-galland commented 7 months ago

@davidcr01, I cannot reproduce the issue on a vagrant VM using the generic/rhel9 box.

Grepping for errors in wazuh-cluster.log only shows installer probes before the security plugins is initialized:

[root@rhel9 ~]# grep ERROR /var/log/wazuh-indexer/wazuh-cluster.log 
[2024-02-26T19:36:20,423][ERROR][o.o.s.a.s.SinkProvider   ] [node-1] Default endpoint could not be created, auditlog will not work properly.
[2024-02-26T19:36:22,798][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,800][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,800][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,806][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,807][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,807][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,807][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,807][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,807][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,807][ERROR][o.o.s.c.ConfigurationLoaderSecurity7] [node-1] Failure no such index [.opendistro_security] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)
[2024-02-26T19:36:22,932][ERROR][o.o.s.a.BackendRegistry  ] [node-1] Not yet initialized (you may need to run securityadmin)
[2024-02-26T19:51:04,772][ERROR][o.o.s.a.s.SinkProvider   ] [node-1] Default endpoint could not be created, auditlog will not work properly.
[2024-02-26T19:51:09,512][ERROR][o.o.s.a.BackendRegistry  ] [node-1] Not yet initialized (you may need to run securityadmin)
[2024-02-26T19:51:09,555][ERROR][o.o.s.a.BackendRegistry  ] [node-1] Not yet initialized (you may need to run securityadmin)
[2024-02-26T19:51:09,560][ERROR][o.o.s.a.BackendRegistry  ] [node-1] Not yet initialized (you may need to run securityadmin)
[2024-02-26T19:51:09,569][ERROR][o.o.s.a.BackendRegistry  ] [node-1] Not yet initialized (you may need to run securityadmin)

If you look closely at your wazuh-cluster.log output, you will see that there is a certificate error of some kind going on:

[2024-02-26T15:07:33,575][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [node-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate

I'll leave the VM running, but I doubt this will change anything.

f-galland commented 7 months ago

I was finally able to reproduce the issue with a clean vagrant VM running the automated installer. The service run for a while and then it exited with an oom-kill error (out of memory).

Feb 27 14:02:11 ubuntu-jammy systemd[1]: wazuh-indexer.service: Consumed 1min 29.469s CPU time.
Hint: Some lines were ellipsized, use -l to show in full.
root@ubuntu-jammy:~# systemctl status wazuh-indexer --no-pager -l
× wazuh-indexer.service - Wazuh-indexer
     Loaded: loaded (/lib/systemd/system/wazuh-indexer.service; enabled; vendor preset: enabled)
     Active: failed (Result: oom-kill) since Tue 2024-02-27 14:02:11 UTC; 6min ago
       Docs: https://documentation.wazuh.com
    Process: 4321 ExecStart=/usr/share/wazuh-indexer/bin/systemd-entrypoint -p ${PID_DIR}/wazuh-indexer.pid --quiet (code=killed, signal=KILL)
   Main PID: 4321 (code=killed, signal=KILL)
        CPU: 1min 29.469s

Feb 27 13:41:42 ubuntu-jammy systemd-entrypoint[4321]: WARNING: System::setSecurityManager will be removed in a future release
Feb 27 13:41:43 ubuntu-jammy systemd-entrypoint[4321]: WARNING: A terminally deprecated method in java.lang.System has been called
Feb 27 13:41:43 ubuntu-jammy systemd-entrypoint[4321]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.Security (file:/usr/share/wazuh-indexer/lib/opensearch-2.10.0.jar)
Feb 27 13:41:43 ubuntu-jammy systemd-entrypoint[4321]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.Security
Feb 27 13:41:43 ubuntu-jammy systemd-entrypoint[4321]: WARNING: System::setSecurityManager will be removed in a future release
Feb 27 13:41:50 ubuntu-jammy systemd[1]: Started Wazuh-indexer.
Feb 27 14:02:11 ubuntu-jammy systemd[1]: wazuh-indexer.service: A process of this unit has been killed by the OOM killer.
Feb 27 14:02:11 ubuntu-jammy systemd[1]: wazuh-indexer.service: Main process exited, code=killed, status=9/KILL
Feb 27 14:02:11 ubuntu-jammy systemd[1]: wazuh-indexer.service: Failed with result 'oom-kill'.
Feb 27 14:02:11 ubuntu-jammy systemd[1]: wazuh-indexer.service: Consumed 1min 29.469s CPU time.

On closer inspection, it seems wazuh-modulesd is exhausting available memory:

root@ubuntu-jammy:~# ps -eo 'cmd,%cpu,%mem' --sort '%mem' | tail -5
/var/ossec/framework/python  0.0  0.7
/var/ossec/framework/python  0.0  0.7
/var/ossec/framework/python  0.1  1.1
/usr/share/wazuh-dashboard/  1.0  2.3
/var/ossec/bin/wazuh-module 67.8 35.0

I stopped the wazuh-manager service and restarted the wazuh-indexer one, and so far, the indexer managed to stay up.

f-galland commented 7 months ago

Closing it as no action is needed on behalf of the Indexer team.