Closed davidcr01 closed 2 months ago
Access of the VM was shared to @AlexRuiz7 and @f-galland.
I logged in to the machine with the provided credentials and was able to asses that indeed, the service is not starting up because of a timeout.
[root@wazuh-server ~]# systemctl status wazuh-indexer
● wazuh-indexer.service - wazuh-indexer
Loaded: loaded (/usr/lib/systemd/system/wazuh-indexer.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Fri 2024-07-12 15:16:59 UTC; 3h 17min ago
Docs: https://documentation.wazuh.com
Process: 2404 ExecStart=/usr/share/wazuh-indexer/bin/systemd-entrypoint -p ${PID_DIR}/wazuh-indexer.pid --quiet (code=exited, status=143)
Main PID: 2404 (code=exited, status=143)
Jul 12 15:16:12 wazuh-server systemd-entrypoint[2404]: Jul 12, 2024 3:16:11 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
Jul 12 15:16:12 wazuh-server systemd-entrypoint[2404]: WARNING: COMPAT locale provider will be removed in a future release
Jul 12 15:16:15 wazuh-server systemd-entrypoint[2404]: WARNING: A terminally deprecated method in java.lang.System has been called
Jul 12 15:16:15 wazuh-server systemd-entrypoint[2404]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.Security (file:/usr/share/wazuh-in....13.0.jar)
Jul 12 15:16:15 wazuh-server systemd-entrypoint[2404]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.Security
Jul 12 15:16:15 wazuh-server systemd-entrypoint[2404]: WARNING: System::setSecurityManager will be removed in a future release
Jul 12 15:16:59 wazuh-server systemd[1]: wazuh-indexer.service start operation timed out. Terminating.
Jul 12 15:16:59 wazuh-server systemd[1]: Failed to start wazuh-indexer.
Jul 12 15:16:59 wazuh-server systemd[1]: Unit wazuh-indexer.service entered failed state.
Jul 12 15:16:59 wazuh-server systemd[1]: wazuh-indexer.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
However I would need to review the AMI build and deploy logs to determine the cause for this issue, since issuing a:
systemctl restart wazuh-indexer
within the instance solves the problem.
I have observed the following errors in the wazuh-cluster.log
file.
[2024-07-15T00:00:00,323][WARN ][o.o.p.c.u.JsonConverter ] [node-1] Json Mapping Error: Cannot invoke "java.lang.Long.longValue()" because "this.cacheMaxSize" is null (through reference chain: org.opensearch.performanceanalyzer.collectors.CacheConfigMetricsCollector$CacheMaxSizeStatus["Cache_MaxSize"])
[2024-07-15T00:00:00,341][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-07-15T00:00:00,349][INFO ][o.o.p.PluginsService ] [node-1] PluginService:onIndexModule index:[wazuh-statistics-2024.29w/u_MtaqwGSy-I4a8QIEn0Ew]
[2024-07-15T00:00:00,355][INFO ][o.o.c.m.MetadataMappingService] [node-1] [wazuh-statistics-2024.29w/u_MtaqwGSy-I4a8QIEn0Ew] update_mapping [_doc]
[2024-07-15T00:00:00,375][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-07-15T00:00:00,376][INFO ][o.o.c.r.a.AllocationService] [node-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[wazuh-monitoring-2024.29w][0]]]).
[2024-07-15T00:00:00,390][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-07-15T00:00:00,394][INFO ][o.o.c.m.MetadataUpdateSettingsService] [node-1] updating number_of_replicas to [0] for indices [wazuh-monitoring-2024.29w]
[2024-07-15T00:00:05,323][WARN ][o.o.p.c.u.JsonConverter ] [node-1] Json Mapping Error: Cannot invoke "java.lang.Long.longValue()" because "this.cacheMaxSize" is null (through reference chain: org.opensearch.performanceanalyzer.collectors.CacheConfigMetricsCollector$CacheMaxSizeStatus["Cache_MaxSize"])
[2024-07-15T00:00:10,323][WARN ][o.o.p.c.u.JsonConverter ] [node-1] Json Mapping Error: Cannot invoke "java.lang.Long.longValue()" because "this.cacheMaxSize" is null (through reference chain: org.opensearch.performanceanalyzer.collectors.CacheConfigMetricsCollector$CacheMaxSizeStatus["Cache_MaxSize"])
@davidcr01
That log is dated today, so I'm not sure it's connected to the issue.
The full wazuh-cluster.log
from the first start attempt follows:
We can notice an actual ERROR
level log there.
[2024-07-12T15:16:52,028][ERROR][o.o.p.c.j.GCMetrics ] [node-1] MX bean missing: G1 Concurrent GC
which I'm currently investigating.
We reviewed the AMI build process with @davidcr01 and determined this could be caused by the JVM's memory allocation pool as controlled by the Xmx
and Xms
variables within jvm.options
.
A script is currently setting these values based on the total RAM amount, but under some circumstances, this may be too high (elastic recommends using half the available memory).
#!/bin/sh
# Configure JVM options for Wazuh indexer
ram_mb=$(free -m | awk '/^Mem:/{print $2}')
ram="$(( ram_mb / 2 ))"
if [ "${ram}" -eq "0" ]; then
ram=1024;
fi
regex="^\-Xmx\K[0-9]+"
file="/etc/wazuh-indexer/jvm.options"
value=$(grep -oP ${regex} ${file})
if [[ "${value}" != "${ram}" ]]; then
eval "sed -i "s/^-Xms.*$/-Xms${ram}m/" ${file} ${debug}"
eval "sed -i "s/^-Xmx.*$/-Xmx${ram}m/" ${file} ${debug}"
fi
systemctl stop updateIndexerHeap.service
Hard coding the Xmx
and Xms
values to 1024m
got rid of the G1 GC
error in the logs, but the indexer is still exiting after a timeout.
It looks like wazuh-indexer
is getting started before the network interfaces are set up:
[root@wazuh-server wazuh-indexer]# journalctl --since 13:43:04 --until 13:44:21 | grep systemd
Jul 15 13:43:04 wazuh-server systemd[1]: Started Postfix Mail Transport Agent.
Jul 15 13:43:04 wazuh-server systemd[1]: Started Initial cloud-init job (metadata service crawler).
Jul 15 13:43:04 wazuh-server systemd[1]: Reached target Cloud-config availability.
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Permit User Sessions...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting OpenSSH Server Key Generation...
Jul 15 13:43:04 wazuh-server systemd[1]: Reached target Network is Online.
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Wazuh manager...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Notify NFS peers of a restart...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Apply the settings specified in cloud-config...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting wazuh-indexer...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Finds and configures elastic network interfaces...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting System Logging Service...
Jul 15 13:43:04 wazuh-server systemd[1]: Started Filebeat sends log files to Logstash or directly to Elasticsearch..
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Dynamically Generate Message Of The Day...
Jul 15 13:43:04 wazuh-server systemd[1]: Started amazon-ssm-agent.
Jul 15 13:43:04 wazuh-server systemd[1]: Started Permit User Sessions.
Jul 15 13:43:04 wazuh-server systemd[1]: Started Notify NFS peers of a restart.
Jul 15 13:43:04 wazuh-server systemd[1]: Started Finds and configures elastic network interfaces.
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Terminate Plymouth Boot Screen...
Jul 15 13:43:04 wazuh-server systemd[1]: Started Job spooling tools.
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Wait for Plymouth Boot Screen to Quit...
Jul 15 13:43:05 wazuh-server systemd[1]: Started Command Scheduler.
Jul 15 13:43:05 wazuh-server systemd[1]: Received SIGRTMIN+21 from PID 1623 (plymouthd).
Jul 15 13:43:05 wazuh-server systemd[1]: Started Dynamically Generate Message Of The Day.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Terminate Plymouth Boot Screen.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Wait for Plymouth Boot Screen to Quit.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Getty on tty1.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Serial Getty on ttyS0.
Jul 15 13:43:05 wazuh-server systemd[1]: Reached target Login Prompts.
Jul 15 13:43:05 wazuh-server systemd[1]: Started System Logging Service.
Jul 15 13:43:05 wazuh-server systemd[1]: Started OpenSSH Server Key Generation.
Jul 15 13:43:05 wazuh-server systemd[1]: Starting OpenSSH server daemon...
Jul 15 13:43:05 wazuh-server systemd[1]: Started OpenSSH server daemon.
Jul 15 13:43:05 wazuh-server systemd[1]: Created slice User Slice of root.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Session 1 of user root.
Jul 15 13:43:07 wazuh-server systemd[1]: Time has been changed
Jul 15 13:43:27 wazuh-server systemd-entrypoint[2324]: WARNING: A terminally deprecated method in java.lang.System has been called
Jul 15 13:43:27 wazuh-server systemd-entrypoint[2324]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.OpenSearch (file:/usr/share/wazuh-indexer/lib/opensearch-2.13.0.jar)
Jul 15 13:43:27 wazuh-server systemd-entrypoint[2324]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.OpenSearch
Jul 15 13:43:27 wazuh-server systemd-entrypoint[2324]: WARNING: System::setSecurityManager will be removed in a future release
Jul 15 13:43:30 wazuh-server systemd-entrypoint[2324]: Jul 15, 2024 1:43:30 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
Jul 15 13:43:30 wazuh-server systemd-entrypoint[2324]: WARNING: COMPAT locale provider will be removed in a future release
Jul 15 13:43:31 wazuh-server systemd[1]: Started Apply the settings specified in cloud-config.
Jul 15 13:43:31 wazuh-server systemd[1]: Starting Initial hibernation setup job...
Jul 15 13:43:31 wazuh-server systemd[1]: Starting Execute cloud user/final scripts...
Jul 15 13:43:31 wazuh-server systemd[1]: Started Initial hibernation setup job.
Jul 15 13:43:32 wazuh-server systemd[1]: Started Session c1 of user root.
Jul 15 13:43:32 wazuh-server systemd[1]: Started Session c2 of user root.
Jul 15 13:43:32 wazuh-server systemd[1]: Stopping OpenSSH server daemon...
Jul 15 13:43:32 wazuh-server systemd[1]: Stopped OpenSSH server daemon.
Jul 15 13:43:32 wazuh-server systemd[1]: Stopped OpenSSH Server Key Generation.
Jul 15 13:43:32 wazuh-server systemd[1]: Stopping OpenSSH Server Key Generation...
Jul 15 13:43:32 wazuh-server systemd[1]: Starting OpenSSH server daemon...
Jul 15 13:43:32 wazuh-server systemd[1]: Started OpenSSH server daemon.
Jul 15 13:43:32 wazuh-server systemd[1]: Started Execute cloud user/final scripts.
Jul 15 13:43:34 wazuh-server systemd-entrypoint[2324]: WARNING: A terminally deprecated method in java.lang.System has been called
Jul 15 13:43:34 wazuh-server systemd-entrypoint[2324]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.Security (file:/usr/share/wazuh-indexer/lib/opensearch-2.13.0.jar)
Jul 15 13:43:34 wazuh-server systemd-entrypoint[2324]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.Security
Jul 15 13:43:34 wazuh-server systemd-entrypoint[2324]: WARNING: System::setSecurityManager will be removed in a future release
Jul 15 13:43:44 wazuh-server systemd-logind[1850]: New session 2 of user wazuh-user.
Jul 15 13:43:44 wazuh-server systemd[1]: Created slice User Slice of wazuh-user.
Jul 15 13:43:44 wazuh-server systemd[1]: Started Session 2 of user wazuh-user.
Jul 15 13:44:20 wazuh-server systemd[1]: Started Wazuh manager.
Jul 15 13:44:20 wazuh-server systemd[1]: wazuh-indexer.service start operation timed out. Terminating.
Closer inspection on /var/log/messages
indicates that the interfaces are indeed ready by the time the indexer starts:
Jul 15 13:43:02 wazuh-server cloud-init: Cloud-init v. 19.3-46.amzn2.0.2 running 'init' at Mon, 15 Jul 2024 13:43:02 +0000. Up 31.59 seconds.
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | eth0 | True | 172.31.36.215 | 255.255.240.0 | global | 0e:e7:8f:f7:d3:35 |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | eth0 | True | fe80::ce7:8fff:fef7:d335/64 | . | link | 0e:e7:8f:f7:d3:35 |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | lo | True | ::1/128 | . | host | . |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: ++++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++++
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | 0 | 0.0.0.0 | 172.31.32.1 | 0.0.0.0 | eth0 | UG |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | 1 | 169.254.169.254 | 0.0.0.0 | 255.255.255.255 | eth0 | UH |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | 2 | 172.31.32.0 | 0.0.0.0 | 255.255.240.0 | eth0 | U |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-------------+---------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | Route | Destination | Gateway | Interface | Flags |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-------------+---------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | 9 | fe80::/64 | :: | eth0 | U |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | 11 | local | :: | eth0 | U |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | 12 | ff00::/8 | :: | eth0 | U |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-------------+---------+-----------+-------+
Jul 15 13:43:03 wazuh-server dhclient[2124]: XMT: Solicit on eth0, interval 2130ms.
Jul 15 13:43:04 wazuh-server systemd: Started Postfix Mail Transport Agent.
Jul 15 13:43:04 wazuh-server systemd: Started Initial cloud-init job (metadata service crawler).
Jul 15 13:43:04 wazuh-server systemd: Reached target Cloud-config availability.
Jul 15 13:43:04 wazuh-server systemd: Starting Permit User Sessions...
Jul 15 13:43:04 wazuh-server systemd: Starting OpenSSH Server Key Generation...
Jul 15 13:43:04 wazuh-server systemd: Reached target Network is Online.
Jul 15 13:43:04 wazuh-server systemd: Starting Wazuh manager...
Jul 15 13:43:04 wazuh-server systemd: Starting Notify NFS peers of a restart...
Jul 15 13:43:04 wazuh-server systemd: Starting Apply the settings specified in cloud-config...
Jul 15 13:43:04 wazuh-server systemd: Starting wazuh-indexer...
I also tried starving the system's RAM to check if that could trigger a timeout, but it exits with a code instead:
docker run --rm colinianking/stress-ng --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.7;}' < /proc/meminfo)k --vm-keep -m 1
[root@wazuh-server ~]# systemctl start wazuh-indexer
Job for wazuh-indexer.service failed because the control process exited with error code. See "systemctl status wazuh-indexer.service" and "journalctl -xe" for details.
[root@wazuh-server ~]# systemctl status wazuh-indexer
● wazuh-indexer.service - wazuh-indexer
Loaded: loaded (/usr/lib/systemd/system/wazuh-indexer.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2024-07-15 17:57:56 UTC; 14s ago
Docs: https://documentation.wazuh.com
Process: 5698 ExecStart=/usr/share/wazuh-indexer/bin/systemd-entrypoint -p ${PID_DIR}/wazuh-indexer.pid --quiet (code=exited, status=1/FAILURE)
Main PID: 5698 (code=exited, status=1/FAILURE)
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: output:
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: #
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: # There is insufficient memory for the Java Runtime Environment to continue.
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: # Native memory allocation (mmap) failed to map 4120903680 bytes for committing reserved memory.
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: # An error report file with more information is saved as:
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: # /var/log/wazuh-indexer/hs_err_pid5847.log
Jul 15 17:57:56 wazuh-server systemd[1]: wazuh-indexer.service: main process exited, code=exited, status=1/FAILURE
Jul 15 17:57:56 wazuh-server systemd[1]: Failed to start wazuh-indexer.
Jul 15 17:57:56 wazuh-server systemd[1]: Unit wazuh-indexer.service entered failed state.
Jul 15 17:57:56 wazuh-server systemd[1]: wazuh-indexer.service failed.
I brought this subject up with the support team and they mention this is a common issue when the disk is under stress and that a common fix is to increase the systemd service timeout. We might want to try that approach.
The old packages used to use a timeout of 180 seconds, while the newer ones generated from the wazuh-indexer repo, are currently using 75.
We need to bump this up and test the AMI deploy again.
Describe the bug
In https://github.com/wazuh/internal-devel-requests/issues/1262, we have detected that the Wazuh dashboard can not load because the Wazuh indexer service is failed after the AMI is built:
To Reproduce Steps to reproduce the behavior:
Expected behavior The Wazuh indexer should be running.
Host/Environment (please complete the following information):
Additional context If the indexer service is restarted, everything works as expected.