wazuh / wazuh-indexer

Wazuh indexer, the Wazuh search engine
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
11 stars 17 forks source link

[BUG] Wazuh indexer is failed in the AMI deployment #307

Closed davidcr01 closed 2 months ago

davidcr01 commented 2 months ago

Describe the bug

In https://github.com/wazuh/internal-devel-requests/issues/1262, we have detected that the Wazuh dashboard can not load because the Wazuh indexer service is failed after the AMI is built:

-- Logs begin at vie 2024-07-12 14:20:09 UTC, end at vie 2024-07-12 14:41:35 UTC. --
jul 12 14:20:47 wazuh-server systemd[1]: Starting wazuh-indexer...
-- Subject: Unit wazuh-indexer.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit wazuh-indexer.service has begun starting up.
jul 12 14:21:12 wazuh-server systemd-entrypoint[2270]: WARNING: A terminally deprecated method in java.lang.System has been called
jul 12 14:21:12 wazuh-server systemd-entrypoint[2270]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.OpenSearch (file:/usr/share/wazuh-indexer/lib/opensearch-2.13.0.jar)
jul 12 14:21:12 wazuh-server systemd-entrypoint[2270]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.OpenSearch
jul 12 14:21:12 wazuh-server systemd-entrypoint[2270]: WARNING: System::setSecurityManager will be removed in a future release
jul 12 14:21:15 wazuh-server systemd-entrypoint[2270]: Jul 12, 2024 2:21:15 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
jul 12 14:21:15 wazuh-server systemd-entrypoint[2270]: WARNING: COMPAT locale provider will be removed in a future release
jul 12 14:21:19 wazuh-server systemd-entrypoint[2270]: WARNING: A terminally deprecated method in java.lang.System has been called
jul 12 14:21:19 wazuh-server systemd-entrypoint[2270]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.Security (file:/usr/share/wazuh-indexer/lib/opensearch-2.13.0.jar)
jul 12 14:21:19 wazuh-server systemd-entrypoint[2270]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.Security
jul 12 14:21:19 wazuh-server systemd-entrypoint[2270]: WARNING: System::setSecurityManager will be removed in a future release
jul 12 14:22:03 wazuh-server systemd[1]: wazuh-indexer.service start operation timed out. Terminating.
jul 12 14:22:03 wazuh-server systemd[1]: Failed to start wazuh-indexer.
-- Subject: Unit wazuh-indexer.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit wazuh-indexer.service has failed.
-- 
-- The result is failed.

To Reproduce Steps to reproduce the behavior:

  1. Access to the provided instance (credentials shared with private message)
  2. Check the Wazuh indexer status (should be failed)

Expected behavior The Wazuh indexer should be running.

Host/Environment (please complete the following information):

Additional context If the indexer service is restarted, everything works as expected.

davidcr01 commented 2 months ago

Access of the VM was shared to @AlexRuiz7 and @f-galland.

f-galland commented 2 months ago

I logged in to the machine with the provided credentials and was able to asses that indeed, the service is not starting up because of a timeout.

[root@wazuh-server ~]# systemctl status wazuh-indexer
● wazuh-indexer.service - wazuh-indexer
   Loaded: loaded (/usr/lib/systemd/system/wazuh-indexer.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Fri 2024-07-12 15:16:59 UTC; 3h 17min ago
     Docs: https://documentation.wazuh.com
  Process: 2404 ExecStart=/usr/share/wazuh-indexer/bin/systemd-entrypoint -p ${PID_DIR}/wazuh-indexer.pid --quiet (code=exited, status=143)
 Main PID: 2404 (code=exited, status=143)

Jul 12 15:16:12 wazuh-server systemd-entrypoint[2404]: Jul 12, 2024 3:16:11 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
Jul 12 15:16:12 wazuh-server systemd-entrypoint[2404]: WARNING: COMPAT locale provider will be removed in a future release
Jul 12 15:16:15 wazuh-server systemd-entrypoint[2404]: WARNING: A terminally deprecated method in java.lang.System has been called
Jul 12 15:16:15 wazuh-server systemd-entrypoint[2404]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.Security (file:/usr/share/wazuh-in....13.0.jar)
Jul 12 15:16:15 wazuh-server systemd-entrypoint[2404]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.Security
Jul 12 15:16:15 wazuh-server systemd-entrypoint[2404]: WARNING: System::setSecurityManager will be removed in a future release
Jul 12 15:16:59 wazuh-server systemd[1]: wazuh-indexer.service start operation timed out. Terminating.
Jul 12 15:16:59 wazuh-server systemd[1]: Failed to start wazuh-indexer.
Jul 12 15:16:59 wazuh-server systemd[1]: Unit wazuh-indexer.service entered failed state.
Jul 12 15:16:59 wazuh-server systemd[1]: wazuh-indexer.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

However I would need to review the AMI build and deploy logs to determine the cause for this issue, since issuing a:

systemctl restart wazuh-indexer

within the instance solves the problem.

davidcr01 commented 2 months ago

I have observed the following errors in the wazuh-cluster.log file.

[2024-07-15T00:00:00,323][WARN ][o.o.p.c.u.JsonConverter  ] [node-1] Json Mapping Error: Cannot invoke "java.lang.Long.longValue()" because "this.cacheMaxSize" is null (through reference chain: org.opensearch.performanceanalyzer.collectors.CacheConfigMetricsCollector$CacheMaxSizeStatus["Cache_MaxSize"])
[2024-07-15T00:00:00,341][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-07-15T00:00:00,349][INFO ][o.o.p.PluginsService     ] [node-1] PluginService:onIndexModule index:[wazuh-statistics-2024.29w/u_MtaqwGSy-I4a8QIEn0Ew]
[2024-07-15T00:00:00,355][INFO ][o.o.c.m.MetadataMappingService] [node-1] [wazuh-statistics-2024.29w/u_MtaqwGSy-I4a8QIEn0Ew] update_mapping [_doc]
[2024-07-15T00:00:00,375][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-07-15T00:00:00,376][INFO ][o.o.c.r.a.AllocationService] [node-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[wazuh-monitoring-2024.29w][0]]]).
[2024-07-15T00:00:00,390][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-07-15T00:00:00,394][INFO ][o.o.c.m.MetadataUpdateSettingsService] [node-1] updating number_of_replicas to [0] for indices [wazuh-monitoring-2024.29w]
[2024-07-15T00:00:05,323][WARN ][o.o.p.c.u.JsonConverter  ] [node-1] Json Mapping Error: Cannot invoke "java.lang.Long.longValue()" because "this.cacheMaxSize" is null (through reference chain: org.opensearch.performanceanalyzer.collectors.CacheConfigMetricsCollector$CacheMaxSizeStatus["Cache_MaxSize"])
[2024-07-15T00:00:10,323][WARN ][o.o.p.c.u.JsonConverter  ] [node-1] Json Mapping Error: Cannot invoke "java.lang.Long.longValue()" because "this.cacheMaxSize" is null (through reference chain: org.opensearch.performanceanalyzer.collectors.CacheConfigMetricsCollector$CacheMaxSizeStatus["Cache_MaxSize"])
f-galland commented 2 months ago

@davidcr01

That log is dated today, so I'm not sure it's connected to the issue.

The full wazuh-cluster.log from the first start attempt follows:

wazuh-cluster.log ``` [2024-07-12T15:16:15,508][INFO ][o.o.n.Node ] [node-1] version[2.13.0], pid[2404], build[rpm/f327a4177e4bfee79ce9096dd18a6769d6cf77bc/2024-06-20T18:27:36.506219Z], OS[Linux/4.14.348-265.565.amzn2.x86_64/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.2/21.0.2+13-LTS] [2024-07-12T15:16:15,510][INFO ][o.o.n.Node ] [node-1] JVM home [/usr/share/wazuh-indexer/jdk], using bundled JDK/JRE [true] [2024-07-12T15:16:15,510][INFO ][o.o.n.Node ] [node-1] JVM arguments [-Xshare:auto, -Dopensearch.networkaddress.cache.ttl=60, -Dopensearch.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -XX:+ShowCodeDetailsInExceptionMessages, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.security.manager=allow, -Djava.locale.providers=SPI,COMPAT, -Xms3930m, -Xmx3930m, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/var/log/wazuh-indexer/tmp, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/wazuh-indexer, -XX:ErrorFile=/var/log/wazuh-indexer/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/wazuh-indexer/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.security.manager=allow, -Djava.util.concurrent.ForkJoinPool.common.threadFactory=org.opensearch.secure_sm.SecuredForkJoinWorkerThreadFactory, -Dclk.tck=100, -Djdk.attach.allowAttachSelf=true, -Djava.security.policy=file:///etc/wazuh-indexer/opensearch-performance-analyzer/opensearch_security.policy, --add-opens=jdk.attach/sun.tools.attach=ALL-UNNAMED, -XX:MaxDirectMemorySize=2060451840, -Dopensearch.path.home=/usr/share/wazuh-indexer, -Dopensearch.path.conf=/etc/wazuh-indexer, -Dopensearch.distribution.type=rpm, -Dopensearch.bundled_jdk=true] [2024-07-12T15:16:27,306][INFO ][o.o.s.s.t.SSLConfig ] [node-1] SSL dual mode is disabled [2024-07-12T15:16:27,306][INFO ][o.o.s.OpenSearchSecurityPlugin] [node-1] OpenSearch Config path is /etc/wazuh-indexer [2024-07-12T15:16:28,786][INFO ][o.o.s.s.DefaultSecurityKeyStore] [node-1] JVM supports TLSv1.3 [2024-07-12T15:16:28,789][INFO ][o.o.s.s.DefaultSecurityKeyStore] [node-1] Config directory is /etc/wazuh-indexer/, from there the key- and truststore files are resolved relatively [2024-07-12T15:16:30,147][INFO ][o.o.s.s.DefaultSecurityKeyStore] [node-1] TLS Transport Client Provider : JDK [2024-07-12T15:16:30,148][INFO ][o.o.s.s.DefaultSecurityKeyStore] [node-1] TLS Transport Server Provider : JDK [2024-07-12T15:16:30,149][INFO ][o.o.s.s.DefaultSecurityKeyStore] [node-1] TLS HTTP Provider : JDK [2024-07-12T15:16:30,149][INFO ][o.o.s.s.DefaultSecurityKeyStore] [node-1] Enabled TLS protocols for transport layer : [TLSv1.3, TLSv1.2] [2024-07-12T15:16:30,149][INFO ][o.o.s.s.DefaultSecurityKeyStore] [node-1] Enabled TLS protocols for HTTP layer : [TLSv1.2] [2024-07-12T15:16:30,541][INFO ][o.o.s.OpenSearchSecurityPlugin] [node-1] Clustername: wazuh-cluster [2024-07-12T15:16:46,229][INFO ][o.o.p.c.c.PluginSettings ] [node-1] Trying to create directory /dev/shm/performanceanalyzer/. [2024-07-12T15:16:46,230][INFO ][o.o.p.c.c.PluginSettings ] [node-1] Config: metricsLocation: /dev/shm/performanceanalyzer/, metricsDeletionInterval: 1, httpsEnabled: false, cleanup-metrics-db-files: true, batch-metrics-retention-period-minutes: 7, rpc-port: 9650, webservice-port 9600 [2024-07-12T15:16:52,028][ERROR][o.o.p.c.j.GCMetrics ] [node-1] MX bean missing: G1 Concurrent GC [2024-07-12T15:16:56,605][INFO ][o.o.i.r.ReindexPlugin ] [node-1] ReindexPlugin reloadSPI called [2024-07-12T15:16:56,606][INFO ][o.o.i.r.ReindexPlugin ] [node-1] Unable to find any implementation for RemoteReindexExtension [2024-07-12T15:16:57,175][INFO ][o.o.j.JobSchedulerPlugin ] [node-1] Loaded scheduler extension: opendistro_anomaly_detector, index: .opendistro-anomaly-detector-jobs [2024-07-12T15:16:57,582][INFO ][o.o.j.JobSchedulerPlugin ] [node-1] Loaded scheduler extension: reports-scheduler, index: .opendistro-reports-definitions [2024-07-12T15:16:57,583][INFO ][o.o.j.JobSchedulerPlugin ] [node-1] Loaded scheduler extension: opendistro-index-management, index: .opendistro-ism-config [2024-07-12T15:16:57,584][INFO ][o.o.j.JobSchedulerPlugin ] [node-1] Loaded scheduler extension: scheduler_geospatial_ip2geo_datasource, index: .scheduler-geospatial-ip2geo-datasource [2024-07-12T15:16:57,586][INFO ][o.o.j.JobSchedulerPlugin ] [node-1] Loaded scheduler extension: opensearch_sap_job, index: .opensearch-sap--job [2024-07-12T15:16:57,723][INFO ][o.o.p.PluginsService ] [node-1] loaded module [aggs-matrix-stats] [2024-07-12T15:16:57,724][INFO ][o.o.p.PluginsService ] [node-1] loaded module [analysis-common] [2024-07-12T15:16:57,724][INFO ][o.o.p.PluginsService ] [node-1] loaded module [cache-common] [2024-07-12T15:16:57,724][INFO ][o.o.p.PluginsService ] [node-1] loaded module [geo] [2024-07-12T15:16:57,724][INFO ][o.o.p.PluginsService ] [node-1] loaded module [ingest-common] [2024-07-12T15:16:57,725][INFO ][o.o.p.PluginsService ] [node-1] loaded module [ingest-geoip] [2024-07-12T15:16:57,725][INFO ][o.o.p.PluginsService ] [node-1] loaded module [ingest-user-agent] [2024-07-12T15:16:57,725][INFO ][o.o.p.PluginsService ] [node-1] loaded module [lang-expression] [2024-07-12T15:16:57,725][INFO ][o.o.p.PluginsService ] [node-1] loaded module [lang-mustache] [2024-07-12T15:16:57,726][INFO ][o.o.p.PluginsService ] [node-1] loaded module [lang-painless] [2024-07-12T15:16:57,726][INFO ][o.o.p.PluginsService ] [node-1] loaded module [mapper-extras] [2024-07-12T15:16:57,726][INFO ][o.o.p.PluginsService ] [node-1] loaded module [opensearch-dashboards] [2024-07-12T15:16:57,726][INFO ][o.o.p.PluginsService ] [node-1] loaded module [parent-join] [2024-07-12T15:16:57,727][INFO ][o.o.p.PluginsService ] [node-1] loaded module [percolator] [2024-07-12T15:16:57,727][INFO ][o.o.p.PluginsService ] [node-1] loaded module [rank-eval] [2024-07-12T15:16:57,727][INFO ][o.o.p.PluginsService ] [node-1] loaded module [reindex] [2024-07-12T15:16:57,727][INFO ][o.o.p.PluginsService ] [node-1] loaded module [repository-url] [2024-07-12T15:16:57,727][INFO ][o.o.p.PluginsService ] [node-1] loaded module [search-pipeline-common] [2024-07-12T15:16:57,728][INFO ][o.o.p.PluginsService ] [node-1] loaded module [systemd] [2024-07-12T15:16:57,728][INFO ][o.o.p.PluginsService ] [node-1] loaded module [transport-netty4] [2024-07-12T15:16:57,728][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-alerting] [2024-07-12T15:16:57,729][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-anomaly-detection] [2024-07-12T15:16:57,729][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-asynchronous-search] [2024-07-12T15:16:57,729][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-cross-cluster-replication] [2024-07-12T15:16:57,729][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-geospatial] [2024-07-12T15:16:57,729][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-index-management] [2024-07-12T15:16:57,730][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-job-scheduler] [2024-07-12T15:16:57,730][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-knn] [2024-07-12T15:16:57,730][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-ml] [2024-07-12T15:16:57,730][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-neural-search] [2024-07-12T15:16:57,730][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-notifications] [2024-07-12T15:16:57,731][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-notifications-core] [2024-07-12T15:16:57,731][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-observability] [2024-07-12T15:16:57,731][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-performance-analyzer] [2024-07-12T15:16:57,731][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-reports-scheduler] [2024-07-12T15:16:57,731][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-security] [2024-07-12T15:16:57,732][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-security-analytics] [2024-07-12T15:16:57,732][INFO ][o.o.p.PluginsService ] [node-1] loaded plugin [opensearch-sql] [2024-07-12T15:16:57,966][INFO ][o.o.s.OpenSearchSecurityPlugin] [node-1] Disabled https compression by default to mitigate BREACH attacks. You can enable it by setting 'http.compression: true' in opensearch.yml [2024-07-12T15:16:58,077][INFO ][o.o.e.ExtensionsManager ] [node-1] ExtensionsManager initialized [2024-07-12T15:16:58,102][INFO ][o.a.l.s.MemorySegmentIndexInputProvider] [node-1] Using MemorySegmentIndexInput with Java 21 or later; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false [2024-07-12T15:16:58,110][INFO ][o.o.e.NodeEnvironment ] [node-1] using [1] data paths, mounts [[/ (/dev/nvme0n1p1)]], net usable_space [92gb], net total_space [99.9gb], types [xfs] [2024-07-12T15:16:58,111][INFO ][o.o.e.NodeEnvironment ] [node-1] heap size [3.8gb], compressed ordinary object pointers [true] [2024-07-12T15:16:58,375][INFO ][o.o.n.Node ] [node-1] node name [node-1], node ID [EY65EkDXTa2tOvJAsqanKw], cluster name [wazuh-cluster], roles [ingest, remote_cluster_client, data, cluster_manager] ```

We can notice an actual ERROR level log there.

[2024-07-12T15:16:52,028][ERROR][o.o.p.c.j.GCMetrics      ] [node-1] MX bean missing: G1 Concurrent GC

which I'm currently investigating.

f-galland commented 2 months ago

We reviewed the AMI build process with @davidcr01 and determined this could be caused by the JVM's memory allocation pool as controlled by the Xmx and Xms variables within jvm.options. A script is currently setting these values based on the total RAM amount, but under some circumstances, this may be too high (elastic recommends using half the available memory).

#!/bin/sh

# Configure JVM options for Wazuh indexer
ram_mb=$(free -m | awk '/^Mem:/{print $2}')
ram="$(( ram_mb / 2 ))"

if [ "${ram}" -eq "0" ]; then
    ram=1024;
fi

regex="^\-Xmx\K[0-9]+"
file="/etc/wazuh-indexer/jvm.options"
value=$(grep -oP ${regex} ${file})

if [[ "${value}" != "${ram}" ]]; then
    eval "sed -i "s/^-Xms.*$/-Xms${ram}m/" ${file} ${debug}"
    eval "sed -i "s/^-Xmx.*$/-Xmx${ram}m/" ${file} ${debug}"
fi

systemctl stop updateIndexerHeap.service
f-galland commented 2 months ago

Hard coding the Xmx and Xms values to 1024m got rid of the G1 GC error in the logs, but the indexer is still exiting after a timeout.

f-galland commented 2 months ago

It looks like wazuh-indexer is getting started before the network interfaces are set up:

[root@wazuh-server wazuh-indexer]# journalctl --since 13:43:04 --until 13:44:21 | grep systemd
Jul 15 13:43:04 wazuh-server systemd[1]: Started Postfix Mail Transport Agent.
Jul 15 13:43:04 wazuh-server systemd[1]: Started Initial cloud-init job (metadata service crawler).
Jul 15 13:43:04 wazuh-server systemd[1]: Reached target Cloud-config availability.
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Permit User Sessions...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting OpenSSH Server Key Generation...
Jul 15 13:43:04 wazuh-server systemd[1]: Reached target Network is Online.
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Wazuh manager...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Notify NFS peers of a restart...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Apply the settings specified in cloud-config...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting wazuh-indexer...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Finds and configures elastic network interfaces...
Jul 15 13:43:04 wazuh-server systemd[1]: Starting System Logging Service...
Jul 15 13:43:04 wazuh-server systemd[1]: Started Filebeat sends log files to Logstash or directly to Elasticsearch..
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Dynamically Generate Message Of The Day...
Jul 15 13:43:04 wazuh-server systemd[1]: Started amazon-ssm-agent.
Jul 15 13:43:04 wazuh-server systemd[1]: Started Permit User Sessions.
Jul 15 13:43:04 wazuh-server systemd[1]: Started Notify NFS peers of a restart.
Jul 15 13:43:04 wazuh-server systemd[1]: Started Finds and configures elastic network interfaces.
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Terminate Plymouth Boot Screen...
Jul 15 13:43:04 wazuh-server systemd[1]: Started Job spooling tools.
Jul 15 13:43:04 wazuh-server systemd[1]: Starting Wait for Plymouth Boot Screen to Quit...
Jul 15 13:43:05 wazuh-server systemd[1]: Started Command Scheduler.
Jul 15 13:43:05 wazuh-server systemd[1]: Received SIGRTMIN+21 from PID 1623 (plymouthd).
Jul 15 13:43:05 wazuh-server systemd[1]: Started Dynamically Generate Message Of The Day.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Terminate Plymouth Boot Screen.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Wait for Plymouth Boot Screen to Quit.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Getty on tty1.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Serial Getty on ttyS0.
Jul 15 13:43:05 wazuh-server systemd[1]: Reached target Login Prompts.
Jul 15 13:43:05 wazuh-server systemd[1]: Started System Logging Service.
Jul 15 13:43:05 wazuh-server systemd[1]: Started OpenSSH Server Key Generation.
Jul 15 13:43:05 wazuh-server systemd[1]: Starting OpenSSH server daemon...
Jul 15 13:43:05 wazuh-server systemd[1]: Started OpenSSH server daemon.
Jul 15 13:43:05 wazuh-server systemd[1]: Created slice User Slice of root.
Jul 15 13:43:05 wazuh-server systemd[1]: Started Session 1 of user root.
Jul 15 13:43:07 wazuh-server systemd[1]: Time has been changed
Jul 15 13:43:27 wazuh-server systemd-entrypoint[2324]: WARNING: A terminally deprecated method in java.lang.System has been called
Jul 15 13:43:27 wazuh-server systemd-entrypoint[2324]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.OpenSearch (file:/usr/share/wazuh-indexer/lib/opensearch-2.13.0.jar)
Jul 15 13:43:27 wazuh-server systemd-entrypoint[2324]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.OpenSearch
Jul 15 13:43:27 wazuh-server systemd-entrypoint[2324]: WARNING: System::setSecurityManager will be removed in a future release
Jul 15 13:43:30 wazuh-server systemd-entrypoint[2324]: Jul 15, 2024 1:43:30 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
Jul 15 13:43:30 wazuh-server systemd-entrypoint[2324]: WARNING: COMPAT locale provider will be removed in a future release
Jul 15 13:43:31 wazuh-server systemd[1]: Started Apply the settings specified in cloud-config.
Jul 15 13:43:31 wazuh-server systemd[1]: Starting Initial hibernation setup job...
Jul 15 13:43:31 wazuh-server systemd[1]: Starting Execute cloud user/final scripts...
Jul 15 13:43:31 wazuh-server systemd[1]: Started Initial hibernation setup job.
Jul 15 13:43:32 wazuh-server systemd[1]: Started Session c1 of user root.
Jul 15 13:43:32 wazuh-server systemd[1]: Started Session c2 of user root.
Jul 15 13:43:32 wazuh-server systemd[1]: Stopping OpenSSH server daemon...
Jul 15 13:43:32 wazuh-server systemd[1]: Stopped OpenSSH server daemon.
Jul 15 13:43:32 wazuh-server systemd[1]: Stopped OpenSSH Server Key Generation.
Jul 15 13:43:32 wazuh-server systemd[1]: Stopping OpenSSH Server Key Generation...
Jul 15 13:43:32 wazuh-server systemd[1]: Starting OpenSSH server daemon...
Jul 15 13:43:32 wazuh-server systemd[1]: Started OpenSSH server daemon.
Jul 15 13:43:32 wazuh-server systemd[1]: Started Execute cloud user/final scripts.
Jul 15 13:43:34 wazuh-server systemd-entrypoint[2324]: WARNING: A terminally deprecated method in java.lang.System has been called
Jul 15 13:43:34 wazuh-server systemd-entrypoint[2324]: WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.Security (file:/usr/share/wazuh-indexer/lib/opensearch-2.13.0.jar)
Jul 15 13:43:34 wazuh-server systemd-entrypoint[2324]: WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.Security
Jul 15 13:43:34 wazuh-server systemd-entrypoint[2324]: WARNING: System::setSecurityManager will be removed in a future release
Jul 15 13:43:44 wazuh-server systemd-logind[1850]: New session 2 of user wazuh-user.
Jul 15 13:43:44 wazuh-server systemd[1]: Created slice User Slice of wazuh-user.
Jul 15 13:43:44 wazuh-server systemd[1]: Started Session 2 of user wazuh-user.
Jul 15 13:44:20 wazuh-server systemd[1]: Started Wazuh manager.
Jul 15 13:44:20 wazuh-server systemd[1]: wazuh-indexer.service start operation timed out. Terminating.
f-galland commented 2 months ago

Closer inspection on /var/log/messages indicates that the interfaces are indeed ready by the time the indexer starts:

Jul 15 13:43:02 wazuh-server cloud-init: Cloud-init v. 19.3-46.amzn2.0.2 running 'init' at Mon, 15 Jul 2024 13:43:02 +0000. Up 31.59 seconds.
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |  eth0  | True |        172.31.36.215        | 255.255.240.0 | global | 0e:e7:8f:f7:d3:35 |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |  eth0  | True | fe80::ce7:8fff:fef7:d335/64 |       .       |  link  | 0e:e7:8f:f7:d3:35 |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |   lo   | True |          127.0.0.1          |   255.0.0.0   |  host  |         .         |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |   lo   | True |           ::1/128           |       .       |  host  |         .         |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: ++++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++++
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | Route |   Destination   |   Gateway   |     Genmask     | Interface | Flags |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |   0   |     0.0.0.0     | 172.31.32.1 |     0.0.0.0     |    eth0   |   UG  |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |   1   | 169.254.169.254 |   0.0.0.0   | 255.255.255.255 |    eth0   |   UH  |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |   2   |   172.31.32.0   |   0.0.0.0   |  255.255.240.0  |    eth0   |   U   |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-------------+---------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: | Route | Destination | Gateway | Interface | Flags |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-------------+---------+-----------+-------+
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |   9   |  fe80::/64  |    ::   |    eth0   |   U   |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |   11  |    local    |    ::   |    eth0   |   U   |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: |   12  |   ff00::/8  |    ::   |    eth0   |   U   |
Jul 15 13:43:02 wazuh-server cloud-init: ci-info: +-------+-------------+---------+-----------+-------+
Jul 15 13:43:03 wazuh-server dhclient[2124]: XMT: Solicit on eth0, interval 2130ms.
Jul 15 13:43:04 wazuh-server systemd: Started Postfix Mail Transport Agent.
Jul 15 13:43:04 wazuh-server systemd: Started Initial cloud-init job (metadata service crawler).
Jul 15 13:43:04 wazuh-server systemd: Reached target Cloud-config availability.
Jul 15 13:43:04 wazuh-server systemd: Starting Permit User Sessions...
Jul 15 13:43:04 wazuh-server systemd: Starting OpenSSH Server Key Generation...
Jul 15 13:43:04 wazuh-server systemd: Reached target Network is Online.
Jul 15 13:43:04 wazuh-server systemd: Starting Wazuh manager...
Jul 15 13:43:04 wazuh-server systemd: Starting Notify NFS peers of a restart...
Jul 15 13:43:04 wazuh-server systemd: Starting Apply the settings specified in cloud-config...
Jul 15 13:43:04 wazuh-server systemd: Starting wazuh-indexer...
f-galland commented 2 months ago

I also tried starving the system's RAM to check if that could trigger a timeout, but it exits with a code instead:

docker run --rm colinianking/stress-ng --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.7;}' < /proc/meminfo)k --vm-keep -m 1

[root@wazuh-server ~]# systemctl start wazuh-indexer
Job for wazuh-indexer.service failed because the control process exited with error code. See "systemctl status wazuh-indexer.service" and "journalctl -xe" for details.

[root@wazuh-server ~]# systemctl status wazuh-indexer
● wazuh-indexer.service - wazuh-indexer
   Loaded: loaded (/usr/lib/systemd/system/wazuh-indexer.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2024-07-15 17:57:56 UTC; 14s ago
     Docs: https://documentation.wazuh.com
  Process: 5698 ExecStart=/usr/share/wazuh-indexer/bin/systemd-entrypoint -p ${PID_DIR}/wazuh-indexer.pid --quiet (code=exited, status=1/FAILURE)
 Main PID: 5698 (code=exited, status=1/FAILURE)

Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: output:
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: #
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: # There is insufficient memory for the Java Runtime Environment to continue.
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: # Native memory allocation (mmap) failed to map 4120903680 bytes for committing reserved memory.
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: # An error report file with more information is saved as:
Jul 15 17:57:56 wazuh-server systemd-entrypoint[5698]: # /var/log/wazuh-indexer/hs_err_pid5847.log
Jul 15 17:57:56 wazuh-server systemd[1]: wazuh-indexer.service: main process exited, code=exited, status=1/FAILURE
Jul 15 17:57:56 wazuh-server systemd[1]: Failed to start wazuh-indexer.
Jul 15 17:57:56 wazuh-server systemd[1]: Unit wazuh-indexer.service entered failed state.
Jul 15 17:57:56 wazuh-server systemd[1]: wazuh-indexer.service failed.
f-galland commented 2 months ago

I brought this subject up with the support team and they mention this is a common issue when the disk is under stress and that a common fix is to increase the systemd service timeout. We might want to try that approach.

f-galland commented 2 months ago

The old packages used to use a timeout of 180 seconds, while the newer ones generated from the wazuh-indexer repo, are currently using 75.

We need to bump this up and test the AMI deploy again.