wazuh / wazuh-agent

Wazuh agent, the Wazuh agent for endpoints.
GNU Affero General Public License v3.0

wazuh-logcollector: ERROR: socketerr (not available) problem. #10

Closed rustybofh closed 4 months ago

rustybofh commented 5 months ago

Hi, I’m running Wazuh agent 4.7.4 on pfSense 2.7 and I keep getting these errors even though it’s working and sending information to the manager. I’ve tried changing agent parameters like queue size and events per second, but the issue persists. The manager version is 4.8, and for more context, I have Suricata running on pfSense, but even stopping Suricata on the interfaces doesn’t resolve the problem.

I’ve got this:

2024/06/13 17:34:42 wazuh-logcollector: INFO: Successfully reconnected to 'queue/sockets/queue'
2024/06/13 17:34:42 wazuh-logcollector: ERROR: socketerr (not available).
2024/06/13 17:34:42 wazuh-logcollector: ERROR: Unable to send message to 'queue/sockets/queue' (wazuh-agentd might be down). Attempting to reconnect.
2024/06/13 17:34:42 wazuh-logcollector: INFO: Successfully reconnected to 'queue/sockets/queue'
2024/06/13 17:34:42 wazuh-logcollector: ERROR: socketerr (not available).
2024/06/13 17:34:42 wazuh-logcollector: ERROR: Unable to send message to 'queue/sockets/queue' after a successfull reconnection...
2024/06/13 17:34:42 wazuh-logcollector: ERROR: socketerr (not available).
2024/06/13 17:34:42 wazuh-logcollector: ERROR: Unable to send message to 'queue/sockets/queue' (wazuh-agentd might be down). Attempting to reconnect.
2024/06/13 17:34:42 wazuh-logcollector: INFO: Successfully reconnected to 'queue/sockets/queue

I have tried changing parameters in the local.conf, but nothing has worked. The agent appears fine in the manager and there are no enrollment issues. There is no extra hop from the manager to the pfSense, and the other agents in various locations do not have this issue. I also do not see any firewall blocks.

Any hints or suggestions would be appreciated.

Thanks!

Update: This issue only occurs when monitoring the WAN interface. It does not happen when monitoring only the LAN interface.

vikman90 commented 4 months ago

Hi @rustybofh

The Wazuh agent is divided into multiple processes that communicate through a local socket: /var/ossec/queue/sockets/queue. The wazuh-agentd process exposes this socket so that collectors (e.g., Logcollector, FIM, etc.) can send messages to the manager.
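As a rough illustration of this kind of channel, the sketch below passes one message between two datagram Unix sockets. It is only an analogy and not Wazuh code: the `roundtrip` function and the temporary socket path are hypothetical, and it never touches /var/ossec/queue/sockets/queue.

```python
import os
import tempfile
from socket import socket, AF_UNIX, SOCK_DGRAM


def roundtrip(message: bytes) -> bytes:
    """Send one datagram over a throwaway Unix socket and return what arrives."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, 'queue')  # hypothetical stand-in for the real queue path

    receiver = socket(AF_UNIX, SOCK_DGRAM)  # plays the role of the process exposing the socket
    sender = socket(AF_UNIX, SOCK_DGRAM)    # plays the role of a collector
    try:
        receiver.bind(path)       # creates the socket file
        sender.connect(path)      # the collector attaches to it
        sender.send(message)      # queued in the kernel buffer until it is read
        return receiver.recv(65536)
    finally:
        sender.close()
        receiver.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(workdir)
```

Calling `roundtrip(b'1:test:Hello World')` returns the same bytes. Between `send` and `recv` the message sits in a buffer provided by the operating system, which is the component suspected in this issue.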

Trying to reproduce

The most common reason for a disconnection is that wazuh-agentd crashes. We see this is not the case here, as it can reconnect immediately. I suspect the issue lies with the internal buffer of that socket (provided by the operating system), so I conducted a proof of concept which worked without issues on Linux:

queue.py

```python
#!/usr/bin/env python3
# Print messages on Wazuh's queue (analysisd/agentd)
#
# Syntax: queue.py [-L] [PATH]
# Reads a line from stdin
# Standard message form: ::
#
# Example:
# echo '1:test:Hello World' | sudo ./queue.py -L

import argparse
from socket import socket, AF_UNIX, SOCK_DGRAM, SO_SNDBUF, SOL_SOCKET

ADDR = '/var/ossec/queue/sockets/queue'
BLEN = 212992


def connect(addr, blen):
    """Connect to the queue socket and enlarge the send buffer if it is smaller than blen."""
    sock = socket(AF_UNIX, SOCK_DGRAM)
    sock.connect(addr)
    oldbuf = sock.getsockopt(SOL_SOCKET, SO_SNDBUF)

    if oldbuf < blen:
        sock.setsockopt(SOL_SOCKET, SO_SNDBUF, blen)
        newbuf = sock.getsockopt(SOL_SOCKET, SO_SNDBUF)
        print("INFO: Buffer expanded from {0} to {1}".format(oldbuf, newbuf))

    return sock


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Print messages on Wazuh's queue")
    parser.add_argument('-L', '--loop', action='store_true', dest='loop', help='enable loop mode')
    parser.add_argument('PATH', nargs='?', default=ADDR, help='override default queue path')
    args = parser.parse_args()

    string = input().encode()
    sock = connect(args.PATH, BLEN)

    if args.loop:
        # Loop mode: flood the socket until an error (or Ctrl-C) stops the loop
        i = 0

        try:
            while True:
                sock.send(string)
                i += 1
        except BaseException as e:
            print(e)
            print("Messages: {0}\nBytes: {1}".format(i, i * len(string)))
    else:
        # Single-shot mode: send the line read from stdin once
        sock.send(string)

    sock.close()
```

As I understand it, pfSense is based on FreeBSD. I don't have pfSense, but I tested this on FreeBSD and encountered this error:

[Errno 55] No buffer space available

Rationale

This demonstrates a difference between the two platforms: if the socket memory fills up (because Logcollector generates more messages than the agent can handle), Linux performs an implicit wait (causing Logcollector to wait until space is available), while BSD generates an error code.

In fact, this is how I tested it on FreeBSD: with logcollector.debug=2 enabled, I appended numerous log lines to a monitored file, and Logcollector produced this debug message:

2024/06/18 08:39:23 wazuh-logcollector[16057] mq_op.c:127 at SendMSGAction(): DEBUG: Socket busy, discarding message.
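To make the platform difference concrete, here is a minimal sketch of how a sender could emulate the Linux-style wait on a platform that reports a full buffer as an error. This is an illustration only, not what Logcollector does (per the debug message above, it discards the message); the function name `send_with_backoff` and its `retries` and `delay` parameters are hypothetical.

```python
import errno
import time


def send_with_backoff(sock, payload, retries=5, delay=0.01):
    """Send one datagram, retrying while the kernel reports a full socket buffer.

    Returns True once the datagram is sent, or False after the retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            sock.send(payload)
            return True
        except OSError as e:
            # ENOBUFS is the "[Errno 55] No buffer space available" seen on FreeBSD;
            # EAGAIN is what a non-blocking socket typically reports when it is full.
            if e.errno not in (errno.ENOBUFS, errno.EAGAIN):
                raise
            if attempt < retries:
                # Exponential backoff gives the reader time to drain the queue
                time.sleep(delay * (2 ** attempt))
    return False
```

With a socket opened as in queue.py, `send_with_backoff(sock, string)` could replace the bare `sock.send(string)` in the loop, so a full buffer slows the sender down instead of ending the test. The trade-off is the same one described above: either the sender waits, or messages are dropped.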

So, my hypothesis is:

If this is correct, and seeing that Logcollector reconnects successfully, in terms of code, the end effect is nearly the same (aside from the error printed in the log).

Additionally, pfSense and FreeBSD are not officially supported, so I don't believe we can prioritize development to eliminate the error message.

Workaround

If my hypothesis is valid, and this is due to a capacity issue, I believe we can implement a workaround with the configuration:

I hope this helps.

Best regards.