teresalves opened this issue 7 months ago
Hey @teresalves - do you get different results / experience if you disable the buffer
interface (i.e., not enable it at all) and set the ZeroMQ HWM to 0 (hwm: 0
in the napalm-logs config)?
I had tried disabling the buffer, but not setting the hwm to 0. I tried that combination and got an error:
Process Process-25:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.9/dist-packages/napalm_logs/server.py", line 321, in start
self.pub.send_multipart(
File "/usr/local/lib/python3.9/dist-packages/zmq/sugar/socket.py", line 553, in send_multipart
self.send(msg, SNDMORE | flags, copy=copy, track=track)
File "/usr/local/lib/python3.9/dist-packages/zmq/sugar/socket.py", line 505, in send
return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
File "zmq/backend/cython/socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
File "zmq/backend/cython/socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket
And production just dropped.
Then I removed the hwm flag and only disabled the buffer, and it produced again, but not at the volume we had with Python 2. I had tested this last scenario before and saw that the number of COMMIT messages showing up did not increase. So I guess the volume only went up because we aren't filtering out those unnecessary repeated messages, as the buffer is supposed to do.
Hey @mirceaulinic! Sorry to bother you, but do we have any ideas on what we should try here? I tried changing some of the threading to multiprocessing, but it made no difference. I am totally out of ideas; any input would be greatly appreciated 🙏 Have a nice weekend!
Hey @mirceaulinic! A bit of an update: we significantly improved this by simply scaling up from 1 pod to 4 pods. We also saw some improvement when tweaking the Kafka buffer values by patching napalm_logs/transport/kafka.py:
  self.producer = kafka.KafkaProducer(
-     bootstrap_servers=self.bootstrap_servers
+     bootstrap_servers=self.bootstrap_servers,
+     batch_size=32768,
+     max_request_size=20971520,
+     buffer_memory=67108864,
+     send_buffer_bytes=1310720,
+     max_in_flight_requests_per_connection=50,
+     retries=3,
+     max_block_ms=30000,
+     linger_ms=1000
  )
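For reference, here is the same tuning as a standalone sketch (broker address and topic name are placeholders, and the values are simply what worked for us), with a short note on what each kafka-python option controls:

import kafka

producer = kafka.KafkaProducer(
    bootstrap_servers="broker-1:9092",         # placeholder, point at your brokers
    batch_size=32768,                          # bytes per partition batch (default 16384)
    max_request_size=20971520,                 # max size of a single produce request, in bytes
    buffer_memory=67108864,                    # total memory for buffering records not yet sent
    send_buffer_bytes=1310720,                 # TCP SO_SNDBUF size
    max_in_flight_requests_per_connection=50,  # unacknowledged requests allowed per connection
    retries=3,                                 # retry transient send failures
    max_block_ms=30000,                        # how long send() may block when the buffer is full
    linger_ms=1000,                            # wait up to 1s to fill batches before sending
)
producer.send("syslog.msgs", b"test message")  # hypothetical topic name
producer.flush()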
We also increased the internal ZMQ HWM and added a new ZMQ backlog option:
+ZMQ_INTERNAL_HWM = 10000
(instead of 1000 - but there is a flag for this, no real need to patch)
and
self.sub.setsockopt(zmq.SNDHWM, self.hwm)
+ self.sub.setsockopt(zmq.BACKLOG, 5000)
(5000 is not a definitive value, but it is certainly worth making it configurable)
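For context, here is a minimal sketch of a PUB socket with both options applied; the address, port and values are illustrative, not napalm-logs' actual defaults:

import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.setsockopt(zmq.SNDHWM, 10000)  # send high-water mark: messages queued per subscriber before PUB starts dropping
pub.setsockopt(zmq.BACKLOG, 5000)  # listen(2) backlog for pending TCP connections
pub.bind("tcp://0.0.0.0:49017")    # assumed publisher address/port
pub.send(b"hello")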
New problem: we run this in a Kubernetes deployment, which means a load balancer routes each connection to only one of the pods. If a ZMQ consumer connects through that ingress, it only reaches one of the pods and we lose 3/4 of the messages. So, AFAIK, combining multiple pods with the built-in ZMQ publisher does not work.
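For what it is worth, plain ZMQ does let a single SUB socket connect() to several publishers at once, so in principle the consumer could fan in from every pod directly instead of going through the load balancer. A rough sketch, assuming the pods are individually reachable (e.g. via a headless service) and that each publisher listens on the default 49017 port; the addresses are hypothetical:

import zmq

pods = [
    "tcp://napalm-logs-0.napalm-logs:49017",  # hypothetical per-pod addresses
    "tcp://napalm-logs-1.napalm-logs:49017",
    "tcp://napalm-logs-2.napalm-logs:49017",
    "tcp://napalm-logs-3.napalm-logs:49017",
]

ctx = zmq.Context()
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")
for addr in pods:
    sub.connect(addr)  # one SUB socket, many publishers

while True:
    print(sub.recv())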
Other problems we found throughout the investigation:
  def publish(self, obj):
+     if isinstance(obj, str):
+         obj = obj.encode()
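For context on why that guard is needed: pyzmq only accepts bytes-like frames, so a str payload fails at send time (presumably what started happening after the Python 3 move). A quick illustration, independent of napalm-logs:

import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")  # arbitrary local port for the example

pub.send(b"fine")  # bytes are accepted
pub.send("boom")   # str raises TypeError (use bytes, or send_string())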
There is no available flag for turning off the metrics attributes (metrics_include_attributes), and there are really a lot of them, so we had to turn them off because our Prometheus hated it:
- if self.opts.get("metrics_include_attributes", True):
+ if self.opts.get("metrics_include_attributes", False):
Just some small logs that were set to info instead of debug, which made it somewhat hard to search through everything. For example:
- log.info(
-     "Configured regex did not match for os: %s tag %s",
+ log.debug(
+     "Configured regex did not match for os: %s tag %s, all %s",
      self._name,
      msg_dict.get("tag", ""),
+     msg_dict,
  )
Small note: I hope this helps! I'll create a PR soon too.
Hey @mirceaulinic,
We previously had a Python 2 version deployed from commit 94ff8ad31d5135401b65d396760cac8d765ba805. Since the upgrade, we have noticed that a significant number of messages are being dropped and we do not know why.
Listener: kafka
Publisher: kafka and zmq
Config file:
The messages that do not show up in Kafka are also not showing up in ZMQ, so it does not look publisher-specific. The number of messages being consumed by the listener did not go down either.
There are absolutely no errors in the logs. We saw an example where two seemingly identical messages (apart from the process ID) were in the router, but only one reached Kafka:
We did some manual commits and we did not see the message arriving in either ZMQ or Kafka (but when testing again with py2 we saw it passing through in both cases).
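For reference, checking the ZMQ side can be done with a throwaway subscriber along these lines (sketch only: it assumes the default publisher address/port 49017, the default msgpack serializer, and security disabled, so adjust to the actual config):

import zmq
import umsgpack  # u-msgpack-python; assuming the default napalm-logs serializer

ctx = zmq.Context()
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:49017")  # assumed default publisher address/port
sub.setsockopt(zmq.SUBSCRIBE, b"")

while True:
    raw = sub.recv()
    print(umsgpack.unpackb(raw))  # only valid with security disabled and the msgpack serializer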
We have increased the hwm from 1000 to 10000. We have also patched in an extra backlog option for the pub_proxy and the internal ZMQ publisher:
+ self.sub.setsockopt(zmq.BACKLOG, 5000)
You can clearly see from this graph when the upgrade happened:
We are not hitting memory or CPU limits, but I do notice that if I increase the log level I get fewer messages being produced. Could threading be influencing this too?
Using a bullseye base image with Python 3.9.
zmq version: 20.0.0 (previously 19.0.2)
kafka-python version: 2.0.1-2 (previously 1.3.5)
Has anyone seen this before? Or do you have any idea how we can approach this now? Could there be a particular zmq breaking change between python2 and python3?
Thanks in advance!