napalm-automation / napalm-logs

Cross-vendor normalisation for network syslog messages, following the OpenConfig and IETF YANG models
https://napalm-logs.readthedocs.io/en/latest/
Apache License 2.0

Problem upgrading napalm-logs to 0.11.0 (python 3) #382

Open teresalves opened 7 months ago

teresalves commented 7 months ago

Hey @mirceaulinic,

We previously had a Python 2 version deployed from commit 94ff8ad31d5135401b65d396760cac8d765ba805. Since the upgrade, we noticed that a significant number of messages have been dropped and we do not know why.

Listener: kafka
Publisher: kafka and zmq
Config file:

listener:
  kafka:
    bootstrap_servers:
      - stuff
    topic: <reading topic>
publisher:
  - kafka:
      bootstrap_servers:
        - stuff
      topic: <processed topic>
      send_raw: True
      send_unknown: True
      serializer: json
      no_encrypt: True
  - zmq:
      error_whitelist:
        - BGP_MD5_INCORRECT
        - BGP_PEER_NOT_CONFIGURED
        - BGP_PREFIX_THRESH_EXCEEDED
        - CONFIGURATION_COMMIT_REQUESTED
        - AGENT_INITIALIZED
buffer:
  mem:
    expire_time: 30

The messages that do not show up for kafka are also not showing up for zmq, so it does not look publisher-specific. The amount of messages consumed by the listener did not go down either.

There are absolutely no errors in the logs. We saw an example of two seemingly identical messages (apart from the process id) that were both in the router logs, but only one of them reached kafka:

(...)
Mar 19 03:23:08  router mgd[86521]: UI_COMMIT_COMPLETED:  : commit complete
(...)
Mar 19 03:24:23  router mgd[86605]: UI_COMMIT_COMPLETED:  : commit complete

We did some manual commits and we didn't see the message arriving in either zmq or kafka (but when testing again with py2 we saw it passing through in both cases).

We have increased the hwm from 1000 to 10000. We have also patched to add an extra backlog option to the pub_proxy and the internal zmq publisher (line 18):

+        self.sub.setsockopt(zmq.BACKLOG, 5000)

You can clearly see from this graph when the upgrade happened: [graph omitted]

We are not hitting memory or CPU limits, but I do notice that if I increase the log level I get fewer messages being produced. Could threading be influencing this too?

Using a bullseye base image with Python 3.9.
zmq version: 20.0.0 (previously 19.0.2)
kafka-python version: 2.0.1-2 (previously 1.3.5)

Has anyone seen this before? Or do you have any idea how we can approach this now? Could there be a particular zmq breaking change between python2 and python3?

Thanks in advance!

mirceaulinic commented 7 months ago

Hey @teresalves - do you get different results / experience if you disable the buffer interface (i.e., not enable it at all) and set the ZeroMQ HWM to 0 (hwm: 0 in the napalm-logs config)?
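
In config terms that would mean dropping the buffer section from the config entirely and setting the top-level hwm key, e.g. (keeping everything else unchanged):

# no buffer section at all
hwm: 0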

teresalves commented 7 months ago

I had tried disabling the buffer, but not setting the hwm to 0. I tried that combination and got an error:

Process Process-25:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/dist-packages/napalm_logs/server.py", line 321, in start
    self.pub.send_multipart(
  File "/usr/local/lib/python3.9/dist-packages/zmq/sugar/socket.py", line 553, in send_multipart
    self.send(msg, SNDMORE | flags, copy=copy, track=track)
  File "/usr/local/lib/python3.9/dist-packages/zmq/sugar/socket.py", line 505, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket

And production just dropped: [graph omitted]

Then I removed the hwm flag and only deleted the buffer, and it produced again, though not at the volume we had with Python 2. I had tested this last scenario before and saw that the amount of COMMIT messages showing up did not increase. So I guess the overall volume only increased because we aren't filtering out the unnecessary repeated messages, as the buffer is supposed to do.
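
(For context on that last point: the buffer interface is meant to remember what has already been published and drop repeats seen again within expire_time seconds. Conceptually it works something like the sketch below; this is only an illustration of the idea, not the actual napalm-logs implementation.)

import time


class MemBufferSketch:
    # Illustration only: remember message keys and drop repeats seen within
    # expire_time seconds. Not the real napalm_logs buffer code.

    def __init__(self, expire_time=30):
        self.expire_time = expire_time
        self.seen = {}  # message key -> timestamp of last publish

    def should_publish(self, key):
        now = time.time()
        # forget keys older than the expiry window
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.expire_time}
        if key in self.seen:
            return False  # duplicate within the window, drop it
        self.seen[key] = now
        return True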

teresalves commented 7 months ago

Hey @mirceaulinic! Sorry to bother you, but do we have any ideas on what we should try here? I tried changing some of the threading to multiprocessing, but it made no difference. I am totally out of ideas; any input would be greatly appreciated 🙏 Have a nice weekend!

teresalves commented 6 months ago

Hey @mirceaulinic! A bit of an update: we significantly improved this by simply scaling up from 1 pod to 4 pods. We also saw some improvements when tweaking the kafka producer buffer values by patching napalm_logs/transport/kafka.py:

             self.producer = kafka.KafkaProducer(
-                bootstrap_servers=self.bootstrap_servers
+                bootstrap_servers=self.bootstrap_servers,
+                batch_size=32768,
+                max_request_size=20971520,
+                buffer_memory=67108864,
+                send_buffer_bytes=1310720,
+                max_in_flight_requests_per_connection=50,
+                retries=3,
+                max_block_ms=30000,
+                linger_ms=1000
             )
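
If this tuning proves useful, one way to avoid hard-coding the values would be to pass arbitrary kafka-python producer options through from the publisher config. This is only a sketch; the producer_options key below is hypothetical and does not exist in napalm-logs today:

import kafka


def build_producer(bootstrap_servers, producer_options=None):
    # Hypothetical helper: forward any extra kafka-python options straight to
    # KafkaProducer (batch_size, linger_ms, buffer_memory, ...).
    return kafka.KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        **(producer_options or {}),
    )


# e.g. build_producer(['broker:9092'], {'batch_size': 32768, 'linger_ms': 1000})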

We also increased the internal zmq buffer with ZMQ_INTERNAL_HWM = 10000 (instead of 1000 - but there is a config flag for this, so no real need to patch) and added a new zmq backlog:

         self.sub.setsockopt(zmq.SNDHWM, self.hwm)
+        self.sub.setsockopt(zmq.BACKLOG, 5000)

-> 5000 is not a definitive value, but it is certainly worth making this configurable
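
A minimal sketch of what a configurable backlog could look like on a plain pyzmq socket (zmq_backlog is a made-up option name here, and 100 is libzmq's default BACKLOG):

import zmq

opts = {"hwm": 10000, "zmq_backlog": 5000}  # stand-in for the napalm-logs opts dict

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.setsockopt(zmq.SNDHWM, opts.get("hwm", 1000))
# BACKLOG caps the queue of pending TCP connections on a bound socket
pub.setsockopt(zmq.BACKLOG, opts.get("zmq_backlog", 100))
pub.bind("tcp://0.0.0.0:49017")  # example bind address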

New problem: we run this in a Kubernetes deployment, which means there is a load balancer that hits only one of the pods at a time. If a zmq consumer uses that ingress, it will only ever reach one of the pods and we lose 3/4 of the messages. So, AFAIK, combining multiple pods with the built-in zmq publisher does not work.
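
To illustrate the constraint: a ZeroMQ SUB socket only receives from the publishers it is actually connected to, so an ingress that resolves to a single pod only ever exposes that pod's stream. A SUB socket can, however, connect to every pod explicitly. A rough sketch, with placeholder addresses and with decryption/deserialization of the napalm-logs payload left out:

import zmq

# Placeholder pod addresses; in practice these might come from a headless
# Service / DNS lookup. The port is only an example.
POD_ENDPOINTS = [
    "tcp://napalm-logs-0:49017",
    "tcp://napalm-logs-1:49017",
    "tcp://napalm-logs-2:49017",
    "tcp://napalm-logs-3:49017",
]

ctx = zmq.Context()
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
for endpoint in POD_ENDPOINTS:
    sub.connect(endpoint)  # one SUB socket can connect to many publishers

while True:
    raw = sub.recv()  # raw payload; decoding is omitted in this sketch
    print(raw)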

Other problems we found throughout the investigation:

     def publish(self, obj):
+        if isinstance(obj, str):
+            obj = obj.encode()
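
(The guard is needed because pyzmq under Python 3 only accepts bytes, whereas under Python 2 a str was already bytes.) A self-contained sketch of the idea; the attribute names are illustrative, not the exact napalm-logs transport code:

import zmq


class ZMQPublisherSketch:
    # Minimal stand-in for the zmq transport, showing the str -> bytes guard.

    def __init__(self, address="tcp://0.0.0.0:49017"):
        self.context = zmq.Context()
        self.socket = self.context.socket(zmq.PUB)
        self.socket.bind(address)

    def publish(self, obj):
        if isinstance(obj, str):
            obj = obj.encode()  # pyzmq on Python 3 requires bytes
        self.socket.send(obj)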

Small note:

I hope this helps! I'll create a PR soon too.