Liveness probes run too often in the example deployment

rabbitmq / rabbitmq-peer-discovery-k8s

Kubernetes-based peer discovery mechanism for RabbitMQ

Liveness probes run too often in the example deployment #28

Closed Adiqq closed 6 years ago

Adiqq commented 6 years ago

Hi,

The example deployment defines the following probes:

        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 10
          timeoutSeconds: 10

In our cluster, these commands were executed every 10 seconds (the Kubernetes default when periodSeconds is not set), which led to enormous CPU usage (70%) for a RabbitMQ cluster without any workload. It would be nice to mention that this command is heavyweight and to provide a sane periodSeconds in the example.

michaelklishin commented 6 years ago

Is there any profiling or CPU context switching data that suggests that rabbitmqctl status is the main contributor as opposed to, say, a non-optimal runtime scheduler-to-core binding strategy? We would really prefer to not guess.

What would a "sane" value look like? Also, where should such notes go?

lukebakken commented 6 years ago

I would be interested to know how long the "enormous CPU usage (70%)" lasted; I assume it was brief. Also, it would be good to know your cluster size, RabbitMQ version, and Erlang version.

michaelklishin commented 6 years ago

Running the liveness probe every 60 seconds sounds more reasonable than every 10. I refrained from using rabbitmqctl node_health_check since a failing probe will lead to a node restart, which is too much, e.g., for a node in a resource alarm state (and unlikely to help in the medium term).

Will add some remarks that this is just an example and should be treated as such.
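
For illustration, the example probes with an explicit 60-second liveness period might look like the sketch below. Kubernetes uses periodSeconds: 10 when the field is omitted, which matches the 10-second interval reported above. The readiness period shown here is an assumption (it was not discussed in this thread), and the right values depend on the cluster, so treat this as a starting point rather than a recommendation.

```yaml
        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 30
          # Run every 60 s instead of the Kubernetes default of 10 s.
          periodSeconds: 60
          timeoutSeconds: 10
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 10
          # Assumption: same period as the liveness probe; tune for your workload.
          periodSeconds: 60
          timeoutSeconds: 10
```

A longer period reduces how often rabbitmqctl status is spawned, at the cost of slower detection of a failed node.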

Adiqq commented 6 years ago

@lukebakken I left the RabbitMQ cluster running for a few hours and usage stayed constant at ~70% CPU. At first I thought it was a bug or an issue with the Alpine RabbitMQ image, but disabling the probes fixed the problem and it now uses ~1% CPU. I used https://hub.docker.com/_/rabbitmq/ (rabbitmq:3.7.5-management-alpine) with 3 RabbitMQ pods.

macropin commented 6 years ago

The high CPU usage is due to this issue: https://github.com/bitnami/bitnami-docker-rabbitmq/pull/63 However, that PR does not fully resolve the issue, as the liveness probes do not inherit the ulimit.

michaelklishin commented 6 years ago

This plugin has nothing to do with OS limits. It just performs peer discovery.