vromero / activemq-artemis-helm

Helm chart for a cluster of ActiveMQ Artemis (Work in progress)

Production ready? review 2020 #43

Closed y0zg closed 4 years ago

y0zg commented 4 years ago

Hi @vromero ,

Getting back to the production readiness topic in https://github.com/vromero/activemq-artemis-helm/issues/14:

Are all these questions still open, or have you managed to sort some of them out? I'm mostly interested in Prometheus metrics:

  1. Haven't decided whether to generate the config or use KUBE_PING.
  2. Artemis can't handle dynamic cluster sizes (a cluster of static size has to be formed at start); I have no idea what to do about this.
  3. Haven't completed the integration with Prometheus; a messaging broker without metrics/alarms is more a problem than a solution.
  4. Not sure what to do about load balancing. Today the slave is not-ready, but not-ready messes up things like helm install --wait or deploying stateful sets with replicas > 1. No idea yet what to do about this.
vromero commented 4 years ago

Sadly, IMHO it's still not production ready, although I'd love to hear from @DanSalt.

  1. Static config; no need to bring another dependency (and potential dependency hell) into the mix to do something static config can easily do.
  2. It still doesn't. The Artemis devs are debating a new pluggable cluster system that hopefully will play better with k8s. What I've decided in the meantime is to keep the chart MASTER/SLAVE only, because Kubernetes and Artemis clustering overlap in functionality. Also, one common misunderstanding here: Artemis can form clusters, but inside the cluster there are only master/slave (or master without slave) pairs.
  3. This is a production blocker, but I can't say it's difficult at all, just some work to do.
  4. This one bothers me enormously. Slaves in Artemis keep their listener closed, and this is very confusing for Kubernetes. I've envisioned a keepalive using scripts (and master/slave only, without a load balancer, so clients will have to declare the master and the slave, as in the sketch below) that might work, but I need to actually go and try it.
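
A minimal sketch of the "clients declare the master and the slave" idea on the client side; the service names and the brokerUrl key are hypothetical and not taken from this chart, but the URL follows the standard Artemis core failover syntax:

# Hypothetical client-side broker URL (service names are illustrative):
# list both the master and the slave explicitly and let the client fail over,
# instead of putting a load balancer in front of the pair.
brokerUrl: "(tcp://artemis-master:61616,tcp://artemis-slave:61616)?ha=true&reconnectAttempts=-1"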
DanSalt commented 4 years ago

Hiya @vromero, @y0zg ..

1 & 2 - Agree this is a pain - even more so when you scale the cluster up/down, the static config gets very confused. We've been working on a jgroups-based configuration that avoids the static config -- not fully tested yet but potentially promising.

3 - We have full integration with Prometheus working no problem, and we even created a Grafana dashboard that might be generically useful to people. I'll take a look at what's different between our config and @vromero's and see if we can do a PR (incl. the dashboard).

4 - This one is a problem, and I seem to flip-flop on it. FWIW, we use Helm install without issues, but I think we may have tweaked the slave deployment to be Parallel (see the sketch at the end of this comment) - I'll take a look and post back. For the 'Not Ready' problem - I agree it's a pain, but we're living with it right now because (a) we can test deployment success based on the readiness of the service, not the pods, and (b) having the slaves 'not ready' actually lets us leverage more of the base Kubernetes capabilities (load balancing, etc.) than if we made them all "artificially ready" and solved balancing another way.

So your definition of "Production Ready" might vary depending on what capabilities you need. Today we deploy into mostly-static configurations (no scaling up/down outside of maintenance sessions), and we have worked around cluster status and readiness, so what we have today works for us. But if you need more, you may not consider it ready just yet.
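
A rough sketch of the "Parallel for slaves" tweak mentioned in point 4 above; podManagementPolicy is a standard StatefulSet field, but the names, labels, and image below are illustrative and not taken from this chart:

# Hypothetical slave StatefulSet. With the default OrderedReady policy a
# never-ready slave pod blocks the rollout of the next replica; Parallel
# starts all slave pods regardless of readiness.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: activemq-artemis-slave
spec:
  serviceName: activemq-artemis-slave
  replicas: 2
  podManagementPolicy: Parallel   # start slaves in parallel, ignore readiness ordering
  selector:
    matchLabels:
      app: activemq-artemis
      role: slave
  template:
    metadata:
      labels:
        app: activemq-artemis
        role: slave
    spec:
      containers:
        - name: artemis
          image: vromero/activemq-artemis   # illustrative image name
          ports:
            - containerPort: 61616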

y0zg commented 4 years ago

Thanks @DanSalt. Currently I'm mostly interested in the Prometheus integration, and afterwards I'll consider whether we are ready to go with the whole solution. It would be much appreciated if you could have a look at item 3.

DanSalt commented 4 years ago

@y0zg Will do. As a starting point, what happens when you try to enable metrics by setting this flag to true? Do you get any Artemis nodes showing up in the Prometheus scrape config?

y0zg commented 4 years ago

@DanSalt currently we don't use the Prometheus Operator; we deploy Prometheus via its Helm chart (we do plan to migrate to the operator). Right now I'm trying to understand which service:port I should use in the Prometheus scrape job, as there are several different ones; I'd appreciate it if you could shed some light:

  * 5556 - https://github.com/vromero/activemq-artemis-helm/blob/master/activemq-artemis/values.yaml#L88

  * jmxexporter 9404 - https://github.com/vromero/activemq-artemis-helm/blob/master/activemq-artemis/templates/master-statefulset.yaml#L90

  * jmx 9494 - https://github.com/vromero/activemq-artemis-helm/blob/master/activemq-artemis/templates/master-service.yaml#L23

Thank you!

y0zg commented 4 years ago

I think I managed to solve this after changing all the mentioned ports to 9404:

curl activemq-activemq-artemis-master.default.svc.cluster.local:9404 
100 38383  100 38383    0     0   457k      0 --:--:-- --:--:-- --:--:--  457k
# HELP jmx_config_reload_success_total Number of times configuration have successfully been reloaded.
# TYPE jmx_config_reload_success_total counter
jmx_config_reload_success_total 0.0
# HELP jmx_config_reload_failure_total Number of times configuration have failed to be reloaded.
# TYPE jmx_config_reload_failure_total counter
jmx_config_reload_failure_total 0.0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1022.29
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.589882491952E9
...
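
For reference, a minimal scrape job matching this setup might look like the following; this is only a sketch, assuming Prometheus is deployed from its Helm chart with an extra scrape_configs entry, and the target is the same service name and port as in the curl above:

scrape_configs:
  - job_name: 'activemq-artemis'            # arbitrary job name
    static_configs:
      - targets:
          # master service on the jmxexporter port, as verified with curl
          - 'activemq-activemq-artemis-master.default.svc.cluster.local:9404'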
DanSalt commented 4 years ago

Hi @y0zg -- yes, exactly right, because Prometheus will scrape the actual endpoints, not the service endpoint.
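
A sketch of what endpoint-based scraping could look like using Prometheus' Kubernetes service discovery; the relabeling keeps only endpoints whose port is named jmxexporter, which is assumed to match the port name discussed above (check the chart's templates):

scrape_configs:
  - job_name: 'activemq-artemis-endpoints'
    kubernetes_sd_configs:
      - role: endpoints                     # discover the pod endpoints behind services
    relabel_configs:
      # keep only endpoints whose port is named "jmxexporter"
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: jmxexporter
        action: keep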