AceHack closed this issue 6 years ago.
I would like to add that the scenario described above creates a serious split-brain problem where you end up with two isolated clusters acting independently even though they are listed as one unified cluster in K8s. This causes traffic to get load balanced across all 5 nodes even though you actually have a 3-node cluster and a 2-node cluster.
Just to explain the problem a little further.
First off, the 3-node cluster came up successfully with a random Erlang cookie secret.
Then, when the StatefulSet was scaled to 5 nodes, a new random Erlang cookie secret was created, which caused the two new nodes to fail to join the original 3 nodes.
The two new nodes did, however, join each other in a new cluster.
This is basically the definition of the split-brain problem.
What makes this so serious is that in K8s everything reports as good, with no problems. This hides the problem and takes a fair amount of troubleshooting to actually discover what is going on.
I’ve been observing many teams trying to agree on what constitutes a “healthy” node. It’s hilarious how little agreement there is. Regardless, peer discovery plugins use the health check mechanism provided by the backend, if any, and for Consul and etcd this is nothing more than periodic notifications that are only sent when the plugin is running, which in practice means when the node is running, which means it managed to rejoin the cluster (nodes that fail to rejoin eventually stop trying and fail).
Erlang cookie management and deployment tools are orthogonal to peer discovery backends and even node health checks. A node cannot know if it is in the “right” cluster. It also cannot know — with this backend anyway — how many nodes are supposed to be its peers at any given moment.
So your problem goes well beyond health check reporting.
In this specific example, the only reason why nodes 4 and 5 did not stop after unsuccessfully trying to join nodes 1 through 3 is because they managed to join each other. A CLI command cannot solve this problem: it would have reported success for all 5 nodes individually because they are clustered with a peer. As mentioned above, with this discovery method nodes cannot know how many peers are supposed to be there and what they are.
The fundamental problem here is Erlang cookie management, not health checks or health reporting of individual nodes to the discovery backend.
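For illustration only, one way to avoid the mismatched-cookie scenario on Kubernetes is to create the cookie secret once, before the StatefulSet exists, and have every pod consume that same secret on every scale operation. The names below (the rabbitmq-erlang-cookie secret, the RABBITMQ_ERLANG_COOKIE env var understood by the official image) are assumptions about a particular deployment, not something this plugin requires:

```sh
# Sketch, assuming kubectl access and the official rabbitmq image: generate the
# shared Erlang cookie exactly once and store it as a Secret.
kubectl create secret generic rabbitmq-erlang-cookie \
  --from-literal=cookie="$(openssl rand -hex 32)"

# Every pod in the StatefulSet then injects the same value, e.g. via the
# RABBITMQ_ERLANG_COOKIE environment variable or by mounting the secret at
# /var/lib/rabbitmq/.erlang.cookie. Scaling the StatefulSet later reuses the
# existing secret, so nodes 4 and 5 come up with the same cookie as nodes 1-3.
```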
rabbitmqctl node_health_check already does enough checks to be a good starting point that many teams managed to agree on.
The only check I can think of for backends such as this one is: list all peer members and assert that there are at least N of them. That sounds like a good starting point worth discussing on the list first.
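To make that concrete, here is a rough sketch of such a check. This is not an existing rabbitmqctl command; the minimum member count has to be supplied by the operator, since the node cannot know it with this backend:

```sh
#!/bin/sh
# Sketch: assert that this node sees at least MIN_MEMBERS running cluster
# members. MIN_MEMBERS is an operator-supplied assumption.
MIN_MEMBERS="${MIN_MEMBERS:-3}"

# rabbit_mnesia:cluster_nodes(running) is the list of running members as seen
# by this node; a node that only clustered with itself reports a length of 1.
RUNNING=$(rabbitmqctl eval 'length(rabbit_mnesia:cluster_nodes(running)).')

if [ "$RUNNING" -ge "$MIN_MEMBERS" ]; then
  exit 0
fi
echo "only $RUNNING of at least $MIN_MEMBERS expected members are running" >&2
exit 1
```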
I’m going to add a doc section in the cluster formation guide about this scenario.
I suspect it should be possible to use the classic config backend on Kubernetes just fine since it only requires a RabbitMQ config file entry. With that backend and DNS the number of nodes is known ahead of time and is assumed to be fixed. The downside of this is, well, that it is fixed and that node hostnames must be known ahead of time.
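For reference, a minimal sketch of what that config file entry looks like, assuming StatefulSet-style DNS names (the hostnames below are placeholders for whatever names the deployment actually uses):

```sh
# Classic config backend: the full, fixed member list lives in rabbitmq.conf.
cat >> /etc/rabbitmq/rabbitmq.conf <<'EOF'
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
cluster_formation.classic_config.nodes.2 = rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
cluster_formation.classic_config.nodes.3 = rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local
EOF
```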
Can you give me the link to the rabbitmq-users list? I'm not aware of where it's located. Some form of distributed consensus on leader election and cluster membership should be able to avoid the split-brain problem even in dynamically sized clusters; it's a pretty well-solved problem nowadays. Also, I would like to disagree that the fundamental problem is Erlang cookie management. Cluster membership health is the fundamental problem here. No matter how the external environment is set up, bad Erlang cookies or not, it's important to be able to report the correct status of cluster membership.
Hi Aaron - the mailing list is located here.
@michaelklishin This same problem happens even when all cookies are the same and there is no cookie problem. It happens when there is a network partition separating the 2 nodes from the 3 nodes. Two different clusters form, causing split brain.
@AceHack I find that hard to believe. Peer discovery is not involved in partition handling in any way, and what you claim to happen is not reported elsewhere, especially when all nodes have the same cookie. That includes cases that use rabbitmq-autocluster, which has been around for years and has had Kubernetes discovery support for over a year IIRC. I suspect the real issue here is general confusion about how RabbitMQ clusters operate.
There is no split brain problem when clusters are resized and the cookie is the same. Newly added nodes will join the existing cluster or fail, unless they can discover a different set of nodes to join. Removing nodes is even more trivial. We've seen all of those scenarios with different automation tools, in particular BOSH, and Kubernetes is in no way special.
What rabbitmq-autocluster added (and now this plugin) is initial cluster member discovery. Nothing else has changed.
Hi @AceHack
History:
The pivotalrabbitmq/rabbitmq-autocluster:3.7.XXX images were created only to make the examples easy. The intention is/was not to use them in production. I created the pivotalrabbitmq/rabbitmq-autocluster:3.7.XXX images because we weren't sure whether to ship the rabbitmq-peer-discovery.xxx plugins with the setup.

Current situation:
The rabbitmq-peer-discovery.xxx plugins are already in the RabbitMQ 3.7.0 setup.

Next steps:

Hope it is clear now.
@AceHack FYI we changed from the Pivotal Docker image to the official Docker image. See: https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/pull/13 You can find the new example here:
https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/blob/master/examples/k8s_statefulsets/rabbitmq.yaml
When reporting readiness and liveness to K8s in a StatefulSet, one should not report healthy until the RabbitMQ node has joined the other nodes in the cluster. The status currently reports healthy even when the Erlang cookie is wrong and the node was unable to join the cluster. This is a bug and should be addressed by using a different RabbitMQ command that also returns health based on cluster membership status.
I am not suggesting that RabbitMQ have any knowledge of K8s at all. I'm suggesting that RabbitMQ should be aware of its own cluster state no matter where it's running, and that an individual node should be able to report on its own cluster membership state even when running outside of any orchestrator.
Ideally, there would be a rabbitmqctl command that returns a non-zero exit code when the node fails to join any other members of a RabbitMQ cluster, completely unrelated to K8s or any other orchestrator. This command could then be used for the readinessProbe in K8s.
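As an illustration of what such a command or script could look like (nothing below exists as a single rabbitmqctl command today; it is only a sketch that an exec readinessProbe could run):

```sh
#!/bin/sh
# Hypothetical readiness check: succeed only once the local node is up and is
# clustered with at least one other member. Nothing here is K8s-specific.
set -e

# Fails with a non-zero exit code if the local node is not healthy at all.
rabbitmqctl node_health_check -q

# A node that failed to join any peer only sees itself, i.e. a count of 1.
RUNNING=$(rabbitmqctl eval 'length(rabbit_mnesia:cluster_nodes(running)).')
if [ "$RUNNING" -lt 2 ]; then
  echo "node is not clustered with any peer (running members: $RUNNING)" >&2
  exit 1
fi
```

As noted earlier in the thread, a check like this still cannot detect nodes 4 and 5 clustering only with each other, because the node has no notion of the intended member set.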
I'm running on K8s v1.8.2 and using the rabbitmq:3.7-alpine Docker image.
I started a 3-node cluster with a randomly generated Erlang cookie secret, then I scaled that cluster up to 5 nodes, but the two new nodes had a different randomly generated Erlang cookie secret.
You can see the nodes fail to join in the logs from the original 3 node cluster.
rabbitmqctl status
rabbitmqctl environment