rabbitmq / rabbitmq-peer-discovery-k8s

Kubernetes-based peer discovery mechanism for RabbitMQ

Kubernetes API requests in a pure IPv6 environment fail with an "nxdomain" #55

Closed taijitao closed 4 years ago

taijitao commented 4 years ago

Hi, I had a pure IPv6 k8s cluster and I want to install the RabbitMQ Helm chart. I followed the instructions in https://www.rabbitmq.com/networking.html#distribution-ipv6. My parameters (in the Helm chart):

   environment: |-
      RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 128 -kernel inetrc '/etc/rabbitmq/erl_inetrc'  -proto_dist inet6_tcp"
      RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp "
  erl_inetrc: |-
    {inet6, true}.

The file erl_inetrc was created under /etc/rabbitmq, and I found this error in the log:

2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-10-15 07:33:55.000 [debug] <0.238.0> GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/tazou/endpoints/zt4-crmq
2019-10-15 07:33:55.015 [debug] <0.238.0> Response: {error,{failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}}
2019-10-15 07:33:55.015 [debug] <0.238.0> HTTP Error {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}
2019-10-15 07:33:55.015 [info] <0.238.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}
2019-10-15 07:33:55.016 [error] <0.237.0> CRASH REPORT Process <0.237.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 167 in application_master:init/4 line 138
2019-10-15 07:33:55.016 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 167

inet can resolve the name to an IPv6 address:

[root]# kubectl exec -ti zt5-crmq-0 rabbitmqctl eval 'inet:gethostbyname("kubernetes.default.svc.cluster.local", inet6).'
{ok,{hostent,"kubernetes.default.svc.cluster.local",[],inet6,16,
             [{64769,43981,0,0,0,0,0,1}]}}
[root]#  kubectl exec -ti zt5-crmq-0 rabbitmqctl eval 'inet_res:resolve("kubernetes.default.svc.cluster.local", in, aaaa).'
{ok,{dns_rec,{dns_header,1,true,query,true,false,true,true,false,0},
             [{dns_query,"kubernetes.default.svc.cluster.local",aaaa,in}],
             [{dns_rr,"kubernetes.default.svc.cluster.local",aaaa,in,0,5,
                      {64769,43981,0,0,0,0,0,1},
                      undefined,[],false}],
             [],[]}}

nslookup returns an IPv6 address when type=aaaa, and an error when type=a.
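The failure mode can be illustrated with a toy resolver in Python (this only mimics the A-vs-AAAA distinction, not real DNS; the IPv6 address fd01:abcd::1 is the `{64769,43981,0,0,0,0,0,1}` tuple from the eval output above written in hex): a client that only asks for A records sees nxdomain even though an AAAA answer exists.

```python
# Toy lookup table standing in for cluster DNS: the service has only
# an AAAA (IPv6) record and no A (IPv4) record.
RECORDS = {
    ("kubernetes.default.svc.cluster.local", "AAAA"): ["fd01:abcd::1"],
}

def resolve(host, rrtype):
    """Return ("ok", answers) or ("error", "nxdomain"), mimicking the
    shape of the failure httpc reported for an IPv4-only lookup."""
    answers = RECORDS.get((host, rrtype))
    if answers is None:
        return ("error", "nxdomain")
    return ("ok", answers)

# An IPv4-only client asks for A records and gets nothing:
print(resolve("kubernetes.default.svc.cluster.local", "A"))
# → ('error', 'nxdomain')
# An IPv6-aware client asks for AAAA and succeeds:
print(resolve("kubernetes.default.svc.cluster.local", "AAAA"))
# → ('ok', ['fd01:abcd::1'])
```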

I don't know why httpc:request returns nxdomain. Is it a bug or a configuration issue?

B.R, Tao

taijitao commented 4 years ago

Does this plugin support an IPv6-only stack, or does it support a dual IPv6/IPv4 stack?

michaelklishin commented 4 years ago

This plugin issues requests to the Kubernetes API over HTTP[S]. It is entirely unaware of what IP version is used underneath. nxdomain, as I'm sure you know, means "no domain resolved". This plugin cannot be responsible for that.

For cases when proper hostname resolution configuration is not available, Erlang provides its own resolution configuration file which should be pointed at using the ERL_INETRC environment variable. You don't need it most of the time but sometimes it is indispensable.
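For illustration, a minimal resolver configuration file along the lines this thread uses (the path is the one from the report above; treat this as a sketch, not a complete inetrc):

```erlang
%% /etc/rabbitmq/erl_inetrc -- Erlang VM resolver configuration.
%% Prefer IPv6 (AAAA) lookups when resolving hostnames.
{inet6, true}.
```

The VM is then pointed at the file either via `ERL_INETRC=/etc/rabbitmq/erl_inetrc` or, as in the report above, with `-kernel inetrc '/etc/rabbitmq/erl_inetrc'` in `RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS`.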

lukebakken commented 4 years ago

Versions of the software from this rabbitmq-users discussion:

rabbitmq_3.7.18-1.el7
erlang_22.0.7-1.el7

I suspect this is due to the httpc library defaulting to inet: docs.

Note the default value for IpFamily.
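The effect of that default can be sketched with Python's `socket.getaddrinfo` (an analogue for illustration, not httpc itself): restricting resolution to one address family hides addresses of the other family entirely.

```python
import socket

def addresses(host, port, family):
    """Addresses usable for `family`; an empty list is the analogue of
    the nxdomain-style failure an IPv4-only client reports."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({sockaddr[0] for _f, _t, _p, _c, sockaddr in infos})

# A literal IPv6 address is visible to an IPv6 lookup...
print(addresses("::1", 443, socket.AF_INET6))   # ['::1']
# ...but invisible when the client insists on IPv4, as with
# httpc's default IpFamily of inet:
print(addresses("::1", 443, socket.AF_INET))    # []
```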

@taijitao since you have access to an IPv6-only environment, I will create a custom build of this plugin for you to test.

lukebakken commented 4 years ago

@taijitao - here is the custom plugin built from this branch:

rabbitmq_peer_discovery_k8s-3.7.20+rc.1.dirty.ez.zip

To install:

Please note that cluster formation only happens the first time RabbitMQ is started. If these nodes have been started before, you will have to reset them (rabbitmqctl reset) or delete their data directory.

lukebakken commented 4 years ago

@taijitao any chance to test this? ^^^^

taijitao commented 4 years ago

Yes, I'll test that. Could you give me some explanation of what you have changed in the custom build?

michaelklishin commented 4 years ago

@taijitao it configures (unconditionally, at the moment) the HTTP client's socket address family to IPv6.

taijitao commented 4 years ago

I have tested it and it worked. The Erlang setting is {inet6, true}. The good news:

2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery Kubernetes: setting IpFamily to inet6...
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery Kubernetes: setting IpFamily to inet6 response: ok
2019-10-22 06:10:28.934 [info] <0.274.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-10-22 06:10:28.934 [info] <0.274.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-10-22 06:10:29.016 [info] <0.274.0> All discovered existing cluster peers: rabbit@zt2-crmq-1, rabbit@zt2-crmq-0
2019-10-22 06:10:29.016 [info] <0.274.0> Peer nodes we can cluster with: rabbit@zt2-crmq-0
2019-10-22 06:10:29.032 [warning] <0.274.0> Could not auto-cluster with node rabbit@zt2-crmq-0: {badrpc,nodedown}

But it fails to form a cluster; I now have two separate nodes. Docker processes:

bash-4.2$ ps -ef

UID        PID  PPID  C STIME TTY          TIME CMD
rabbitmq     1     0  0 06:09 ?        00:00:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server start
rabbitmq   197     1  0 06:09 ?        00:00:00 /usr/lib64/erlang/erts-10.4.4/bin/epmd -daemon
rabbitmq   383     1  1 06:09 ?        00:00:18 /usr/lib64/erlang/erts-10.4.4/bin/beam.smp -W w -A 64 -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048
rabbitmq   551   383  0 06:10 ?        00:00:00 erl_child_setup 1048576
rabbitmq  1894   551  0 06:10 ?        00:00:00 inet_gethost 4
rabbitmq  1895  1894  0 06:10 ?        00:00:00 inet_gethost 4
rabbitmq  9563     0 35 06:26 ?        00:00:00 /usr/lib64/erlang/erts-10.4.4/bin/beam.smp -B -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -boot star
rabbitmq  9676  9563 34 06:26 ?        00:00:00 erl_child_setup 1048576
rabbitmq  9697     0  2 06:26 ?        00:00:00 bash
rabbitmq  9706  9697  0 06:26 ?        00:00:00 ps -ef

michaelklishin commented 4 years ago

According to the log discovery via Kubernetes API endpoint has succeeded. However, nodes could not contact and/or authenticate with each other. This is not a responsibility of this plugin. See rabbit@zt2-crmq-0 logs for more clues. This part of the discussion is mailing list material.

michaelklishin commented 4 years ago

See Using IPv6 for Inter-node and CLI Tool Communication.

michaelklishin commented 4 years ago

httpc can only use one address family for its sockets. So we have a couple of options:

I personally would prefer the latter. @taijitao WDYT?

Gsantomaggio commented 4 years ago

Hi, I have a k8s cluster configured for pure IPv6 (with Kind).

I tried this patch because I need it here as well. It seems to work correctly:

[vagrant@localhost k8s_statefulsets]$ kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP                NODE                 NOMINATED NODE   READINESS GATES
rabbitmq-0             1/1     Running   0          9m59s   fd00:10:244::27   kind-control-plane   <none>           <none>
rabbitmq-1             1/1     Running   0          8m43s   fd00:10:244::28   kind-control-plane   <none>           <none>
rabbitmq-2             1/1     Running   0          7m51s   fd00:10:244::29   kind-control-plane   <none>           <none>

and:

 kubectl describe service rabbitmq
Name:                     rabbitmq
Namespace:                default
Labels:                   app=rabbitmq
Annotations:              kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"rabbitmq"},"name":"rabbitmq","namespace":"default"},"spe...
Selector:                 app=rabbitmq
Type:                     NodePort
IP:                       fd00:10:96::99a8
Port:                     http  15672/TCP
TargetPort:               15672/TCP
NodePort:                 http  31672/TCP
Endpoints:                [fd00:10:244::27]:15672,[fd00:10:244::28]:15672,[fd00:10:244::29]:15672
Port:                     amqp  5672/TCP
TargetPort:               5672/TCP
NodePort:                 amqp  30672/TCP
Endpoints:                [fd00:10:244::27]:5672,[fd00:10:244::28]:5672,[fd00:10:244::29]:5672
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

also the cluster status:

 rabbitmqctl cluster_status
Cluster status of node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local ...
Basics

Cluster name: rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local

Disk Nodes

rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local

Running Nodes

rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local

I noticed that for some reason the command check_port_connectivity does not work correctly in this stack:

 rabbitmq-diagnostics check_port_connectivity
Testing TCP connections to all active listeners on node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local ...
Error:
Connection to ports of the following listeners on node rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local failed:
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 15672, protocol: http, purpose: HTTP API

lukebakken commented 4 years ago

@michaelklishin working on a PR to fix this in an "auto detect" fashion

taijitao commented 4 years ago

Thanks @lukebakken for your help. It's better to auto-detect than to switch between different binary plugins. The cluster is now created based on your private build.

michaelklishin commented 4 years ago

Auto-detection has a tendency to fail in ways that are hard to understand. If we can't get auto-detection to work reliably, there will be no switching between binary plugins but rather an option that lets the operator tell the plugin which address family to use.

taijitao commented 4 years ago

That's fine if an option is provided. Would it go in erl_inetrc or in the plugin configuration?

lukebakken commented 4 years ago

@taijitao @Gsantomaggio if you have time, I would really appreciate you testing the fix in https://github.com/rabbitmq/rabbitmq-peer-discovery-common/pull/11

rabbitmq_peer_discovery_common-3.7.20+rc.1.2.gb768f10.ez.zip

The changes in https://github.com/rabbitmq/rabbitmq-peer-discovery-common/pull/11 look for the presence of {inet6, true} in your inetrc file and will set the appropriate httpc option if found.
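Loosely, the detection can be pictured with this Python sketch (the actual change is Erlang code in rabbitmq-peer-discovery-common that sets httpc's address-family option; the function name here is made up): scan the inetrc contents for `{inet6, true}` and pick the HTTP client's address family accordingly.

```python
import re

# Hypothetical re-creation, in Python, of the idea behind the patch.
INET6_RE = re.compile(r"\{\s*inet6\s*,\s*true\s*\}")

def ip_family_from_inetrc(inetrc_text):
    """Return the address family the HTTP client should use,
    based on the Erlang inetrc file contents."""
    if INET6_RE.search(inetrc_text):
        return "inet6"
    return "inet"

print(ip_family_from_inetrc("{inet6, true}."))    # inet6
print(ip_family_from_inetrc("%% nothing here"))   # inet
```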

hustlzp1981 commented 4 years ago

@taijitao @lukebakken Could you help take a look at my issue? Thanks a lot! I have tried what you mentioned above as well as other methods, but the RabbitMQ pod always fails with the error below in my IPv6 setup:

ERROR: epmd error for host osh-openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local: nxdomain (non-existing domain)

1) I added the following in configmap-etc.yaml:

       environment: |-
         RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 128 -kernel inetrc '/etc/rabbitmq/erl_inetrc' -proto_dist inet6_tcp"
         RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp"
       erl_inetrc: |-
         {inet6, true}.

2) In my armada manifest, I pull the image: rabbitmq: docker.io/rabbitmq:3.7.24

Thanks! Zhipeng

hustlzp1981 commented 4 years ago

@lukebakken Do I need your patch? Has it been merged into some release (3.7.24 or later)? Thanks! Zhipeng

michaelklishin commented 4 years ago

@hustlzp1981 have you seen the milestone on this PR and the 3.7.20 release notes?

michaelklishin commented 4 years ago

@hustlzp1981 this is not a support forum. Please post your questions to the mailing list.

nxdomain means that the hostname (osh-openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local) failed to resolve. This PR simply makes the HTTP client use IPv6 if it is configured via ERL_INETRC. There must be an AAAA DNS record in place or the client won't be able to resolve it.

hustlzp1981 commented 4 years ago

Thanks @michaelklishin! Could you tell me which mailing list I should use?

michaelklishin commented 4 years ago

RabbitMQ has only one and it hasn't changed since 2014.

Gsantomaggio commented 4 years ago

nxdomain is a common problem in k8s; maybe we should update the documentation to link this document and this document, and add some RabbitMQ-specific examples.

hustlzp1981 commented 4 years ago

Thanks! I have now fixed the nxdomain issue in my IPv6 k8s setup according to the guide above:

osh-openstack-rabbitmq-cluster-wait-9rw6p   1/1   Running   0   17m
osh-openstack-rabbitmq-rabbitmq-0           1/1   Running   0   17m

However, I still have another issue. The pod osh-openstack-rabbitmq-cluster-wait uses rabbitmqadmin to connect to RabbitMQ but always gets an error. It works in my IPv4 setup.

++ active_rabbit_nodes
2020-03-17T10:31:12.124589385Z stderr F ++ wc -w
2020-03-17T10:31:12.134367271Z stderr F ++ rabbitmqadmin_authed list nodes -f bash
2020-03-17T10:31:12.134427089Z stderr F ++ set +x
2020-03-17T10:31:12.179073378Z stderr F Traceback (most recent call last):
2020-03-17T10:31:12.179644557Z stderr F error: [Errno 111] Connection refused
2020-03-17T10:31:12.17964969Z stderr F *** Could not connect: [Errno 111] Connection refused

michaelklishin commented 4 years ago

Could not connect: [Errno 111] Connection refused is specific enough: a TCP connection (presumably to the HTTP API endpoint) was refused.

michaelklishin commented 4 years ago

This is not a Kubernetes support forum so I will lock this.