Closed codex70 closed 1 year ago
Hi @codex70 - Thanks for filing this issue and I'm sorry that you're having trouble deploying.
Would you be able to provide a bit more information about your configuration such as the exact values.yaml
files you're using for both Consul and ECK? I was able to roughly follow your steps and it appears that everything comes online without Consul crashing. Although I suspect the lack of a Kubernetes Service in the Beat
deployment will cause it to fail to connect to the mesh.
Here's what I did:
* `helm install consul hashicorp/consul -f values.yaml`
demo $ cat values.yaml
```yaml
global:
  datacenter: "dc1"
  name: consul
  tls:
    enabled: true
    enableAutoEncrypt: true
  acls:
    manageSystemACLs: true
server:
  replicas: 1
connectInject:
  replicas: 1
  enabled: true
controller:
  enabled: true
ui:
  enabled: true
  service:
    enabled: true
    type: LoadBalancer
```
* `helm install elastic-operator elastic/eck-operator -n elastic-system --create-namespace -f eck.yaml`
demo $ cat eck.yaml
```yaml
podAnnotations: {
  consul.hashicorp.com/connect-inject: "true",
  consul.hashicorp.com/connect-service: "elastic-operator"
}
```
```
demo $ k apply -f filebeat.yaml
beat.beat.k8s.elastic.co/filebeat unchanged
demo $ k get beat
NAME       HEALTH   AVAILABLE   EXPECTED   TYPE       VERSION   AGE
filebeat   red                             filebeat             7m19s
```
I suspect there are some other issues to resolve before you'll be able to make this configuration work, namely that you'll need a Service named `filebeat` (headless if necessary) that fronts your Filebeat pods, as well as a ServiceAccount:
[from the daemonset]
```
Events:
  Type     Reason        Age                 From                  Message
  ----     ------        ----                ----                  -------
  Warning  FailedCreate  81s (x18 over 12m)  daemonset-controller  Error creating: pods "filebeat-beat-filebeat-" is forbidden: error looking up service account default/filebeat: serviceaccount "filebeat" not found
```
Hi @kschoche,
Thanks for getting back to me so quickly.
The configuration for Elastic Operator is just the podAnnotations as you have above, the rest is default configuration. Elasticsearch itself is deployed in a separate cluster and the plan is to use Filebeat
to connect over the service mesh.
There is a service account:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: Filebeat
```
And the headless service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: filebeat
  labels:
    app: filebeat
spec:
  selector:
    app: filebeat
  clusterIP: None
```
Consul is slightly more complicated, as we're running a federated cluster; as previously mentioned, the plan is to use Filebeat to send logs to the primary cluster:
```yaml
consul:
  global:
    name: consul
    datacenter: dc2
    metrics:
      enabled: true
    tls:
      enabled: true
      caCert:
        secretName: consul-federation
        secretKey: caCert
      caKey:
        secretName: consul-federation
        secretKey: caKey
    federation:
      enabled: true
      primaryDatacenter: dc1
    acls:
      manageSystemACLs: true
      replicationToken:
        secretName: consul-federation
        secretKey: replicationToken
    gossipEncryption:
      secretName: consul-gossip-encryption-key
      secretKey: key
  connectInject:
    enabled: true
    default: false
  controller:
    enabled: true
  meshGateway:
    enabled: true
  client:
    exposeGossipPorts: true
  server:
    extraVolumes:
      - type: secret
        name: consul-federation
        items:
          - key: serverConfigJSON
            path: config.json
        load: true
    exposeGossipAndRPCPorts: true
    ports:
      serflan:
        port: 9301
  apiGateway:
    enabled: true
    image: "hashicorp/consul-api-gateway:0.3.0"
    logLevel: debug
    managedGatewayClass:
      enabled: true
      serviceType: LoadBalancer
      useHostPorts: true
      copyAnnotations:
        service:
          annotations: |
            - external-dns.alpha.kubernetes.io/hostname
            - external-dns.alpha.kubernetes.io/ttl
  syncCatalog:
    enabled: true
    default: true
    toConsul: true
    toK8S: true
    syncClusterIPServices: false
```
The rest is exactly as you have it above. I would be interested to know how you get on just adding the service account and headless service. I might be able to create a separate cluster later today to try a really simple setup and see if I get the same problem.
As I say, the Consul configuration is a little complicated, especially as this is a secondary datacenter in a federated cluster, but I suspect this isn't the problem.
From what I understand, a service with an advertised port is required for the Consul connect injector to configure the Envoy proxy and register the service in Consul. This is required for communication to happen on the service mesh, even if the service is not listening on any port.
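As a sketch of that requirement (all names here are illustrative, not taken from the thread): the injector matches connect-injected pods to a Service via its selector and uses a declared port when registering the service, even if nothing listens on it.

```yaml
# Hypothetical minimal Service shape for a connect-injected workload.
apiVersion: v1
kind: Service
metadata:
  name: my-app            # must match the name the service registers under
spec:
  selector:
    app: my-app           # must select the pods carrying the connect-inject annotation
  ports:
    - port: 80            # a declared port is required; no listener has to exist
```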
OK, so how can a service that doesn't have an advertised port communicate in the service mesh? There are plenty of reasons why a service may not need to accept incoming traffic but still needs to connect to a service in the mesh (any job-like process, for example). The only way I have been able to get it to work is to create an ingress, which seems to defeat the point.
@codex70 This is exactly my thought as well, because there is such a thing as a pure-client scenario. Regardless, this is a hard requirement of Consul.
The good news is that the service does not actually need to be listening on the port, at least that is my understanding from the example in the docs. In that example, the service declares port 80, but in reality nothing is listening, as the pod just runs `bin/sh -c -- while true; do sleep 30; done;`.
The cloud-native best practice would be to run a small web service (Flask, Sinatra, Express, etc.) for a health check. But alas, according to their docs this is not necessary.
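A sketch of that docs pattern as described above (names and image are illustrative): a pod that only sleeps, fronted by a Service declaring port 80 that nothing listens on.

```yaml
# Hypothetical "pure client" workload: in the mesh via connect-inject,
# with a declared port 80, while the container merely sleeps forever.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleeper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleeper
  template:
    metadata:
      labels:
        app: sleeper
      annotations:
        consul.hashicorp.com/connect-inject: "true"
    spec:
      containers:
        - name: sleeper
          image: busybox
          command: ["/bin/sh", "-c", "--"]
          args: ["while true; do sleep 30; done;"]
---
apiVersion: v1
kind: Service
metadata:
  name: sleeper
spec:
  selector:
    app: sleeper
  ports:
    - port: 80   # declared but unused; per the docs, registration still works
```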
> The good news is that the service does not actually need to be listening on the port, at least that is my understanding from the example in the docs. In that example, the service declares port 80, but in reality nothing is listening, as the pod runs `bin/sh -c -- while true; do sleep 30; done;`.
That's a really good idea and I will give it a try some time over the next couple of days and report back. The big issue I have is that if it doesn't work it destroys the entire cluster, so I will probably have to create a new cluster where I can test things.
@darkn3rd I've tried again making sure the service shows as listening on port 80:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: filebeat-test
  labels:
    app: filebeat-test
spec:
  selector:
    app: filebeat-test
  clusterIP: None
  ports:
    - port: 80
```
But unfortunately it still completely destroys the entire cluster. I've updated to the latest versions of Consul etc., so it appears that this is still very much an issue.
@kschoche it would be good to know if anyone at Consul is able to replicate the problem
I am curious what k8s services are finally rendered from the Beat CRD. Unless someone at HashiCorp is familiar with this operator and CRD, it'll be harder for them to spot any issues. I am curious specifically about what's rendered between a service and the headless service; for example, if they share the same ports, Consul will fail. Also, check out the logs from the various Consul components, and check resources as well: Consul doesn't log that it is out of resources, it just spins forever in a stuck state. Without playing with the Beat CRD myself, I wouldn't know where to look, per se. Sounds like an exciting project.
@codex70 I found this while exploring Consul versus some other service meshes and had also recently found this discussion, which made me wonder if, in your federated environment, part of the issue is a "filebeat" service in another cluster in the same mesh being treated as "the same service" (whatever that would entail)?
@jrhunger, it was a good idea to check, but in my case the filebeat services do have different names in each environment.
I suspect the real problem I'm having is that, by default, there is no port or endpoint that Filebeat exposes. I've just checked, though, and it looks like it might be possible to expose an endpoint for metrics. I will take a look and see if that would fix the issue. My big problem is that if I get it wrong, it destroys the entire Consul cluster in a totally unrecoverable way. In the end I have to completely destroy all Kubernetes clusters and start again from scratch. This is obviously a real problem for Consul.
@codex70 Maybe the issue is that your filebeat service is using `hostNetwork: true`? Maybe then, when the Consul CNI sets the iptables rules to route all pod traffic through the Envoy sidecar, it's doing it for the whole host due to the lack of network isolation?
@jrhunger, the filebeat service is super simple:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: filebeat
  labels:
    app: filebeat
spec:
  selector:
    app: filebeat
  ports:
    - name: http
      port: 5506
      targetPort: 5506
      protocol: TCP
```
I found out that it's possible to set up Filebeat with an endpoint for metrics, so I used this port (5506) as the port and targetPort for the service. This allows me to create a simple service which can be ClusterIP.
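For reference, a sketch of what that might look like in the Beat resource: Filebeat's `http.*` settings expose its local stats endpoint, giving the Service's targetPort something real to point at (the version shown is illustrative, and 5506 is the thread's choice; Filebeat's own default for this endpoint is 5066).

```yaml
# Hypothetical sketch: enabling Filebeat's HTTP stats endpoint via the ECK Beat CRD.
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: filebeat
spec:
  type: filebeat
  version: 8.5.0        # illustrative version
  config:
    http.enabled: true  # serve the local monitoring/stats endpoint
    http.host: 0.0.0.0  # listen on the pod IP, not just localhost
    http.port: 5506     # matches the Service port/targetPort above
```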
If I don't add the `consul.hashicorp.com/connect-inject: "true"` pod annotation, this seems to work; however, as soon as I try to add the filebeat service to the mesh, it completely destroys the whole Kubernetes cluster.
It starts with all ingress controllers failing with a bad gateway 502, then eventually I get site cannot be reached.
At this point, about half of the pods on the cluster enter into a CrashLoopBackOff. It appears that anything trying to connect with Consul is unable to do so:
```
"error":"Get \"https://consul.service.consul:8501/v1/agent/connect/ca/roots\": dial tcp x.x.x.x:8501: connect: connection refused
```
@codex70 this is the part I'm talking about, in the filebeat daemonset yaml:
```yaml
spec:
  automountServiceAccountToken: true
  serviceAccount: filebeat
  dnsPolicy: ClusterFirstWithHostNet
  hostNetwork: true
```
You're running Filebeat on every node, and it is using `hostNetwork`, so I think injecting the Consul Envoy sidecar (and the associated CNI activity) is probably causing all traffic (maybe including that from the other Envoy proxies) to pass through the filebeat Envoy. You could inspect the iptables rules after deployment to confirm.
@jrhunger well spotted, that's it. You've explained it perfectly and now I understand why it was causing problems.
It looks like I don't necessarily need:
```yaml
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
```
Now I need to find out what the implications are of removing it, there may be some further setup required for Filebeat, but I now know what was happening and how to fix it!
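As a sketch of what that change might look like in the DaemonSet pod spec (commented lines are the ones removed; whether Filebeat still reaches node-level logs afterwards is the open question):

```yaml
# Hypothetical pod spec after dropping host networking. With hostNetwork
# removed, the pod gets its own network namespace, so the injected Envoy
# sidecar and the Consul CNI iptables rules only affect this pod's traffic,
# not the whole node.
spec:
  automountServiceAccountToken: true
  serviceAccount: filebeat
  dnsPolicy: ClusterFirst
  # dnsPolicy: ClusterFirstWithHostNet  # removed: only needed with hostNetwork
  # hostNetwork: true                   # removed: avoids host-wide traffic redirection
```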
Once again, massive thanks.
#### Overview of the Issue
Filebeat runs without a service, so in order to add Filebeat to the service mesh it is necessary to create a headless service. As soon as this headless service is added to the service mesh, all Consul pods stop working, and consequently any service-mesh services also fail, leaving the entire environment broken.
The only way I have found to fix this is to completely destroy the Kubernetes clusters and start again from scratch. Obviously this is highly undesirable.
#### Reproduction Steps
Install Filebeat using the following filebeat.yaml file: