
Fluent Operator Walkthrough

The walkthrough doesn't seem to work on Apple Silicon OS X #35

Open jeff303 opened 6 months ago

jeff303 commented 6 months ago

Hi all!

I went through the walkthrough on my M3 MacBook Pro and ran into a few issues.

First, when attempting to run ./create-minikube-cluster-for-mac.sh, I got this error:

Warning: kubernetes-cli 1.29.3 is already installed and up-to-date.
To reinstall 1.29.3, run:
  brew reinstall kubernetes-cli
😄  minikube v1.32.0 on Darwin 14.4 (arm64)
❗  minikube skips various validations when --force is supplied; this may lead to unexpected behavior
✨  Using the hyperkit driver based on user configuration

❌  Exiting due to DRV_UNSUPPORTED_OS: The driver 'hyperkit' is not supported on darwin/arm64

error: no context exists with the name: "minikube"

It seems that the hyperkit driver is not available on Apple Silicon. So I tried starting Minikube without that option (just with minikube start), and got farther.
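For reference, here is roughly what I ran instead (the docker driver and the resource sizes here are my own choices rather than the walkthrough's; the docker driver works on darwin/arm64 where hyperkit does not):

# start Minikube with the docker driver instead of hyperkit
minikube start --driver=docker --cpus=4 --memory=8g
kubectl config use-context minikube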

However, when trying to deploy fluentd using this step, the pod failed to start (CrashLoopBackOff). Checking the logs revealed:

exec /fluentd/bin/fluentd-watcher: exec format error

So I think the fluentd-watcher image it's trying to use is incompatible with my architecture. However, while I'm relatively familiar with Kubernetes, I'm not as familiar with Minikube, nor with Apple Silicon-related architecture issues. Does anyone have pointers on how I might work around this?
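One way to confirm the architecture mismatch (assuming the walkthrough's default is the plain kubesphere/fluentd:v1.14.6 tag rather than an -arm64 variant) is to list the platforms published for that tag:

# if no linux/arm64 entry shows up, the exec format error is expected on Apple Silicon
docker buildx imagetools inspect kubesphere/fluentd:v1.14.6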

Thank you!

jeff303 commented 6 months ago

OK, I had a little bit of help on this. It turns out I just needed to set the arm64 image variant in the Fluentd custom resource, and the container then starts (no more exec format error):

  image: kubesphere/fluentd:v1.14.6-arm64

However, the process still fails:

/fluentd/bin/fluentd-watcher
level=error msg="start Fluentd error" error="fork/exec /usr/bin/fluentd: no such file or directory"
level=info msg=backoff delay=1s
level=info msg="backoff timer done" actual=1.001294876s expected=1s
level=error msg="start Fluentd error" error="fork/exec /usr/bin/fluentd: no such file or directory"

I tried running the container as root and symlinking that path, which seemed to fix it, but I'm not sure how to persist this in the custom resource:

docker run -u root --entrypoint bash -it kubesphere/fluentd:v1.14.6-arm64
# within the container
ln -s /usr/local/bundle/bin/fluentd /usr/bin/fluentd
/fluentd/bin/fluentd-watcher
level=info msg="Fluentd started"
2024-03-22 14:03:58 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2024-03-22 14:03:58 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
...

Edit: after reading the source here, I found a cleaner way to do this that doesn't require running as root: pass the path of the fluentd binary to fluentd-watcher via its -b flag.

docker run kubesphere/fluentd:v1.14.6-arm64 -b /usr/local/bundle/bin/fluentd

Unfortunately, I can't find a way to persist this in the StatefulSet, since any edit gets overwritten immediately (presumably by the operator), and the Fluentd CRD doesn't seem to accept these options anywhere (unless I missed it).

Edit 2: OK, that can be persisted via args in the Fluentd custom resource:

spec:
  args:
  - -b
  - /usr/local/bundle/bin/fluentd

With these changes, everything seems to be running on my M3.
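For anyone else hitting this, here is a consolidated sketch of the Fluentd custom resource changes (apiVersion and kind per the fluent-operator CRDs; the name and namespace are placeholders for whatever the walkthrough uses):

apiVersion: fluentd.fluent.io/v1alpha1
kind: Fluentd
metadata:
  name: fluentd
  namespace: fluent
spec:
  # arm64 image variant avoids the exec format error
  image: kubesphere/fluentd:v1.14.6-arm64
  # point fluentd-watcher at where the fluentd binary actually lives in this image
  args:
  - -b
  - /usr/local/bundle/bin/fluentd
  # ...rest of the walkthrough's Fluentd spec unchanged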

jeff303 commented 6 months ago

OK, it seems that the forward output isn't working at all:

[2024/03/25 21:53:43] [error] [output:forward:forward.0] no upstream connections available
[2024/03/25 21:53:43] [error] [output:forward:forward.0] no upstream connections available
[2024/03/25 21:53:43] [ warn] [engine] failed to flush chunk '93-1711403618.673907638.flb', retry in 9 seconds: task_id=3, input=tail.0 > output=forward.0 (out_id=0)
[2024/03/25 21:53:43] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2024/03/25 21:53:43] [ warn] [output:es:es.1] http_do=-1 URI=/_bulk
[2024/03/25 21:53:43] [ warn] [engine] failed to flush chunk '93-1711403618.678663555.flb', retry in 8 seconds: task_id=5, input=tail.0 > output=forward.0 (out_id=0)
[2024/03/25 21:53:43] [ warn] [engine] failed to flush chunk '93-1711403618.670578263.flb', retry in 11 seconds: task_id=0, input=tail.0 > output=es.1 (out_id=1)
[2024/03/25 21:53:43] [error] [http_client] broken connection to elasticsearch-master.elastic.svc:9200 ?
[2024/03/25 21:53:43] [ warn] [output:es:es.1] http_do=-1 URI=/_bulk
[2024/03/25 21:53:43] [ warn] [engine] failed to flush chunk '93-1711403618.673907638.flb', retry in 7 seconds: task_id=3, input=tail.0 > output=es.1 (out_id=1)
[2024/03/25 21:53:43] [error] [http_client] broken connection to elasticsearch-master.elastic.svc:9200 ?
[2024/03/25 21:53:43] [ warn] [output:es:es.1] http_do=-1 URI=/_bulk
[2024/03/25 21:53:43] [ warn] [engine] failed to flush chunk '93-1711403618.678663555.flb', retry in 7 seconds: task_id=5, input=tail.0 > output=es.1 (out_id=1)
[2024/03/25 21:53:44] [error] [output:forward:forward.0] no upstream connections available

I'm not sure why the Fluent Bit instance can't connect to the fluentd service for forwarding. I went through some of the DNS troubleshooting steps here, and DNS doesn't seem to be the problem. I almost wonder whether fluentd isn't listening on the configured forward input port for some reason; I don't see any log message indicating it has started listening (though I haven't dug through the code enough to know whether such a message is expected).
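A couple of checks that might narrow it down (the fluent namespace, the fluentd service name, and forward port 24224 are assumptions based on the walkthrough's defaults, so adjust to match the cluster):

# is there a Service for fluentd, and does it expose the forward port?
kubectl get svc -n fluent
# does that Service have ready endpoints behind it?
kubectl get endpoints -n fluent
# reachability test from a throwaway pod with a full netcat
kubectl run nc-test -n fluent --rm -it --restart=Never --image=alpine -- \
  sh -c 'apk add --no-cache netcat-openbsd && nc -vz fluentd 24224'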

jeff303 commented 6 months ago

Starting from scratch, and documenting one more thing I had to fix.

The logs from the Strimzi Kafka operator show:

2024-03-26 22:04:24 ERROR AbstractOperator:284 - Reconciliation #4(timer) Kafka(kafka/my-cluster): createOrUpdate failed
io.strimzi.operator.cluster.model.KafkaVersion$UnsupportedKafkaVersionException: Unsupported Kafka.spec.kafka.version: 3.1.0. Supported versions are: [3.6.0, 3.6.1, 3.7.0]
        at io.strimzi.operator.cluster.model.KafkaVersion$Lookup.supportedVersion(KafkaVersion.java:180) ~[io.strimzi.cluster-operator-0.40.0.jar:0.40.0]
        at io.strimzi.operator.cluster.operator.assembly.ZooKeeperVersionChangeCreator.<init>(ZooKeeperVersionChangeCreator.java:85) ~[io.strimzi.cluster-operator-0.40.0.jar:0.40.0]

So I ran kubectl edit -n kafka kafka to change spec.kafka.version to 3.6.1 and inter.broker.protocol.version to 3.6, and then Kafka started successfully.
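For reference, the relevant part of the Kafka custom resource after that edit looks roughly like this (only the fields I changed are shown):

spec:
  kafka:
    version: 3.6.1
    config:
      inter.broker.protocol.version: "3.6"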

jeff303 commented 6 months ago

It seemed that nothing was getting indexed into the Elasticsearch instance at all. The Elasticsearch logs showed tons of messages like the following:

{"@timestamp":"2024-03-27T00:26:17.260Z", "log.level": "WARN", "message":"received plaintext http traffic on an https channel, closing connection Netty4HttpChannel{localAddress=/10.244.0.26:9200, remoteAddress=/10.244.0.18:49076}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-master-0][transport_worker][T#1]","log.logger":"org.elasticsearch.xpack.security.transport.netty4.SecurityNetty4HttpServerTransport","elasticsearch.cluster.uuid":"ldQm9yJlRhqemKCC9XZDjQ","elasticsearch.node.id":"yajs6batT8GloQL9XCUP2g","elasticsearch.node.name":"elasticsearch-master-0","elasticsearch.cluster.name":"elasticsearch"}

I think the security settings need to be disabled if we want the Fluent pods to be able to send messages to it (or else the Fluent pods need to be configured to use TLS and credentials).

I tried kubectl edit statefulset -n elastic elasticsearch-master to tweak all of the xpack.security.* settings, to no avail (I'm not able to get the pod to start up in a healthy state after changing them).
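If anyone else wants to try this route, the place I'd attempt it next is the chart's values rather than the rendered StatefulSet, since the chart templates the security config. An untested sketch, assuming the official elastic/elasticsearch Helm chart (and noting that disabling security is only reasonable for a local demo):

# values.yaml
esConfig:
  elasticsearch.yml: |
    xpack.security.enabled: false
    xpack.security.http.ssl.enabled: false

Then re-run the walkthrough's Elasticsearch install as a helm upgrade --install with -f values.yaml so the chart regenerates the StatefulSet consistently.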

jeff303 commented 5 months ago

OK, at least some of these issues are handled by updating to a much newer version of the Fluent operator. Opened #36 for that.

jamiejackson commented 1 month ago

I ran into the same issues with TLS and credentials. I doubt this repo is still maintained, since the instructions don't yield a working configuration and there have been no commits in over two years.