Hi,
On all of my brokers I got the same error:

```
2018-05-15 19:20:49,646 ERROR [KafkaServer id=0] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer) [kafka-shutdown-hook]
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
	at kafka.server.KafkaServer.shutdown(KafkaServer.scala:550)
	at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:48)
	at kafka.Kafka$$anon$1.run(Kafka.scala:89)
```

The log also shows corrupted index files being recovered:

```
[Log partition=xxxxxx, dir=/var/lib/kafka/kafka-log0] Found a corrupted index file corresponding to log file /var/lib/kafka/kafka-log0/xxxxxx/00000000000000000000.log due to Corrupt index found, index file (/var/lib/kafka/kafka-log0/xxxxxx/00000000000000000000.index) has non-zero size but the last offset is 0 which is no greater than the base offset 0.}, recovering segment and rebuilding index files... (kafka.log.Log) [pool-6-thread-1]
```

Can someone help?
I never saw this kind of error before. It is quite strange that it happened on all brokers. What kind of storage do you use?
We are using NFS
Ok. I wonder if that might be the problem. TBH I never tried Kafka with NFS, so I do not know how well it works. How are the NFS volumes provisioned for the different pods? Are you sure that each broker instance has its own NFS volume / its own path on the shared volume? I was wondering whether they might be overwriting each other's files.
So what kind of storage do you recommend?
I guess it depends on what you have available :-). Local storage is one of the options - availability can then be handled by replication. In AWS, for example, EBS volumes should work reasonably well.
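For example, with Strimzi you could raise the replication settings via the `kafka-config` JSON in the cluster ConfigMap. This is just a sketch of what a replication-based setup could look like, not a recommendation for your cluster:

```yaml
kafka-config: |-
  {
    "default.replication.factor": 3,
    "min.insync.replicas": 2,
    "offsets.topic.replication.factor": 3,
    "transaction.state.log.replication.factor": 3
  }
```

With three replicas per partition, losing one broker's local disk does not lose any acknowledged data.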
Still the same problem with the new version (0.4.0). Can it be a configuration problem? Or do we really need to use Gluster as storage?
AFAIK we have other users using Gluster. So if you have it available, you can definitely give it a try. I tried some googling and found these slides which list NFS issues. From my experience, NFS is not a good match for this kind of system performance-wise, but I wouldn't have expected errors like this. Unfortunately, I do not have any NFS volume available right now to give it a try.
Just for the record ... Gluster normally uses the same network for storage as is used for the Kafka communication. This increases the network load significantly compared to storage with a dedicated network. But leaving performance aside, it should work.
We tried with a local disk and we got the same error. Can it be a configuration problem?
That is strange. Could you please share the ConfigMaps you use to deploy the cluster?
Here it is:
```yaml
apiVersion: v1
data:
  kafka-config: |-
    {
      "default.replication.factor": 1,
      "offsets.topic.replication.factor": 3,
      "transaction.state.log.replication.factor": 3
    }
  kafka-healthcheck-delay: '15'
  kafka-healthcheck-timeout: '5'
  kafka-metrics-config: |-
    {
      "lowercaseOutputName": true,
      "rules": [
        {
          "pattern": "kafka.server<type=(.+), name=(.+)PerSec\\w*><>Count",
          "name": "kafka_server_$1_$2_total"
        },
        {
          "pattern": "kafka.server<type=(.+), name=(.+)PerSec\\w*, topic=(.+)><>Count",
          "name": "kafka_server_$1_$2_total",
          "labels": {
            "topic": "$3"
          }
        }
      ]
    }
  kafka-nodes: '3'
  kafka-storage: '{ "type": "persistent-claim", "size": "10Gi", "delete-claim": false }'
  topic-operator-config: '{ }'
  zookeeper-healthcheck-delay: '15'
  zookeeper-healthcheck-timeout: '5'
  zookeeper-metrics-config: |-
    {
      "lowercaseOutputName": true
    }
  zookeeper-nodes: '3'
  zookeeper-storage: '{ "type": "persistent-claim", "size": "1Gi", "delete-claim": false }'
kind: ConfigMap
metadata:
  creationTimestamp: '2018-05-21T21:21:52Z'
  labels:
    app: strimzi-dpp
    strimzi.io/kind: cluster
    strimzi.io/type: kafka
  name: strimzi-dpp
  namespace: strimzi
  resourceVersion: '8886565'
  selfLink: /api/v1/namespaces/strimzi/configmaps/strimzi-dpp
  uid: fb3f543d-5d3c-11e8-9fe5-5254008931bc
```
And one more question: do we really need the log dir? Can we let fluentd do its job and disable logging to the file system?
This looks completely normal ... :-( Did you try the `ephemeral` storage type? As for the other question ... the log dir referred to here is not for logs as in the log files which fluentd consumes. It is the directory where the journals with the messages sent to Kafka are stored (= the message log).
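You can see that layout in the paths from your error. Each topic partition gets a directory under the message log dir, holding segment and index files (a rough sketch; names are illustrative):

```
/var/lib/kafka/kafka-log0/           <- log.dirs: the message log, not logging output
  <topic>-<partition>/               <- e.g. the xxxxxx directory from your error
    00000000000000000000.log         <- the messages themselves
    00000000000000000000.index       <- offset index (the file reported as corrupted)
    00000000000000000000.timeindex   <- timestamp index
```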
We are using OpenShift on-prem 3.9. We tried standard provisioning with NFS and then local volumes (https://docs.openshift.com/container-platform/3.7/install_config/configuring_local.html).
One more thing we found out: a few minutes before a broker goes down, there is a big spike in memory, from 2 GB to 9 GB. Can it be connected? :)
Btw Thx for helping. :+1:
It looks like we just didn't have enough memory on the system... :) Our Kafka brokers are constantly using 12 GB of memory. Is there a way to optimize it?
I'm sorry ... I was travelling and didn't get to this yet.
You should be able to configure the Kubernetes resources in the config map: http://strimzi.io/docs/0.4.0/#resources_json_config. You can also configure the JVM `-Xmx` and `-Xms` options: http://strimzi.io/docs/0.4.0/#jvm_json_config.
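Something along these lines in your cluster ConfigMap, next to `kafka-config` etc. — just a sketch; I'm assuming the `kafka-resources` and `kafka-jvmOptions` key names and JSON shape from the linked docs sections, and the values are illustrative, not tuned recommendations:

```yaml
# Assumed keys per the linked 0.4.0 docs; check the docs for the exact format.
kafka-resources: |-
  {
    "requests": { "memory": "4Gi", "cpu": "1" },
    "limits": { "memory": "4Gi", "cpu": "2" }
  }
kafka-jvmOptions: |-
  { "-Xms": "2g", "-Xmx": "2g" }
```

Keeping `-Xmx` well below the container memory limit leaves headroom for Kafka's off-heap buffers and page-cache usage.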
TBH, I'm not sure this is related, but give it a try.
Did the memory setting help? I tried it with `local` type storage today on my Kubernetes cluster, but all seemed to work perfectly fine.
Yup, the JVM memory options helped. :+1: Can we set these options too?

```
KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -Djava.awt.headless=true"
```
Well, configuring the `KAFKA_JVM_PERFORMANCE_OPTS` is currently not possible. But I agree it might make sense. Wanna open a PR? ;-)
@tombentley I think you already mentioned you have been thinking about this. Any plans how to implement it?
@Dec- I added the options to configure these parameters to master. I think we can now close this issue, right?
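To give an idea of where this is heading, the performance flags could then be expressed through the same JVM options JSON. This is only a sketch: the `-server` boolean and the `-XX` map follow the shape of the `jvmOptions` schema as merged, but double-check the field names against the docs in master before relying on them:

```yaml
# Hypothetical kafka-jvmOptions covering the KAFKA_JVM_PERFORMANCE_OPTS flags
# from above; field names assumed from the merged jvmOptions schema.
kafka-jvmOptions: |-
  {
    "-server": true,
    "-XX": {
      "UseCompressedOops": "true",
      "UseConcMarkSweepGC": "true",
      "CMSClassUnloadingEnabled": "true",
      "DisableExplicitGC": "true"
    }
  }
```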
Hi @scholzj, could you please re-share the slides you mentioned for studying Kafka on NFS? The link shows a 500 error. Thanks!
They are not my slides. So if the URL doesn't work anymore, I do not have any backup.