strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

Error "Found a corrupted index file corresponding to log file /var/lib/kafka/kafka-logx/" #441

Closed. Dec- closed this issue 6 years ago.

Dec- commented 6 years ago

Hi,

On all of my brokers I got the same error:

2018-05-15 19:20:49,646 ERROR [KafkaServer id=0] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer) [kafka-shutdown-hook] java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down! at kafka.server.KafkaServer.shutdown(KafkaServer.scala:550) at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:48) at kafka.Kafka$$anon$1.run(Kafka.scala:89)

Log partition=xxxxxx, dir=/var/lib/kafka/kafka-log0] Found a corrupted index file corresponding to log file /var/lib/kafka/kafka-log0/xxxxxx/00000000000000000000.log due to Corrupt index found, index file (/var/lib/kafka/kafka-log0/xxxxxx/00000000000000000000.index) has non-zero size but the last offset is 0 which is no greater than the base offset 0.}, recovering segment and rebuilding index files... (kafka.log.Log) [pool-6-thread-1]

Can someone help?
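
For context, the second message above says Kafka is already recovering the segment and rebuilding the index files on its own. If you want to inspect the index file it complains about, a rough sketch using the DumpLogSegments tool that ships with Kafka (the topic-partition directory below is a placeholder):

# Sketch: dump the entries of the index file flagged in the error above
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /var/lib/kafka/kafka-log0/<topic-partition>/00000000000000000000.index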

scholzj commented 6 years ago

I have never seen this kind of error before. It is quite strange that it happened on all brokers. What kind of storage do you use?

Dec- commented 6 years ago

We are using NFS.

scholzj commented 6 years ago

Ok. I wonder if that might be the problem. TBH, I have never tried Kafka with NFS, so I do not know how well it works. How are the NFS volumes provisioned for the different pods? Are you sure that each broker instance has its own NFS volume / its own path on a shared volume? I was wondering whether they might be overwriting each other's files.
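
A rough way to check that, assuming the strimzi namespace that appears in the ConfigMap later in this thread (adapt to your setup):

# Sketch: list PVCs and PVs to see whether each broker pod is bound to its own
# volume or whether they all point at the same NFS export/path
kubectl get pvc -n strimzi -o wide
kubectl get pv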

Dec- commented 6 years ago

So what kind of storage do you recommend?

scholzj commented 6 years ago

I guess it depends on what you have available :-). Local storage is one of the options - availability can be handled by replication. In AWS, for example, EBS volumes should work reasonably well.
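
As a sketch of what that could look like in the cluster ConfigMap shown later in this thread, the kafka-storage entry could point the persistent claims at an EBS-backed StorageClass. The "class" key and the StorageClass name here are assumptions based on the 0.4.0 storage docs, so verify the exact key there:

# Sketch only: "class" and the StorageClass name are assumptions, check the 0.4.0 storage docs
kafka-storage: '{ "type": "persistent-claim", "size": "10Gi", "class": "gp2-ebs", "delete-claim": false }'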

Dec- commented 6 years ago

Still the same problem with the new version (0.4.0). Can it be a configuration problem? Or do we really need to use Gluster as storage?

scholzj commented 6 years ago

AFAIK we have other users using Gluster. So if you have it available, you can definitely give it a try. I tried some googling and found these slides, which list NFS issues. From my experience, NFS is not a good match for this kind of system performance-wise, but I wouldn't have expected errors like this. Unfortunately, I do not have any NFS volume available right now to give it a try.

Just for the record ... Gluster normally uses the same network for storage as is used for the Kafka communication. This increases the network load significantly compared to using storage with a dedicated network. But performance aside, it should work.

Dec- commented 6 years ago

We tried with a local disk and got the same error. Can it be a configuration problem?

scholzj commented 6 years ago

That is strange. Could you please share the ConfigMaps you use to deploy the cluster?

Dec- commented 6 years ago

Here it is:

apiVersion: v1
data:
  kafka-config: |-
    {
      "default.replication.factor": 1,
      "offsets.topic.replication.factor": 3,
      "transaction.state.log.replication.factor": 3
    }
  kafka-healthcheck-delay: '15'
  kafka-healthcheck-timeout: '5'
  kafka-metrics-config: |-
    {
      "lowercaseOutputName": true,
      "rules": [
        {
          "pattern": "kafka.server<type=(.+), name=(.+)PerSec\\w*><>Count",
          "name": "kafka_server_$1_$2_total"
        },
        {
          "pattern": "kafka.server<type=(.+), name=(.+)PerSec\\w*, topic=(.+)><>Count",
          "name": "kafka_server_$1_$2_total",
          "labels": {
            "topic": "$3"
          }
        }
      ]
    }
  kafka-nodes: '3'
  kafka-storage: '{ "type": "persistent-claim", "size": "10Gi", "delete-claim": false }'
  topic-operator-config: '{ }'
  zookeeper-healthcheck-delay: '15'
  zookeeper-healthcheck-timeout: '5'
  zookeeper-metrics-config: |-
    { "lowercaseOutputName": true }
  zookeeper-nodes: '3'
  zookeeper-storage: '{ "type": "persistent-claim", "size": "1Gi", "delete-claim": false }'
kind: ConfigMap
metadata:
  creationTimestamp: '2018-05-21T21:21:52Z'
  labels:
    app: strimzi-dpp
    strimzi.io/kind: cluster
    strimzi.io/type: kafka
  name: strimzi-dpp
  namespace: strimzi
  resourceVersion: '8886565'
  selfLink: /api/v1/namespaces/strimzi/configmaps/strimzi-dpp
  uid: fb3f543d-5d3c-11e8-9fe5-5254008931bc

Dec- commented 6 years ago

And one more question: do we really need the log dir? Can we let fluentd do its job and disable logging on the file system?

scholzj commented 6 years ago

This looks completely normal ... :-(

As for the other question ... the log dir referred to here is not for logs as in the log files that fluentd consumes. It is the directory where the messages sent to Kafka are stored (= the message log).
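
In plain Kafka terms, this is the broker's log.dirs setting, which is why the error at the top of this issue points at that directory. A minimal sketch of the corresponding broker property:

# server.properties sketch: log.dirs holds the message log segments and index files
# (e.g. /var/lib/kafka/kafka-log0/<topic-partition>/00000000000000000000.log from the error above)
log.dirs=/var/lib/kafka/kafka-log0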

Dec- commented 6 years ago

We are using OpenShift on-prem 3.9. We tried standard provisioning with NFS and then local volumes (https://docs.openshift.com/container-platform/3.7/install_config/configuring_local.html).

Dec- commented 6 years ago

One more thing we found out is that a few minutes before the broker goes down, there is a big spike in memory, from 2 GB to 9 GB. Can it be connected? :)

Btw, thanks for helping. :+1:

Dec- commented 6 years ago

It looks like we just didn't have enough memory on the system... :) Our Kafka brokers are constantly using 12 GB of memory; is there a way to optimize it?

scholzj commented 6 years ago

I'm sorry ... I was travelling and didn't get to this yet.

You should be able to configure the Kubernetes resources in the config map: http://strimzi.io/docs/0.4.0/#resources_json_config. You can also configure the JVM -Xmx and -Xms options: http://strimzi.io/docs/0.4.0/#jvm_json_config.

TBH, I'm not sure this is related, but give it a try.
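
For example, entries along these lines would sit next to kafka-config in the cluster ConfigMap posted above. The key names (kafka-resources, kafka-jvmOptions) and the JSON fields are recalled from the 0.4.0 docs linked above and may not be exact, so double-check them there:

# Sketch only: key names and fields are assumptions, see the linked 0.4.0 docs
kafka-resources: '{ "limits": { "memory": "4Gi" }, "requests": { "memory": "4Gi" } }'
kafka-jvmOptions: '{ "-Xmx": "2g", "-Xms": "2g" }'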

scholzj commented 6 years ago

Did the memory settings help? I tried it with local-type storage today on my Kubernetes cluster, and everything seemed to work perfectly fine.

Dec- commented 6 years ago

Yup, the JVM memory settings helped. :+1: Can we set these options too? KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -Djava.awt.headless=true"

scholzj commented 6 years ago

Well, configuring the KAFKA_JVM_PERFORMANCE_OPTS is currently not possible. But I agree it might make sense. Wanna open a PR? ;-)

@tombentley I think you already mentioned that you have been thinking about this. Any plans on how to implement it?

scholzj commented 6 years ago

@Dec- I added the options to configure these parameters to master. I think we can now close this issue, right?

TheDevarshiShah commented 1 year ago

AFAIK we have other users using Gluster. So if you have it available, you can definitely give it a try. I tried some googling and found these slides, which list NFS issues. From my experience, NFS is not a good match for this kind of system performance-wise, but I wouldn't have expected errors like this. Unfortunately, I do not have any NFS volume available right now to give it a try.

Just for the record ... Gluster normally uses the same network for storage as is used for the Kafka communication. This increases the network load significantly compared to using storage with a dedicated network. But performance aside, it should work.

Hi @scholzj, could you please re-share the slides you mentioned for studying Kafka on NFS? The link shows a 500 error. Thanks!

scholzj commented 1 year ago

They are not my slides. So if the URL doesn't work anymore, I do not have a backup.