streamnative / kop

Kafka-on-Pulsar - A protocol handler that brings native Kafka protocol to Apache Pulsar
https://streamnative.io/docs/kop
Apache License 2.0

[BUG] Namespace bundle for topic not served by this instance. Please redo the lookup. #1308

Open rillo-carrillo opened 2 years ago

rillo-carrillo commented 2 years ago

Describe the bug: The producer and consumer throw errors when publishing to or consuming from topics created with bin/pulsar-admin and with autoCreatedTopics. The error is intermittent: publishing sometimes fails and sometimes succeeds, and the same applies to consuming. The broker is configured with brokerDeleteInactiveTopicsEnabled=false.

To Reproduce (our setup is a 3-broker environment). Steps to reproduce the behavior:

  1. Start up your Pulsar Broker with KOP
  2. Create a partitioned topic
  3. Use bin/kafka-console-producer.sh to send messages
  4. Use bin/kafka-console-consumer.sh to receive messages

Expected behavior: Producing and consuming should work every time once the topic has been created.

Screenshots
2022-05-25T23:54:24,792+0000 [pulsar-io-5-3] WARN org.apache.pulsar.broker.service.BrokerService - Namespace bundle for topic (persistent://public/default/angeltest-partition-0) not served by this instance. Please redo the lookup. Request is denied: namespace=public/default
2022-05-25T23:54:24,792+0000 [pulsar-io-5-3] WARN io.streamnative.pulsar.handlers.kop.KafkaTopicManager - Get partition-0 error [Namespace bundle for topic (persistent://public/default/angeltest-partition-0) not served by this instance. Please redo the lookup. Request is denied: namespace=public/default].
2022-05-25T23:54:24,792+0000 [pulsar-io-5-3] INFO org.apache.pulsar.broker.PulsarService - No ledger offloader configured, using NULL instance
2022-05-25T23:54:24,792+0000 [pulsar-io-5-3] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/default/persistent/angeltest
2022-05-25T23:54:24,794+0000 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/angeltest] Closing managed ledger
2022-05-25T23:54:24,794+0000 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] ERROR io.streamnative.pulsar.handlers.kop.KafkaTopicManager - [[id: 0x2254288d, L:/10.42.52.172:19092 - R:/10.127.58.41:58628]]Get empty non-partitioned topic for name persistent://public/default/angeltest
2022-05-25T23:54:24,794+0000 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] ERROR io.streamnative.pulsar.handlers.kop.KafkaTopicManager - [[id: 0x2254288d, L:/10.42.52.172:19092 - R:/10.127.58.41:58628]] Failed to getTopicConsumerManager caused by getTopic 'persistent://public/default/angeltest-partition-0' returns empty

Additional context
Pulsar version: 2.9.1
KoP version: 2.9.2.17

BewareMyPower commented 2 years ago

Could you try the streamnative/sn-pulsar:2.9.2.17 image? It looks more like a Pulsar-side problem. There should be some fixes since 2.9.1, but I don't remember the specific PRs.

In addition, could you give more details?

  1. The number of partitions.
  2. The producer rate.
  3. The complete log files of all brokers during the time that getTopic returns empty.

And if you can reproduce this issue, please run the following command to see what will happen:

./bin/pulsar-admin topics partitioned-lookup angeltest

or specify the specific partition:

./bin/pulsar-admin topics lookup angeltest-partition-0

rillo-carrillo commented 2 years ago

I will try the new version as suggested. Meanwhile, here are the details:

  1. Just one partition: bin/pulsar-admin topics create-partitioned-topic angeltest -p 1
  2. We only tried to send one message.
  3. Broker logs attached.

Output of commands:
[root@broker-0 core]# bin/pulsar-admin topics partitioned-lookup angeltest
"persistent://public/default/angeltest-partition-0 pulsar+ssl://broker-0.brokers.pulsar-poc-v2.svc.cluster.local:6651"
[root@broker-0 core]# bin/pulsar-admin topics lookup [angeltest-partition-0]
"pulsar+ssl://broker-0.brokers.pulsar-poc-v2.svc.cluster.local:6651"

Logs: broker-2.log broker-1.log broker-0.log

rillo-carrillo commented 2 years ago

Hi @BewareMyPower, did this information help in analyzing the error I shared?

BewareMyPower commented 2 years ago

Sorry, I've been busy with other tasks recently. I will take a look at these logs soon.

BewareMyPower commented 2 years ago

It still looks like a Pulsar-side bug, triggered by the bundle split.

You can try to reproduce it with bundle split disabled and see if it will still happen.

loadBalancerAutoBundleSplitEnabled=false

You can also try a newer Pulsar version, as I suggested, to see whether this bug has been fixed.

rillo-carrillo commented 2 years ago

Thanks for the info.

I changed the broker.conf file and set the property mentioned above to false, but we still see the issue.

This seems to happen when the Kafka client connects to the KoP LB port but the request is handled by a broker that is not the leader of the topic under test.

Is there any other property that needs to be set or changed in broker.conf? Unfortunately, we are not able to test this on a newer Pulsar version.

BewareMyPower commented 2 years ago

No. It looks like an issue with the K8s deployment. Could you describe how you run KoP in K8s? Maybe @gaoran10 can provide some help.

rillo-carrillo commented 2 years ago

KOP is running on a 3 k8s node cluster:

on each node I have:

I also have one NodePort svc to balance the load across the proxies running on every broker pod. The same NodePort svc has a port that targets the port specified in the kafkaListeners property.

The values are as follows:

kafkaListeners=kafka_internal://0.0.0.0:9092,kafka_external://0.0.0.0:19092
kafkaProtocolMap=kafka_internal:PLAINTEXT,kafka_external:PLAINTEXT
kafkaAdvertisedListeners=kafka_external://:19092,kafka_internal://:9092

The node-name used in kafkaAdvertisedListeners is an address that external clients can resolve.

rillo-carrillo commented 2 years ago

@BewareMyPower @gaoran10 I have set up a new Pulsar cluster using the image mentioned above: streamnative/sn-pulsar:2.9.2.17.

The error keeps occurring when the producer/consumer connects through the load balancer and reaches a broker that is not the leader of the topic in question:

2022-06-07T00:05:03,562+0000 [pulsar-io-4-2] WARN org.apache.pulsar.broker.service.BrokerService - Namespace bundle for topic (persistent://public/default/angeltest-partition-0) not served by this instance. Please redo the lookup. Request is denied: namespace=public/default
2022-06-07T00:05:03,562+0000 [pulsar-io-4-2] WARN io.streamnative.pulsar.handlers.kop.KafkaTopicManager - Get partition-0 error [Namespace bundle for topic (persistent://public/default/angeltest-partition-0) not served by this instance. Please redo the lookup. Request is denied: namespace=public/default].
2022-06-07T00:05:03,562+0000 [pulsar-io-4-2] WARN org.apache.pulsar.broker.service.BrokerService - Namespace bundle for topic (persistent://public/default/angeltest) not served by this instance. Please redo the lookup. Request is denied: namespace=public/default
2022-06-07T00:05:03,562+0000 [pulsar-io-4-2] WARN io.streamnative.pulsar.handlers.kop.KafkaTopicManager - [[id: 0x349cafb2, L:/10.244.0.30:9092 - R:/10.244.0.1:48024]] Failed to getTopic persistent://public/default/angeltest: Namespace bundle for topic (persistent://public/default/angeltest) not served by this instance. Please redo the lookup. Request is denied: namespace=public/default
2022-06-07T00:05:03,650+0000 [pulsar-web-37-5] INFO org.eclipse.jetty.server.RequestLog - 10.244.0.30 - - [07/Jun/2022:00:05:03 +0000] "GET /admin/v2/persistent/public/default/angeltest/partitions HTTP/1.1" 200 16 "-" "Pulsar-Java-v2.9.2.17" 1
2022-06-07T00:05:03,838+0000 [pulsar-io-4-2] WARN org.apache.pulsar.broker.service.BrokerService - Namespace bundle for topic (persistent://public/default/angeltest-partition-0) not served by this instance. Please redo the lookup. Request is denied: namespace=public/default
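
When comparing the three broker logs, it can help to count which topics each broker refused. The following is a minimal sketch, not part of KoP or Pulsar: the function name `denied_topics` is hypothetical, and the pattern assumes the BrokerService warning keeps the wording shown in the log above.

```python
import re
from collections import Counter

# Matches only the BrokerService warning and captures the topic name.
# KafkaTopicManager repeats the same message, so it is deliberately not counted.
DENIED = re.compile(
    r"BrokerService - Namespace bundle for topic \(([^)]+)\) not served by this instance"
)

def denied_topics(log_text):
    """Return a Counter of topic name -> number of ownership denials."""
    return Counter(match.group(1) for match in DENIED.finditer(log_text))
```

Running it over each of broker-0.log, broker-1.log and broker-2.log and comparing the counters against the output of the lookup command would show whether the denials come only from brokers that do not own the topic.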

rillo-carrillo commented 2 years ago

The issue has been fixed. We used to have one service for all the external connections; now we have one service per broker for external connections.
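
Why this helps can be illustrated with a toy model. This is a sketch under assumptions, not KoP code: it assumes the partition is owned by broker-0 (as the lookup output earlier in the thread shows), that a shared service forwards each connection to an arbitrary broker pod, and that a per-broker service always reaches the broker it was created for. All names in it are hypothetical.

```python
import random

# Hypothetical model: three brokers, one partition owned by broker-0.
BROKERS = ["broker-0", "broker-1", "broker-2"]
OWNER = {"angeltest-partition-0": "broker-0"}

def shared_service(_advertised, rng):
    """One service in front of all brokers: any pod may receive the connection."""
    return rng.choice(BROKERS)

def per_broker_service(advertised, _rng):
    """One service per broker: the advertised address reaches exactly that broker."""
    return advertised

def produce(partition, route, rng):
    # Metadata tells the Kafka client which broker leads the partition;
    # the client then connects to that broker's advertised address.
    advertised = OWNER[partition]
    reached = route(advertised, rng)
    # The request succeeds only if the broker reached actually owns the topic.
    return reached == OWNER[partition]

def success_rate(route, attempts=1000, seed=0):
    """Fraction of produce attempts that land on the owning broker."""
    rng = random.Random(seed)
    ok = sum(produce("angeltest-partition-0", route, rng) for _ in range(attempts))
    return ok / attempts
```

In this model `success_rate(per_broker_service)` is always 1.0, while `success_rate(shared_service)` hovers around one third with three brokers, which matches the intermittent failures described at the top of the thread.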

dispalt commented 1 year ago

The issue has been fixed. We used to have one service for all the external connections; now we have one service per broker for external connections.

Assuming you installed this with Helm, how did you manage that?