Closed alexanderphoenix closed 3 years ago
Hi @alexanderphoenix , we run Kafka Minion in environments with lots of spring boot microservices and we can monitor the consumer groups.
Missing low watermark:
{"level":"warning","module":"collector","msg":"could not calculate partition lag because low water mark is missing","partition":3,"time":"2020-04-06T14:35:51Z","topic":"TOPIC_NAME"}
As the warn message already says, if Kafka Minion can not fetch the low water mark of a partition it can not calculate the topic lag for consumers on this topic. I have no idea why it can not fetch the low watermark on that partition? Can you start investigations around this? What's up with this partition, which broker is the leader, is it available etc.
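To make the dependency on the low water mark concrete, here is a minimal sketch of a per-partition lag calculation. It is not Kafka Minion's actual code; the function name and the clamping rules are assumptions chosen for illustration.

```python
from typing import Optional


def partition_lag(low: Optional[int], high: Optional[int], committed: int) -> Optional[int]:
    """Return the consumer lag for one partition, or None if a watermark is unknown.

    Hypothetical helper: illustrates why a missing low water mark blocks the
    lag calculation, not how Kafka Minion implements it.
    """
    if low is None or high is None:
        # Corresponds to the warning in the log: no watermark, no lag value.
        return None
    # A committed offset below the low water mark points at deleted data,
    # so the group can only be behind by what is still retained.
    effective = max(committed, low)
    # Never report negative lag if the commit is ahead of a stale high water mark.
    return max(high - effective, 0)
```

With a low water mark of 0 and a high water mark of 100, a group that committed offset 40 has a lag of 60; with no low water mark the function returns None, which is the situation the warning describes.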
Low watermarks are always 0:
and no matter what topic I choose, the low water mark is always 0, for example:
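This is expected if your topics use log compaction (cleanup.policy=compact), or if retention has not yet deleted any old segments. Note that Kafka's default cleanup policy is delete, not compact.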
Amazon MSK
I haven't used MSK myself nor do I know much about it. It's possible that there are incompatibilities with MSK. I recommend you read the README to understand how Kafka Minion works. The main difference to other consumers is that it reads the internal __consumer_offsets topic while others use the Kafka API to fetch watermarks and group offsets.
For instance there were reports that Kafka Minion is incompatible with Confluent Cloud because this topic is not accessible for customers. You should get clear error messages if that's the case though.
Hi @weeco ,
Thank you for confirming your setup works with kafka minion.
Based on your input I'm investigating whether this could be a problem with MSK. I'm also deploying Burrow as a test to see if it faces the same problems.
Will get back soon with my findings.
Cheers,
Alex
@alexanderphoenix Have you tried Burrow?
Hi @weeco, I did and unfortunately it was exhibiting exactly the same behaviour as kafka-minion. We investigated spring boot and the way it was producing and consuming, but couldn't find anything there either, especially when we considered that other tools like conduktor and kafdrop were seeing all the topics.
Hello @weeco, we face a similar issue.
I believe we see just a 'few' metrics: 'group' metrics such as kafka_server_fetcherlagmetrics_consumerlag work well, but I see no topic metrics (no offset, no lag...).
I'm not exactly sure what correct logs should look like... is this correct?
I've checked, there are some offsets in __consumer_offsets topic
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic __consumer_offsets --from-beginning
Via: https://www.reddit.com/r/apachekafka/comments/g11ffw/kafka_minion_consumer_group_lag_calculation/fngnbg6/ "Kafka Minion does not support consumer groups which still commit to Zookeeper, therefore it doesn't have any Zookeeper dependencies while Burrow supports those". I do not think we are using ZooKeeper for offset storage, nor have we changed offsets.storage=kafka (reportedly the setting responsible for this); I believe that is the default, as is enable.auto.commit=true.
I need to re-check zookeeper, maybe also check other tools? Need to check console-producer as @alexanderphoenix presented, thanks. By the way, would kafka-exporter provide consumer-lag? Does it work that different? Or is it some strange behavior of our springboot kafka client, kafka server or kafka-minion?
@sirkubax Please always check the /metrics endpoint of Kafka Minion if you suspect missing metric series. There are further points of failure between Kafka Minion and the Prometheus UI / TSDB where metrics are stored.
There are different ways of getting consumer group offsets from Kafka. Kafka Minion is different than most exporters as it consumes the __consumer_offsets topic (as described in the README) while others talk to the Kafka Brokers and ask for the offsets. You can give https://github.com/cloudhut/kowl a shot as you can see consumer group offsets there as well, but Kowl asks the Kafka brokers for the offset using Kafka's admin API.
@weeco thank you for quick reply
here is my /metrics endpoint dump: https://pastebin.com/F0fM7ktB
I don't think I see any kafka_minion_topic_partition series, nor kafka_minion_group_topic_partition_lag.
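A quick way to confirm which metric families are actually exposed is to scan the text of the /metrics endpoint. The sketch below parses the Prometheus text format with plain string handling; the helper names are hypothetical, and the sample lines are taken from this thread with the label sets abbreviated.

```python
from typing import Iterable, List, Set

# Two sample lines in Prometheus text format (labels shortened for readability).
SAMPLE = """\
# HELP lines and blank lines are skipped
kafka_minion_group_topic_partition_commit_count{group="kubatest",partition="0",topic="kubatest"} 5
kafka_minion_group_topic_partition_offset{group="kubatest",partition="0",topic="kubatest"} 0
"""


def metric_names(exposition_text: str) -> Set[str]:
    """Collect the metric names that appear in a /metrics dump."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The name ends at the first "{" (labels) or at the first space (no labels).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_prefixes(exposition_text: str, prefixes: Iterable[str]) -> List[str]:
    """Return the prefixes for which no metric name was found."""
    names = metric_names(exposition_text)
    return [p for p in prefixes if not any(n.startswith(p) for n in names)]


if __name__ == "__main__":
    wanted = ["kafka_minion_topic_partition", "kafka_minion_group_topic_partition_lag"]
    print(missing_prefixes(SAMPLE, wanted))
```

To run it against a live exporter, replace SAMPLE with the body fetched from the exporter's /metrics URL (the host and port depend on your deployment).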
@sirkubax Is the Kafka Minion consumer allowed to consume the Kafka topic? Confluent Cloud clusters for instance do not allow customers to consume the __consumer_offsets topic.
So I'm almost sure it can consume that topic, as I executed the console consumer command above. Do you suspect that it may be permission related, in a way that it is consuming only 'some' data?
These are the ACLs I have:
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal "User:kafka-minion-service-dev" --operation read --topic __consumer_offsets
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --list --topic __consumer_offsets
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
I'm not 100% sure since in the console I was testing with admin certs with following command
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic __consumer_offsets --from-beginning
and I did not see any error on kafka-minion logs (see above - looks like it is consuming something), BUT true, I'm not sure if the kafka-minion-service-dev user has all rights... let me check it...
FYI I've tried @alexanderphoenix example (adjusted commands here) without success:
/usr/local/kafka/bin/kafka-topics.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --create --topic kubatest
echo "TEST"| /usr/local/kafka/bin/kafka-console-producer.sh --broker-list kafka-1.test-xxx.xxx:9093 --producer.config /usr/local/kafka/config/admin.properties --topic kubatest
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic kubatest --consumer-property group.id=kubatest
TEST
^CProcessed a total of 1 messages
Here are related metrics:
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="0",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="1",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="2",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="3",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="4",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="5",topic="kubatest"} 5
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="0",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="1",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="2",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="3",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="4",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="5",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="0",topic="kubatest"} 0
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="1",topic="kubatest"} 0
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="2",topic="kubatest"} 1
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="3",topic="kubatest"} 2
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="4",topic="kubatest"} 0
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="5",topic="kubatest"} 0
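The last_commit values above look like Unix timestamps in milliseconds (an assumption based on their magnitude, not something stated in this thread). A small sketch for turning such a sample value into a readable UTC time:

```python
from datetime import datetime, timedelta, timezone


def last_commit_to_datetime(sample_value: str) -> datetime:
    """Convert a last_commit sample (assumed epoch milliseconds) to a UTC datetime."""
    millis = int(float(sample_value))  # "1.601028620459e+12" -> 1601028620459
    seconds, remainder_ms = divmod(millis, 1000)
    # Integer arithmetic avoids float rounding in the sub-second part.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=remainder_ms)


if __name__ == "__main__":
    print(last_commit_to_datetime("1.601028620459e+12").isoformat())
```

Under that assumption the commits shown above were made on 2020-09-25 at about 10:10 UTC, which tells you the group did commit recently even though no lag series is exported.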
let me check that permission you have mentioned with kafka-minion-service-dev user with command
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic __consumer_offsets --from-beginning
should I expect any error?
I have not (yet) succeeded in consuming the topic with the kafka-minion-service-dev user (I think there is some crt to JKS conversion issue, fixing it).
@weeco since I have a dump of __consumer_offsets created with the admin user, is there any parameter I can look for in order to check whether the data for the metrics kafka_minion_topic_partition_* or kafka_minion_group_topic_partition_lag is available in this topic?
ok, I'm 99% sure it is permission issue
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /home/ansible-bot/certs/kafka-minion-client.properties --topic __consumer_offsets --from-beginning --consumer-property group.id=kafka-minion-service-dev
Processed a total of 509 messages
vs admin
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic __consumer_offsets --from-beginning
Processed a total of 11389 messages
are you sure it is only reading __consumer_offsets topic? seems that admin user is processing way more data...
Initially I did only
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal "User:kafka-minion-service-dev" --operation read --topic __consumer_offsets
OK, so if I observed this right, there were ACLs missing, and unfortunately this is not documented :/
I checked this a bit by 'brute force'; here is a similar issue: https://github.com/linkedin/Burrow/issues/378. I was looking for a minimal set of permissions, and this is what I got:
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --list --principal User:kafka-minion-service-dev
ACLs for principal `User:kafka-minion-service-dev`
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
I'm not so happy to grant READ to all topics and groups (--topic "*" --group=*), but it is required to observe consumer lag and other properties... I initially thought this was stored in __consumer_offsets; I must have misunderstood, or some extra permissions are checked while extracting data from the __consumer_offsets topic... Please consider whether this concept is properly documented.
ISSUES: without group READ you would have this issue:
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /home/ansible-bot/certs/kafka-minion-client.properties --topic __consumer_offsets --from-beginning |tee kafka-minion-service-dev.___consumer_offsets |strings |wc
[2020-09-27 16:57:15,141] ERROR Error processing message, terminating consumer process: (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: console-consumer-9833
so fix it with
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal User:kafka-minion-service-dev --group '*' --operation=READ
once you give READ permissions you can consume offsets
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /home/ansible-bot/certs/kafka-minion-client.properties --topic __consumer_offsets --from-beginning | tee kafka-minion-service-dev.___consumer_offsets |strings |wc
^CProcessed a total of 17708 messages
Another issue - no partition offset:
{"level":"warning","module":"collector","msg":"could not calculate partition lag because low water mark is missing","partition":3,"time":"2020-09-27T14:02:09Z","topic":"kubatest"}
fix it with READ permission for topics
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal User:kafka-minion-service-dev --topic '*' --operation=READ
{"broker_id":3,"level":"debug","module":"cluster","msg":"received partition low water mark","offset":0,"partition":4,"time":"2020-09-27T14:02:43Z","timestamp":1601215363000,"topic":"kubatest"}
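If the logs are long, the warning and debug lines quoted above can be summarised per topic. This is a minimal sketch that assumes the JSON-per-line log format shown in this thread; the function name is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def partitions_missing_low_watermark(log_lines: Iterable[str]) -> Dict[str, List[int]]:
    """Group 'low water mark is missing' warnings by topic."""
    missing = defaultdict(set)
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("level") == "warning" and "low water mark is missing" in entry.get("msg", ""):
            missing[entry["topic"]].add(entry["partition"])
    return {topic: sorted(parts) for topic, parts in missing.items()}


# The two log lines quoted in this thread.
LINES = [
    '{"level":"warning","module":"collector","msg":"could not calculate partition lag because low water mark is missing","partition":3,"time":"2020-09-27T14:02:09Z","topic":"kubatest"}',
    '{"broker_id":3,"level":"debug","module":"cluster","msg":"received partition low water mark","offset":0,"partition":4,"time":"2020-09-27T14:02:43Z","timestamp":1601215363000,"topic":"kubatest"}',
]

if __name__ == "__main__":
    print(partitions_missing_low_watermark(LINES))
```

Once the topic permissions are in place, the result should shrink to an empty dictionary on later log windows.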
You also need DescribeConfigs in order to see partition count, partition cleanup policy etc kafka_minion_topic_partition_count
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal User:kafka-minion-service-dev --topic '*' --operation=DescribeConfigs
and of course you need that ACL as I've mentioned earlier for __consumer_offsets
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal "User:kafka-minion-service-dev" --operation READ --topic __consumer_offsets
The __consumer_offsets topic only stores the current offset for each group. I'm not entirely sure why it would need read access for all consumer groups. In fact Kafka Minion itself does not use a consumer group, nor does it talk to the Admin API to describe other consumer groups.
As for the topics it has to describe the topics in order to get the high and low watermark for each topic, along with the cleanup policy.
In our environment, we don't need Read ACL for kafka-minion on data topics, other than the __consumer_offsets topic. We do need Describe on all topics. Maybe Describe is necessary, and Read implicitly grants that so it worked in your case.
What we have:
# Describe on Cluster
Current ACLs for resource `ResourcePattern(resourceType=CLUSTER, name=kafka-cluster, patternType=LITERAL)`:
(principal=User:monitoring, host=*, operation=DESCRIBE, permissionType=ALLOW)
# Describe, DescribeConfigs on all Topics
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`:
(principal=User:monitoring, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
(principal=User:monitoring, host=*, operation=DESCRIBE, permissionType=ALLOW)
# Describe, Read (possibly unnecessary) on Groups
Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`:
(principal=User:monitoring, host=*, operation=DESCRIBE, permissionType=ALLOW)
(principal=User:monitoring, host=*, operation=READ, permissionType=ALLOW)
# Read on topic __consumer_offsets
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`:
(principal=User:monitoring, host=*, operation=READ, permissionType=ALLOW)
We gathered the necessary ACLs by looking in kafka-authorizer log denials. Do you have access to the authorizer log?
I'll give it another try with READ removed and let you know.
I do not have Describe on a cluster-wide scope, I wonder if it is required.... (for future readers, you apply it with --cluster) eg:
/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal User:kafka-minion-service-dev --cluster --operation=Describe
Almost final ACL:
ACLs for principal `User:kafka-minion-service-dev`
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW)
Are you sure cluster Describe is required? Seems to be working without it.
Great note with kafka-authorizer logs, I ignored it for a while - I've just checked it (post-factum) - most of the info could be found there. I did not see all though... :/
Are you sure we need a group Read permission? I've just removed them, and no error spotted yet... I've reduced ACL to
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW)
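Comparing ACL listings by eye gets error-prone. The sketch below parses the `kafka-acls.sh --list` output format shown in this thread into a set of tuples, so two listings can be diffed. The regular expressions are written against the lines quoted here and may need adjusting for other Kafka versions; all names are hypothetical.

```python
import re
from typing import NamedTuple, Set


class Acl(NamedTuple):
    resource_type: str
    name: str
    pattern_type: str
    principal: str
    operation: str
    permission: str


RESOURCE_RE = re.compile(r"ResourcePattern\(resourceType=(\w+), name=(.+), patternType=(\w+)\)")
ENTRY_RE = re.compile(r"\(principal=(.+), host=(.+), operation=(\w+), permissionType=(\w+)\)")

# The reduced listing quoted above.
LISTING = """\
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`:
(principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW)
"""


def parse_acl_listing(text: str) -> Set[Acl]:
    """Turn a kafka-acls --list dump into a set of Acl tuples (host is ignored)."""
    acls = set()
    resource = None
    for line in text.splitlines():
        match = RESOURCE_RE.search(line)
        if match:
            resource = match.groups()  # applies to the entries that follow
            continue
        match = ENTRY_RE.search(line)
        if match and resource:
            principal, _host, operation, permission = match.groups()
            acls.add(Acl(*resource, principal, operation, permission))
    return acls


if __name__ == "__main__":
    for acl in sorted(parse_acl_listing(LISTING)):
        print(acl)
```

Subtracting one parsed set from another (`expected - actual`) then shows exactly which rules differ between two clusters or two principals.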
Are you sure cluster Describe is required?
Without that, our clusters (confluent-package kafka 2.1 and 2.2) log a denial every minute:
:35:51,425] INFO Principal = User:monitoring is Denied Operation = Describe from host = 10.1.2.3 on resource = Cluster:LITERAL:kafka-cluster
:36:51,344] INFO Principal = User:monitoring is Denied Operation = Describe from host = 10.1.2.3 on resource = Cluster:LITERAL:kafka-cluster
Maybe someone can point out the relevant scraping code.
a group Read permission
I think it's not needed, but I granted all accounts Read on all groups anyway, as part of transitioning from no-ACL to enforcing ACL. (We surely should tighten that since group offsets changing can cause data inconsistencies)
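Since the authorizer log turned out to be the most useful source for finding missing ACLs, here is a small sketch that counts denials per principal, operation and resource. The regular expression is written against the two denial lines quoted above and is an assumption about the format; newer Kafka versions may log additional fields.

```python
import re
from collections import Counter
from typing import Iterable

DENIED_RE = re.compile(
    r"Principal = (\S+) is Denied Operation = (\w+) from host = (\S+) on resource = (\S+)"
)

# The denial lines quoted in this thread (timestamps are truncated in the original).
LOG = [
    ":35:51,425] INFO Principal = User:monitoring is Denied Operation = Describe from host = 10.1.2.3 on resource = Cluster:LITERAL:kafka-cluster",
    ":36:51,344] INFO Principal = User:monitoring is Denied Operation = Describe from host = 10.1.2.3 on resource = Cluster:LITERAL:kafka-cluster",
]


def count_denials(lines: Iterable[str]) -> Counter:
    """Count denied requests keyed by (principal, operation, resource)."""
    counts = Counter()
    for line in lines:
        match = DENIED_RE.search(line)
        if match:
            principal, operation, _host, resource = match.groups()
            counts[(principal, operation, resource)] += 1
    return counts


if __name__ == "__main__":
    for key, n in count_denials(LOG).most_common():
        print(n, *key)
```

Each distinct key in the result corresponds to one ACL that the principal is probably missing.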
@hdhoang
Maybe someone can point out the relevant scraping code.
a group Read permission
I could imagine that this is needed because the __consumer_offsets topic contains consumer group permissions and the Kafka ACL authorizer might be aware of that?
Funny case We have created another cluster with the minimal set of rules, now the group metrics are missing
is this DescribeConfig for a group or a READ for a group?
or in the end - are we missing this CLUSTER permissions?
The problem is I do not see any error in debug kafka-minion logs, neither in /usr/local/kafka/logs/kafka-authorizer.log
ok, false statement
I did not mention that this (group metrics) error happened on a fresh installation of a Kafka cluster. There was no consumer, no consumer group -> no metrics...
Would be nice to add some info in logs, or publish empty metrics...
solved with earlier commands
/usr/local/kafka/bin/kafka-topics.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --create --topic kubatest
echo "TEST"| /usr/local/kafka/bin/kafka-console-producer.sh --broker-list kafka-1.test-xxx.xxx:9093 --producer.config /usr/local/kafka/config/admin.properties --topic kubatest
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic kubatest --consumer-property group.id=kubatest
TEST
^CProcessed a total of 1 messages
I've started looking at this again with @alexanderphoenix as we'd put it on hold while working on other tasks.
I've added the most permissive ACL setup I can think of and we still don't see any lag metrics in Prometheus for our customer apps, but we do if we create a test topic using kafka/bin scripts.
This is just an experimental configuration as I wanted to prune permissions once I had a working configuration. I'm only showing here the ACLs I've added. There are hundreds of other ACLs set (allow rules for topics, no deny rules), but nonetheless, these rules should allow kafka-minion to perform any operation on any topic including __consumer_offsets (as I understand it)
Can anyone see a reason why this wouldn't work (for lag on all topics)?
Current ACLs for resource `Group:LITERAL:*`:
User:CN=REDACTED has Allow permission for operations: All from hosts: *
Current ACLs for resource `Topic:PREFIXED:*`:
User:CN=REDACTED has Allow permission for operations: All from hosts: *
The pod logs seem to suggest that metric collection is occurring though:
{"level":"debug","module":"cluster","msg":"starting to collect topic offsets","time":"2020-10-22T12:03:42Z"}
...
{"level":"debug","module":"cluster","msg":"collected topic offsets","time":"2020-10-22T12:03:42Z"}
Thanks in advance for any help
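One detail in the listing above that may be worth double-checking is the `Topic:PREFIXED:*` rule. As I understand Kafka's resource pattern rules (introduced with KIP-290), `*` acts as a wildcard only for LITERAL patterns, while a PREFIXED pattern matches resource names that start with the given string. The sketch below is a simplified model of that matching, not broker code, and it may or may not explain the behaviour reported here.

```python
def pattern_matches(pattern_type: str, acl_name: str, resource_name: str) -> bool:
    """Simplified model of how an ACL resource pattern matches a resource name."""
    if pattern_type == "LITERAL":
        # "*" is the wildcard only when the pattern type is LITERAL.
        return acl_name == "*" or acl_name == resource_name
    if pattern_type == "PREFIXED":
        # A prefixed "*" would only match names that literally start with "*".
        return resource_name.startswith(acl_name)
    raise ValueError("unsupported pattern type: " + pattern_type)


if __name__ == "__main__":
    for pattern_type in ("LITERAL", "PREFIXED"):
        print(pattern_type, pattern_matches(pattern_type, "*", "__consumer_offsets"))
```

If this model is right, re-creating the topic rule as a LITERAL `*` pattern would be a cheap experiment.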
Can anyone see a reason why this wouldn't work (for lag on all topics)? Thanks in advance for any help
I've added the most permissive ACL setup I can think of and we still don't see any lag metrics in Prometheus for our customer apps, but we do if we create a test topic using kafka/bin scripts
My guess here is, there is nothing to consume on a 'fresh', 'empty' kafka cluster, until you create some resource. Is that your case, is it an empty cluster?
Another thought: is there anything in /usr/local/kafka/logs/kafka-authorizer.log?
Thanks for looking into this @sirkubax
I can say that it's a very active cluster. It has hundreds of topics on it and hundreds of ACL's which provide control over access to topics. There are no deny rules, which so far as I understand should mean that the ACL's I mentioned above are as permissive as possible (I think)
I tried applying a '--cluster' rule as well. I tried both operation=All (again trying to be maximally permissive for testing) and operation=Describe. In both cases, it breaks consumer access and the following error is seen on a consumer connection:
[2020-11-05 11:38:25,989] WARN [Consumer clientId=consumer-1, groupId=test-consumer-group-3] Offset commit failed on partition sandpit.demo.test_topic-0 at offset 5540: The request timed out. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
Referring back to the earlier problem description: if I consume from a topic using kafka-console-consumer.sh, then I see lag. For the customer topics, I don't see any lag. Those connections are coming from in-house Java apps and are providing a consumer group to the connection object. I can also see those consumer groups when I run the kafka-consumer-groups command.
So, I agree that this seems like it could be related to permissions, but when I give permission to everything (that I know of), I still don't see the lag. It's a very strange situation.
I don't have access to the logs directly, but I'll see if I can get hold of the logs from AWS
@justCatchingRye Maybe your app uses an older Kafka API, or stores offsets in ZooKeeper? If so, the stats would only be created in the Kafka backend when you consume with the Kafka command-line tools such as kafka-consumer-groups.
I think offsets.storage=kafka setting is responsible https://gitter.im/spring-projects/spring-kafka?at=5acce83f6bbe1d2739d0bd24
https://kafka.apache.org/22/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html Storing Offsets Outside Kafka
Thanks @sirkubax, I checked this with customers to try and find out if there is anything particular about their client setup. I'm also, asking the same of AWS on the server end
What I can say is that the kafka client apps that we are troubleshooting have no zookeeper configuration. They only have broker endpoints defined. Further to that they are network blocked from talking to zookeeper endpoints. They can only talk to kafka endpoints if they present a valid certificate and have appropriate ACL entries to permits access to particular topics
I've looked into this further. I don't think there can be any issue with where offsets are stored, as our clients have no network access to ZooKeeper. Lag monitoring (kafka-minion) works so long as there are no ACLs set against a topic. As soon as ACLs are configured for a topic, lag monitoring stops.
For example, if I apply this set of rules so that I can consume and produce:
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=sandpit.demo.test_topic, patternType=LITERAL)`:
(principal=User:CN=sampleAppUser, host=*, operation=READ, permissionType=ALLOW)
(principal=User:CN=sampleAppUser, host=*, operation=CREATE, permissionType=ALLOW)
(principal=User:CN=sampleAppUser, host=*, operation=WRITE, permissionType=ALLOW)
(principal=User:CN=sampleAppUser, host=*, operation=DESCRIBE, permissionType=ALLOW)
As soon as this is applied, monitoring stops. If I remove the entries it works again.
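This on/off behaviour is what one would expect if the brokers run with `allow.everyone.if.no.acl.found=true`, which I have not verified for this cluster. With that setting, a resource without any ACLs is open to everyone, and the first ACL added to it restricts access to the principals listed. A deliberately simplified model (ALLOW rules only, no wildcards, no implied operations):

```python
from typing import Set, Tuple

AclEntry = Tuple[str, str]  # (principal, operation), ALLOW rules for one resource


def is_authorized(principal: str, operation: str,
                  resource_acls: Set[AclEntry],
                  allow_everyone_if_no_acl_found: bool) -> bool:
    """Simplified authorizer decision for a single resource."""
    if not resource_acls:
        # No ACL at all on the resource: the broker setting decides.
        return allow_everyone_if_no_acl_found
    # At least one ACL exists: only explicitly listed principals get through.
    return (principal, operation) in resource_acls


# The application rules from the listing above; the monitoring user is not in them.
TOPIC_ACLS = {
    ("User:CN=sampleAppUser", "READ"),
    ("User:CN=sampleAppUser", "CREATE"),
    ("User:CN=sampleAppUser", "WRITE"),
    ("User:CN=sampleAppUser", "DESCRIBE"),
}

if __name__ == "__main__":
    print(is_authorized("User:CN=kafkaMinionUser", "DESCRIBE", set(), True))
    print(is_authorized("User:CN=kafkaMinionUser", "DESCRIBE", TOPIC_ACLS, True))
```

In this model the monitoring principal loses Describe on the topic as soon as the application ACLs are added, unless a matching rule for the monitoring principal also applies to that topic.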
If I add additional rules for monitoring, they either have no effect or they break kafka-minion entirely (see below error)
1) Added read and describe to all topics
kafka-acls.sh --authorizer-properties zookeeper.connect=$zookeeper --add --allow-principal User:CN=$cn --operation Read --operation Describe --group '*' --topic '*'
Leads to this issue in kafka-minion, which doesn't really make sense if this rule is supposed to give access to ALL topics...
{"error":"kafka server: The client is not authorized to access this topic.","level":"panic","msg":"failed to get partition count","time":"2020-11-18T08:03:17Z","topic":"__consumer_offsets"}
2) Added read and describe to __consumer_offsets
kafka-acls.sh --authorizer-properties zookeeper.connect=$zookeeper --add --allow-principal User:CN=$cn --operation Read --operation Describe --group '*' --topic '__consumer_offsets'
Error above is seen on startup
3) Add cluster-level describe
kafka-acls.sh --authorizer-properties zookeeper.connect=$zookeeper --add --allow-principal User:CN=$cn --operation DESCRIBE --group '*' --topic '*' --cluster
No effect
I think what I need is a reference implementation. Something that shows a topic that is both secured by ACLs AND able to be monitored. I'm working on this, but I've tried lots of combinations already... I see no reason why the configuration below doesn't work (and causes the error above on startup):
Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`:
(principal=User:CN=sampleAppUser, host=*, operation=READ, permissionType=ALLOW)
(principal=User:CN=kafkaMinionUser, host=*, operation=ALL, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`:
(principal=User:CN=kafkaMinionUser, host=*, operation=ALL, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=CLUSTER, name=kafka-cluster, patternType=LITERAL)`:
(principal=User:CN=kafkaMinionUser, host=*, operation=DESCRIBE, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`:
(principal=User:CN=kafkaMinionUser, host=*, operation=ALL, permissionType=ALLOW)
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=sandpit.demo.test_topic, patternType=LITERAL)`:
(principal=User:CN=sampleAppUser, host=*, operation=READ, permissionType=ALLOW)
(principal=User:CN=sampleAppUser, host=*, operation=CREATE, permissionType=ALLOW)
(principal=User:CN=sampleAppUser, host=*, operation=WRITE, permissionType=ALLOW)
(principal=User:CN=sampleAppUser, host=*, operation=DESCRIBE, permissionType=ALLOW)
If anyone can see anything amiss, I'd be happy for any feedback. I will also ask AWS for the auth logs for this setup.
@justCatchingRye There are some managed offerings (e.g. Confluent Cloud) that use one Kafka cluster for multiple customers. In that case you won't ever get access to the __consumer_offsets topic. I'm not sure if this applies to the AWS offering though.
Thanks @weeco, I'm thinking that because it works when there are no ACLs configured that this proves that we do have access to __consumer_offsets, unless I'm misunderstanding and making too much of an assumption there?
Been playing around with ACLs to get Kafka Minion working after also getting the "could not calculate partition lag because low water mark is missing" message.
If you are using the Kafka GitOps project to manage your ACLs, then this is the set of customServiceAcls you need.
If you're using the command line kafka-acls tool, I'm sure you can translate the below into input for that tool.
kafka-minion:
  acl1:
    name: __consumer_offsets
    type: TOPIC
    pattern: LITERAL
    host: "*"
    principal: User:kafka-minion
    operation: READ
    permission: ALLOW
  acl2:
    name: __consumer_offsets
    type: TOPIC
    pattern: LITERAL
    host: "*"
    principal: User:kafka-minion
    operation: DESCRIBE_CONFIGS
    permission: ALLOW
  acl3:
    name: kafka-cluster
    type: CLUSTER
    pattern: LITERAL
    principal: User:kafka-minion
    host: "*"
    operation: DESCRIBE
    permission: ALLOW
  acl4:
    name: "*"
    type: TOPIC
    pattern: LITERAL
    principal: User:kafka-minion
    host: "*"
    operation: DESCRIBE
    permission: ALLOW
  acl5:
    name: "*"
    type: TOPIC
    pattern: LITERAL
    principal: User:kafka-minion
    host: "*"
    operation: DESCRIBE_CONFIGS
    permission: ALLOW
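For readers using the kafka-acls tool instead, the five rules above can be translated mechanically. The sketch below only builds command strings from flags that already appear in this thread (`--add`, `--allow-principal`, `--operation`, `--topic`, `--cluster`); the bootstrap server and config path are placeholders, and the helper name is hypothetical. The rule set is what commenters reported working, not a verified minimum.

```python
import shlex
from typing import Dict, List

# CLI spelling of the operations used in the ACL set above.
OPERATION_FLAG = {"READ": "Read", "DESCRIBE": "Describe", "DESCRIBE_CONFIGS": "DescribeConfigs"}

PRINCIPAL = "User:kafka-minion"
ACLS: List[Dict[str, str]] = [
    {"type": "TOPIC", "name": "__consumer_offsets", "operation": "READ"},
    {"type": "TOPIC", "name": "__consumer_offsets", "operation": "DESCRIBE_CONFIGS"},
    {"type": "CLUSTER", "name": "kafka-cluster", "operation": "DESCRIBE"},
    {"type": "TOPIC", "name": "*", "operation": "DESCRIBE"},
    {"type": "TOPIC", "name": "*", "operation": "DESCRIBE_CONFIGS"},
]


def kafka_acls_command(acl: Dict[str, str], bootstrap_server: str, command_config: str) -> str:
    """Build one kafka-acls.sh --add command line for a LITERAL allow rule."""
    args = [
        "kafka-acls.sh",
        "--bootstrap-server", bootstrap_server,
        "--command-config", command_config,
        "--add",
        "--allow-principal", PRINCIPAL,
        "--operation", OPERATION_FLAG[acl["operation"]],
    ]
    if acl["type"] == "TOPIC":
        args += ["--topic", acl["name"]]
    elif acl["type"] == "CLUSTER":
        args += ["--cluster"]
    else:
        raise ValueError("unsupported resource type: " + acl["type"])
    # shlex.join quotes "*" so the shell does not expand it.
    return shlex.join(args)


if __name__ == "__main__":
    for acl in ACLS:
        print(kafka_acls_command(acl, "broker:9093", "admin.properties"))
```

Printing the commands rather than executing them leaves room to review each rule before applying it.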
Thanks for all your inputs regarding the ACLs. I hope that the V2 will return more descriptive errors when Kafka cluster requests fail due to lacking permissions. We replaced sarama with franz-go which seems to be a superior Kafka client.
Hi all,
I'm facing an issue where after deploying kafka-minion on openshift, it's able to see and calculate group lag for messages that I'm creating and consuming via the kafka CLI commands, but not ones generated via the spring boot framework.
Producer:
Consumer:
We're using other tools to monitor our MSK instances (like Kafdrop and Conduktor) and those are able to calculate the consumer lag reliably on all topics.
When I query kafka_minion_group_topic_lag in Prometheus I can only see the topic and group generated via CLI.
I should mention that I'm seeing messages on the kafka-minion logs to do with partition lag (edited to remove topic), but I'm not sure if they're related to the fact I can't see the consumer lag:
and no matter what topic I choose, the low water mark is always 0, for example:
I'm fairly new to Kafka and its ancillaries, so please let me know if there's anything else I can provide to help diagnose the issue.
Many thanks, and stay safe!