redpanda-data / kminion

KMinion is a feature-rich Prometheus exporter for Apache Kafka written in Go. It is lightweight and highly configurable so that it will meet your requirements.
MIT License

Consumer Group Lag calculation missing for most topics #40

Closed alexanderphoenix closed 3 years ago

alexanderphoenix commented 4 years ago

Hi all,

I'm facing an issue where, after deploying kafka-minion on OpenShift, it can see and calculate group lag for messages that I produce and consume via the Kafka CLI commands, but not for ones generated via the Spring Boot framework.

Producer:

while true; do echo "TEST"; sleep $[ ( $RANDOM % 100) + 1 ]; done | kafka-console-producer.sh --broker-list $broker --producer.config $clientProperties --topic ops.test.topic

Consumer:

kafka-console-consumer.sh --bootstrap-server $broker --topic ops.test.topic --consumer.config $clientProperties --consumer-property group.id=test-consumer-group

We're using other tools to monitor our MSK instances (like Kafdrop and Conduktor) and those are able to calculate the consumer lag reliably on all topics.

When I query kafka_minion_group_topic_lag in Prometheus I can only see the topic and group generated via CLI.

I should mention that I'm seeing messages in the kafka-minion logs about partition lag (edited to remove the topic name), but I'm not sure whether they're related to the fact that I can't see the consumer lag:

{"level":"warning","module":"collector","msg":"could not calculate partition lag because low water mark is missing","partition":3,"time":"2020-04-06T14:35:51Z","topic":"TOPIC_NAME"}

and no matter what topic I choose, the low water mark is always 0, for example:

kafka_minion_topic_partition_high_water_mark{partition="4",topic="TOPIC_NAME"} 152
kafka_minion_topic_partition_low_water_mark{partition="0",topic="TOPIC_NAME"} 0

I'm fairly new to Kafka and its ancillaries, so please let me know if there's anything else I can provide to help diagnose the issue.

Many thanks, and stay safe!

weeco commented 4 years ago

Hi @alexanderphoenix , we run Kafka Minion in environments with lots of spring boot microservices and we can monitor the consumer groups.

Missing low watermark:

{"level":"warning","module":"collector","msg":"could not calculate partition lag because low water mark is missing","partition":3,"time":"2020-04-06T14:35:51Z","topic":"TOPIC_NAME"}

As the warning message already says, if Kafka Minion cannot fetch the low water mark of a partition, it cannot calculate the topic lag for consumers on this topic. I have no idea why it cannot fetch the low water mark on that partition. Can you start investigating this? What's up with this partition, which broker is the leader, is it available, etc.?
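
To illustrate why a missing low water mark blocks the calculation, here is a minimal sketch of a per-partition lag computation. It is not Kafka Minion's actual code: the function name `partition_lag`, the clamping rules and the committed offset of 100 are assumptions made for illustration; only the watermark values 152 and 0 are taken from the metrics quoted above.

```python
from typing import Optional


def partition_lag(high_water_mark: Optional[int],
                  low_water_mark: Optional[int],
                  committed_offset: int) -> Optional[int]:
    """Return the lag for one partition, or None if a watermark is unknown."""
    if high_water_mark is None or low_water_mark is None:
        # Same situation as in the warning: without both watermarks
        # no trustworthy lag value can be reported.
        return None
    # Assumption: a commit older than the low water mark points at data that
    # no longer exists, so the lag is counted from the low water mark instead.
    effective_offset = max(committed_offset, low_water_mark)
    # Never report a negative lag if the commit is ahead of a stale watermark.
    return max(high_water_mark - effective_offset, 0)


print(partition_lag(152, 0, 100))     # both watermarks known
print(partition_lag(152, None, 100))  # low water mark missing
```

With both watermarks known the function returns a number; with the low water mark missing it returns `None`, which corresponds to the series simply not appearing in Prometheus.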

Low watermarks are always 0:

and no matter what topic I choose, the low water mark is always 0, for example:

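This is expected for compacted topics (cleanup policy compact). It is also what you will see on topics with the default delete policy until retention has actually removed the oldest records.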

Amazon MSK

I haven't used MSK myself, nor do I know much about it. It's possible that there are incompatibilities with MSK. I recommend reading the README to understand how Kafka Minion works. The main difference from other consumers is that it reads the internal __consumer_offsets topic, while others use the Kafka API to fetch watermarks and group offsets.

For instance there were reports that Kafka Minion is incompatible with Confluent Cloud because this topic is not accessible for customers. You should get clear error messages if that's the case though.

alexanderphoenix commented 4 years ago

Hi @weeco ,

Thank you for confirming your setup works with kafka minion.

Based on your input I'm investigating whether this could be a problem with MSK. I'm also deploying Burrow as a test to see if it faces the same problems.

Will get back soon with my findings.

Cheers,

Alex

weeco commented 4 years ago

@alexanderphoenix Have you tried Burrow?

alexanderphoenix commented 4 years ago

Hi @weeco, I did and unfortunately it was exhibiting exactly the same behaviour as kafka-minion. We investigated spring boot and the way it was producing and consuming, but couldn't find anything there either, especially when we considered that other tools like conduktor and kafdrop were seeing all the topics.

sirkubax commented 4 years ago

Hello @weeco, we face a similar issue.

I believe we see just a 'few' metrics. For example, 'group' metrics such as kafka_server_fetcherlagmetrics_consumerlag work well, but I see no topic metrics: no offset, no lag ... [image]

I'm not exactly sure what correct logs should look like. Is this correct? [image]

I've checked, and there are some offsets in the __consumer_offsets topic:

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic __consumer_offsets --from-beginning

Via: https://www.reddit.com/r/apachekafka/comments/g11ffw/kafka_minion_consumer_group_lag_calculation/fngnbg6/ "Kafka Minion does not support consumer groups which still commit to Zookeeper, therefore it doesn't have any Zookeeper dependencies while Burrow supports those"

I do not think we are using Zookeeper for offset storage, nor have we touched the setting mentioned there ("offsets.storage=kafka setting is responsible for this"). I believe this is the default, the same as the default setting of enable.auto.commit=true.

I need to re-check Zookeeper, and maybe also check other tools? I also need to check the console producer as @alexanderphoenix presented, thanks. By the way, would kafka-exporter provide consumer lag? Does it work that differently? Or is it some strange behavior of our Spring Boot Kafka client, the Kafka server, or kafka-minion?

weeco commented 4 years ago

@sirkubax Please always check the /metrics endpoint of Kafka Minion if you suspect missing metric series. There are further points of failure between Kafka Minion and the Prometheus UI / TSDB where metrics are stored.
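
As a rough sketch of that check, the snippet below counts the series per metric name in a dump of the /metrics output and reports which expected metrics are absent. The helper names `series_per_metric` and `missing_metrics` are hypothetical, and the sample text is assembled from metric lines quoted in this thread rather than from a real scrape.

```python
import re
from collections import Counter

# Metric name at the start of a sample line; a label set or a value follows it.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def series_per_metric(exposition_text: str) -> Counter:
    """Count how many series each metric has in Prometheus text format."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _NAME.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def missing_metrics(exposition_text: str, expected: list) -> list:
    """Return the expected metric names that have no series at all."""
    counts = series_per_metric(exposition_text)
    return [name for name in expected if counts[name] == 0]


SAMPLE = """\
kafka_minion_topic_partition_high_water_mark{partition="4",topic="TOPIC_NAME"} 152
kafka_minion_topic_partition_low_water_mark{partition="0",topic="TOPIC_NAME"} 0
kafka_minion_group_topic_partition_offset{group="kubatest",partition="0",topic="kubatest"} 0
kafka_minion_group_topic_partition_offset{group="kubatest",partition="1",topic="kubatest"} 0
"""

EXPECTED = [
    "kafka_minion_topic_partition_high_water_mark",
    "kafka_minion_group_topic_partition_lag",
]

print(missing_metrics(SAMPLE, EXPECTED))
```

The same functions can be fed the body of a real scrape (for example, a saved copy of the exporter's /metrics page). If a metric is missing there, the problem is in Kafka Minion or its permissions rather than in Prometheus.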

There are different ways of getting consumer group offsets from Kafka. Kafka Minion is different than most exporters as it consumes the __consumer_offsets topic (as described in the README) while others talk to the Kafka Brokers and ask for the offsets. You can give https://github.com/cloudhut/kowl a shot as you can see consumer group offsets there as well, but Kowl asks the Kafka brokers for the offset using Kafka's admin API.

sirkubax commented 4 years ago

@weeco thank you for the quick reply

here is my /metrics endpoint dump: https://pastebin.com/F0fM7ktB

I don't think I see any kafka_minion_topic_partition metrics, nor kafka_minion_group_topic_partition_lag.

weeco commented 4 years ago

@sirkubax Is the Kafka Minion consumer allowed to consume the Kafka topic? Confluent Cloud clusters for instance do not allow customers to consume the __consumer_offsets topic.

sirkubax commented 4 years ago

So I'm almost sure it can consume that topic, as I ran the console consumer against it (see the command further down). Do you suspect that it may be permission related? In a way it is consuming 'some' data... [image] [image]

These are the ACLs I have:

/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal "User:kafka-minion-service-dev" --operation read --topic __consumer_offsets

/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --list --topic __consumer_offsets
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW) 

I'm not 100% sure, since in the console I was testing with admin certs using the following command:

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic __consumer_offsets --from-beginning

I did not see any error in the kafka-minion logs (see above: it looks like it is consuming something), BUT true, I'm not sure if the kafka-minion-service-dev user has all rights... let me check it...

FYI I've tried @alexanderphoenix example (adjusted commands here) without success:

/usr/local/kafka/bin/kafka-topics.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --create --topic kubatest

echo "TEST"|  /usr/local/kafka/bin/kafka-console-producer.sh --broker-list kafka-1.test-xxx.xxx:9093 --producer.config /usr/local/kafka/config/admin.properties --topic kubatest

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic  kubatest --consumer-property group.id=kubatest
TEST
^CProcessed a total of 1 messages

Here are related metrics:

kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="0",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="1",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="2",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="3",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="4",topic="kubatest"} 5
kafka_minion_group_topic_partition_commit_count{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="5",topic="kubatest"} 5

kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="0",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="1",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="2",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="3",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="4",topic="kubatest"} 1.601028620459e+12
kafka_minion_group_topic_partition_last_commit{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="5",topic="kubatest"} 1.601028620459e+12

kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="0",topic="kubatest"} 0
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="1",topic="kubatest"} 0
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="2",topic="kubatest"} 1
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="3",topic="kubatest"} 2
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="4",topic="kubatest"} 0
kafka_minion_group_topic_partition_offset{group="kubatest",group_base_name="kubatest",group_is_latest="true",group_version="0",partition="5",topic="kubatest"} 0

Let me check the permission you mentioned for the kafka-minion-service-dev user with this command:

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic __consumer_offsets --from-beginning

Should I expect any error?

sirkubax commented 4 years ago

I have not (yet) succeeded in consuming the topic with the kafka-minion-service-dev user (I think it's some crt-to-JKS conversion issue, fixing).

@weeco since I have a dump of __consumer_offsets created with the admin user, is there any parameter I can look for to check whether the data for the kafka_minion_topic_partition_* or kafka_minion_group_topic_partition_lag metrics is available in this topic?

sirkubax commented 4 years ago

OK, I'm 99% sure it is a permission issue

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /home/ansible-bot/certs/kafka-minion-client.properties --topic __consumer_offsets --from-beginning  --consumer-property group.id=kafka-minion-service-dev
Processed a total of 509 messages

vs admin

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic __consumer_offsets --from-beginning
Processed a total of 11389 messages

are you sure it is only reading __consumer_offsets topic? seems that admin user is processing way more data...

Initially I did only

/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal "User:kafka-minion-service-dev" --operation read --topic __consumer_offsets

sirkubax commented 4 years ago

OK, so if I observed this right, there were ACLs missing, and unfortunately this is not documented :/

I checked this a bit by 'brute force'. A similar issue is described at https://github.com/linkedin/Burrow/issues/378 and I was looking for a minimal set of permissions. This is what I got:

/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --list --principal User:kafka-minion-service-dev

ACLs for principal `User:kafka-minion-service-dev`
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW) 

Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
        (principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW) 

Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW) 

I'm not so happy to grant READ to all topics and groups (--topic "*" --group=*), but it is required to observe consumer lag and other properties... I initially thought this was stored in __consumer_offsets. I must have misunderstood, or some extra permissions are checked while extracting data from the __consumer_offsets topic... Please consider whether this concept is properly documented.

ISSUES: without group READ you would have this issue:

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /home/ansible-bot/certs/kafka-minion-client.properties --topic __consumer_offsets --from-beginning   |tee  kafka-minion-service-dev.___consumer_offsets |strings |wc
[2020-09-27 16:57:15,141] ERROR Error processing message, terminating consumer process:  (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: console-consumer-9833

so fix it with

 /usr/local/kafka/bin/kafka-acls.sh --bootstrap-server  kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal User:kafka-minion-service-dev --group '*'    --operation=READ

once you give READ permissions you can consume offsets

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /home/ansible-bot/certs/kafka-minion-client.properties --topic __consumer_offsets --from-beginning   |tee  kafka-minion-service-dev.___consumer_offsets |strings |wc
^CProcessed a total of 17708 messages

Another issue - no partition offset:

{"level":"warning","module":"collector","msg":"could not calculate partition lag because low water mark is missing","partition":3,"time":"2020-09-27T14:02:09Z","topic":"kubatest"}    

fix it with READ permission for topics

/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server  kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal User:kafka-minion-service-dev --topic '*'    --operation=READ

{"broker_id":3,"level":"debug","module":"cluster","msg":"received partition low water mark","offset":0,"partition":4,"time":"2020-09-27T14:02:43Z","timestamp":1601215363000,"topic":"kubatest"}
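
To see whether the READ grant actually cleared the warning for every partition, the JSON log lines can be scanned in order. The sketch below is only an illustration: the function name `partitions_without_low_water_mark` is made up, and it relies solely on the `msg`, `topic` and `partition` fields visible in the two log lines above. The sample log mixes those two lines with one hypothetical extra warning for partition 4.

```python
import json

MISSING_MSG = "could not calculate partition lag because low water mark is missing"
RECEIVED_MSG = "received partition low water mark"


def partitions_without_low_water_mark(log_lines):
    """Return (topic, partition) pairs whose latest state is 'low water mark missing'."""
    missing = set()
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or otherwise unparsable line
        topic, partition = entry.get("topic"), entry.get("partition")
        if topic is None or partition is None:
            continue
        if entry.get("msg") == MISSING_MSG:
            missing.add((topic, partition))
        elif entry.get("msg") == RECEIVED_MSG:
            # A later successful fetch clears an earlier warning.
            missing.discard((topic, partition))
    return sorted(missing)


LOG = [
    '{"level":"warning","module":"collector","msg":"could not calculate partition lag because low water mark is missing","partition":3,"time":"2020-09-27T14:02:09Z","topic":"kubatest"}',
    '{"level":"warning","module":"collector","msg":"could not calculate partition lag because low water mark is missing","partition":4,"time":"2020-09-27T14:02:09Z","topic":"kubatest"}',
    '{"broker_id":3,"level":"debug","module":"cluster","msg":"received partition low water mark","offset":0,"partition":4,"time":"2020-09-27T14:02:43Z","timestamp":1601215363000,"topic":"kubatest"}',
]

print(partitions_without_low_water_mark(LOG))
```

Note that the "received partition low water mark" line is logged at debug level, so this only works when debug logging is enabled.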

You also need DescribeConfigs in order to see the partition count, partition cleanup policy, etc. (kafka_minion_topic_partition_count):

/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server  kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal User:kafka-minion-service-dev  --topic  '*' --operation=DescribeConfigs

and of course you need that ACL as I've mentioned earlier for __consumer_offsets

/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal "User:kafka-minion-service-dev" --operation READ --topic __consumer_offsets

weeco commented 4 years ago

The __consumer_offsets topic only stores the current offset for each group. I'm not entirely sure why it would need read access for all consumer groups. In fact Kafka Minion itself does not use a consumer group, nor does it talk to the Admin API to describe other consumer groups.

As for the topics it has to describe the topics in order to get the high and low watermark for each topic, along with the cleanup policy.

hdhoang commented 4 years ago

In our environment, we don't need Read ACL for kafka-minion on data topics, other than the __consumer_offsets. We do need Describe on all topics. Maybe Describe is necessary, and Read implicitly grants that so it worked in your case.

What we have:

# Describe on Cluster
Current ACLs for resource `ResourcePattern(resourceType=CLUSTER, name=kafka-cluster, patternType=LITERAL)`: 
        (principal=User:monitoring, host=*, operation=DESCRIBE, permissionType=ALLOW) 

# Describe, DescribeConfigs on all Topics
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`:
        (principal=User:monitoring, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
        (principal=User:monitoring, host=*, operation=DESCRIBE, permissionType=ALLOW)

# Describe, Read (possibly unnecessary) on Groups
Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`: 
        (principal=User:monitoring, host=*, operation=DESCRIBE, permissionType=ALLOW)
        (principal=User:monitoring, host=*, operation=READ, permissionType=ALLOW)

# Read on topic __consumer_offsets
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`: 
        (principal=User:monitoring, host=*, operation=READ, permissionType=ALLOW) 

We gathered the necessary ACLs by looking in kafka-authorizer log denials. Do you have access to the authorizer log?

sirkubax commented 4 years ago

I'll give it another try with READ removed. Stay tuned, I'll let you know.

sirkubax commented 4 years ago

I do not have Describe on a cluster-wide scope, I wonder if it is required.... (for future readers, you apply it with --cluster) eg:

/usr/local/kafka/bin/kafka-acls.sh --bootstrap-server  kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --add --allow-principal User:kafka-minion-service-dev  --cluster --operation=Describe

Almost final ACL:

ACLs for principal `User:kafka-minion-service-dev`
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW) 

Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
        (principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW) 

Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW)
        (principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW) 

Are you sure cluster Describe is required? Seems to be working without it.

Great note with kafka-authorizer logs, I ignored it for a while - I've just checked it (post-factum) - most of the info could be found there. I did not see all though... :/

Are you sure we need a group Read permission? I've just removed them, and no error spotted yet... I've reduced ACL to

Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=READ, permissionType=ALLOW) 

Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE_CONFIGS, permissionType=ALLOW)
        (principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW) 

Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`: 
        (principal=User:kafka-minion-service-dev, host=*, operation=DESCRIBE, permissionType=ALLOW)

hdhoang commented 4 years ago

Are you sure cluster Describe is required?

Without that, our clusters (confluent-package kafka 2.1 and 2.2) log a denial every minute:

:35:51,425] INFO Principal = User:monitoring is Denied Operation = Describe from host = 10.1.2.3 on resource = Cluster:LITERAL:kafka-cluster
:36:51,344] INFO Principal = User:monitoring is Denied Operation = Describe from host = 10.1.2.3 on resource = Cluster:LITERAL:kafka-cluster
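
Denial lines like these can be summarised mechanically to find out which ACLs are still missing. The following is a sketch only: the regular expression is derived from the two lines quoted above, other Kafka versions may format authorizer entries differently, and `summarize_denials` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the two denial lines quoted above (an assumption about
# the log format, not a documented contract).
DENIAL = re.compile(
    r"Principal = (?P<principal>\S+) is Denied Operation = (?P<operation>\S+) "
    r"from host = (?P<host>\S+) on resource = (?P<resource>\S+)"
)


def summarize_denials(lines):
    """Count denials per (principal, operation, resource) combination."""
    counts = Counter()
    for line in lines:
        match = DENIAL.search(line)
        if match:
            counts[(match["principal"], match["operation"], match["resource"])] += 1
    return counts


LOG = [
    ":35:51,425] INFO Principal = User:monitoring is Denied Operation = Describe from host = 10.1.2.3 on resource = Cluster:LITERAL:kafka-cluster",
    ":36:51,344] INFO Principal = User:monitoring is Denied Operation = Describe from host = 10.1.2.3 on resource = Cluster:LITERAL:kafka-cluster",
]

for (principal, operation, resource), count in summarize_denials(LOG).items():
    print(f"{principal} needs {operation} on {resource} (denied {count} times)")
```

Each distinct combination in the output corresponds to one ACL that the monitoring principal was asking for and did not have.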

Maybe someone can point out the relevant scraping code.

a group Read permission

I think it's not needed, but I granted all accounts Read on all groups anyway, as part of transitioning from no-ACL to enforcing ACL. (We surely should tighten that since group offsets changing can cause data inconsistencies)

weeco commented 4 years ago

@hdhoang

Maybe someone can point out the relevant scraping code.

https://github.com/cloudworkz/kafka-minion/blob/9842fdc275844315b7034f9ec4fddb2a3151addd/kafka/cluster.go#L258

a group Read permission

I could imagine that this is needed because the __consumer_offsets topic contains consumer group permissions and the Kafka ACL authorizer might be aware of that?

sirkubax commented 4 years ago

Funny case: we have created another cluster with the minimal set of rules, and now the group metrics are missing

[image]

is this DescribeConfig for a group or a READ for a group?

or, in the end, are we missing this CLUSTER permission?

The problem is I do not see any error in the kafka-minion debug logs, nor in /usr/local/kafka/logs/kafka-authorizer.log

sirkubax commented 4 years ago

OK, false alarm.

I did not mention that this (group metrics) error happened on a fresh installation of a Kafka cluster. There was no consumer, no consumer group -> no metrics...

Would be nice to add some info in logs, or publish empty metrics...

solved with earlier commands

/usr/local/kafka/bin/kafka-topics.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --command-config /usr/local/kafka/config/admin.properties --create --topic kubatest

echo "TEST"|  /usr/local/kafka/bin/kafka-console-producer.sh --broker-list kafka-1.test-xxx.xxx:9093 --producer.config /usr/local/kafka/config/admin.properties --topic kubatest

/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka-1.test-xxx.xxx:9093 --consumer.config /usr/local/kafka/config/admin.properties --topic  kubatest --consumer-property group.id=kubatest
TEST
^CProcessed a total of 1 messages

justCatchingRye commented 4 years ago

I've started looking at this again with @alexanderphoenix, as we'd put it on hold while working on other tasks.

I've added the most permissive ACL setup I can think of and we still don't see any lag metrics in Prometheus for our customer apps, but we do if we create a test topic using kafka/bin scripts.

This is just an experimental configuration as I wanted to prune permissions once I had a working configuration. I'm only showing here the ACLs I've added. There are hundreds of other ACLs set (allow rules for topics, no deny rules), but nonetheless, these rules should allow kafka-minion to perform any operation on any topic including __consumer_offsets (as I understand it)

Can anyone see a reason why this wouldn't work (for lag on all topics)?

Current ACLs for resource `Group:LITERAL:*`:
    User:CN=REDACTED has Allow permission for operations: All from hosts: *

Current ACLs for resource `Topic:PREFIXED:*`:
    User:CN=REDACTED has Allow permission for operations: All from hosts: *

The pod logs seem to suggest that metric collection is occurring though:

{"level":"debug","module":"cluster","msg":"starting to collect topic offsets","time":"2020-10-22T12:03:42Z"}
...
{"level":"debug","module":"cluster","msg":"collected topic offsets","time":"2020-10-22T12:03:42Z"}

Thanks in advance for any help

sirkubax commented 4 years ago

Can anyone see a reason why this wouldn't work (for lag on all topics)? Thanks in advance for any help

I've added the most permissive ACL setup I can think of and we still don't see any lag metrics in Prometheus for our customer apps, but we do if we create a test topic using kafka/bin scripts

My guess here is, there is nothing to consume on a 'fresh', 'empty' kafka cluster, until you create some resource. Is that your case, is it an empty cluster?

Other thought, is there anything in /usr/local/kafka/logs/kafka-authorizer.log ?

justCatchingRye commented 4 years ago

Thanks for looking into this @sirkubax

I can say that it's a very active cluster. It has hundreds of topics on it and hundreds of ACL's which provide control over access to topics. There are no deny rules, which so far as I understand should mean that the ACL's I mentioned above are as permissive as possible (I think)

I tried applying a '--cluster' rule as well. I tried both operation=All (again trying to be maximally permissive for testing) and operation=Describe. In both cases, it breaks consumer access and the following error is seen on a consumer connection:

[2020-11-05 11:38:25,989] WARN [Consumer clientId=consumer-1, groupId=test-consumer-group-3] Offset commit failed on partition sandpit.demo.test_topic-0 at offset 5540: The request timed out. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)

Referring back to the earlier problem description. If I consume from a topic using kafka-console-consumer.sh, then I see lag. For the customer topics, I don't see any lag. Those connections are coming from in-house java apps and are providing a consumer-group to the connection object. I can also see those consumer groups when I run the kafka-consumer-groups command

So, I agree that this seems like it could be related to permissions, but when I give permission to everything (that I know of), I still don't see the lag. It's a very strange situation.

I don't have access to the logs directly, but I'll see if I can get hold of the logs from AWS

sirkubax commented 4 years ago

@justCatchingRye Maybe your app uses an older Kafka API, or stores offsets in Zookeeper? If so, only when you consume it with kafka-consumer-groups are those stats actually created in the Kafka backend.

I think offsets.storage=kafka setting is responsible https://gitter.im/spring-projects/spring-kafka?at=5acce83f6bbe1d2739d0bd24

https://kafka.apache.org/22/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html Storing Offsets Outside Kafka

justCatchingRye commented 4 years ago

Thanks @sirkubax, I checked this with customers to try to find out whether there is anything particular about their client setup. I'm also asking the same of AWS on the server end.

What I can say is that the Kafka client apps that we are troubleshooting have no Zookeeper configuration. They only have broker endpoints defined. Further to that, they are network-blocked from talking to Zookeeper endpoints. They can only talk to Kafka endpoints if they present a valid certificate and have appropriate ACL entries that permit access to particular topics.

justCatchingRye commented 3 years ago

I've looked into this further. I don't think there can be any issue with where offsets are stored, as our clients have no network access to ZK. Lag monitoring (kafka-minion) works as long as there are no ACLs set against a topic. As soon as ACLs are configured for a topic, lag monitoring stops.

For example, if I apply this set of rules so that I can consume and produce:

Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=sandpit.demo.test_topic, patternType=LITERAL)`:
    (principal=User:CN=sampleAppUser, host=*, operation=READ, permissionType=ALLOW)
    (principal=User:CN=sampleAppUser, host=*, operation=CREATE, permissionType=ALLOW)
    (principal=User:CN=sampleAppUser, host=*, operation=WRITE, permissionType=ALLOW)
    (principal=User:CN=sampleAppUser, host=*, operation=DESCRIBE, permissionType=ALLOW)

As soon as this is applied, monitoring stops. If I remove the entries it works again.

If I add additional rules for monitoring, they either have no effect or they break kafka-minion entirely (see below error)

1) Added read and describe to all topics

kafka-acls.sh --authorizer-properties zookeeper.connect=$zookeeper --add --allow-principal User:CN=$cn --operation Read --operation Describe --group '*' --topic '*'

Leads to this issue in kafka-minion, which doesn't really make sense if this rule is supposed to give access to ALL topics...

{"error":"kafka server: The client is not authorized to access this topic.","level":"panic","msg":"failed to get partition count","time":"2020-11-18T08:03:17Z","topic":"__consumer_offsets"}

2) Added read and describe to __consumer_offsets

kafka-acls.sh --authorizer-properties zookeeper.connect=$zookeeper --add --allow-principal User:CN=$cn --operation Read --operation Describe --group '*' --topic '__consumer_offsets'

Error above is seen on startup

3) Add cluster-level describe

kafka-acls.sh --authorizer-properties zookeeper.connect=$zookeeper --add --allow-principal User:CN=$cn --operation DESCRIBE --group '*' --topic '*' --cluster

No effect

I think what I need is a reference implementation. Something that shows a topic that is both secured by ACLs AND able to be monitored. I'm working on this, but I've tried lots of combinations already... I see no reason why the configuration below doesn't work (and causes the error above on startup):

Current ACLs for resource `ResourcePattern(resourceType=GROUP, name=*, patternType=LITERAL)`:
    (principal=User:CN=sampleAppUser, host=*, operation=READ, permissionType=ALLOW)
    (principal=User:CN=kafkaMinionUser, host=*, operation=ALL, permissionType=ALLOW)

Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=__consumer_offsets, patternType=LITERAL)`:
    (principal=User:CN=kafkaMinionUser, host=*, operation=ALL, permissionType=ALLOW)

Current ACLs for resource `ResourcePattern(resourceType=CLUSTER, name=kafka-cluster, patternType=LITERAL)`:
    (principal=User:CN=kafkaMinionUser, host=*, operation=DESCRIBE, permissionType=ALLOW)

Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=*, patternType=LITERAL)`:
    (principal=User:CN=kafkaMinionUser, host=*, operation=ALL, permissionType=ALLOW)

Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=sandpit.demo.test_topic, patternType=LITERAL)`:
    (principal=User:CN=sampleAppUser, host=*, operation=READ, permissionType=ALLOW)
    (principal=User:CN=sampleAppUser, host=*, operation=CREATE, permissionType=ALLOW)
    (principal=User:CN=sampleAppUser, host=*, operation=WRITE, permissionType=ALLOW)
    (principal=User:CN=sampleAppUser, host=*, operation=DESCRIBE, permissionType=ALLOW)

If anyone can see anything amiss, I'd be happy for any feedback. I will also ask AWS for the auth logs for this setup.

weeco commented 3 years ago

@justCatchingRye There are some managed offerings (e.g. Confluent Cloud) that use one Kafka cluster for multiple customers. In that case you won't ever get access to the __consumer_offsets topic. I'm not sure if this applies to the AWS offering though.

justCatchingRye commented 3 years ago

Thanks @weeco. Because it works when there are no ACLs configured, I think this proves that we do have access to __consumer_offsets, unless I'm misunderstanding and assuming too much there?

BitProcessor commented 3 years ago

I've been playing around with ACLs to get Kafka Minion working after also getting the "could not calculate partition lag because low water mark is missing" message.

If you are using the Kafka GitOps project to manage your ACLs, then this is the set of customServiceAcls you need.

If you're using the command-line kafka-acls tool, I'm sure you can translate the below into input for that tool.

  kafka-minion:
    acl1:
      name: __consumer_offsets
      type: TOPIC
      pattern: LITERAL
      host: "*"
      principal: User:kafka-minion
      operation: READ
      permission: ALLOW
    acl2:
      name: __consumer_offsets
      type: TOPIC
      pattern: LITERAL
      host: "*"
      principal: User:kafka-minion
      operation: DESCRIBE_CONFIGS
      permission: ALLOW
    acl3:
      name: kafka-cluster
      type: CLUSTER
      pattern: LITERAL
      principal: User:kafka-minion
      host: "*"
      operation: DESCRIBE
      permission: ALLOW
    acl4:
      name: "*"
      type: TOPIC
      pattern: LITERAL
      principal: User:kafka-minion
      host: "*"
      operation: DESCRIBE
      permission: ALLOW
    acl5:
      name: "*"
      type: TOPIC
      pattern: LITERAL
      principal: User:kafka-minion
      host: "*"
      operation: DESCRIBE_CONFIGS
      permission: ALLOW
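
One possible translation of the entries above into kafka-acls.sh invocations is sketched below. It only uses flags that already appear in commands earlier in this thread (--bootstrap-server, --command-config, --add, --allow-principal, --operation, --topic, --group, --cluster). The helper name `kafka_acls_args` is hypothetical, and the broker address and properties path are the placeholder values used by earlier commenters, not something you can copy unchanged.

```python
import shlex

# Operation spellings as used in the kafka-acls.sh commands earlier in this thread.
OPERATIONS = {"READ": "Read", "DESCRIBE": "Describe", "DESCRIBE_CONFIGS": "DescribeConfigs"}


def kafka_acls_args(acl, bootstrap_server, command_config):
    """Build the kafka-acls.sh argument list that adds one ALLOW ACL."""
    if acl["permission"] != "ALLOW" or acl["pattern"] != "LITERAL" or acl["host"] != "*":
        raise ValueError("this sketch only handles literal ALLOW rules for all hosts")
    args = [
        "kafka-acls.sh",
        "--bootstrap-server", bootstrap_server,
        "--command-config", command_config,
        "--add",
        "--allow-principal", acl["principal"],
        "--operation", OPERATIONS[acl["operation"]],
    ]
    if acl["type"] == "TOPIC":
        args += ["--topic", acl["name"]]
    elif acl["type"] == "GROUP":
        args += ["--group", acl["name"]]
    elif acl["type"] == "CLUSTER":
        args += ["--cluster"]  # used without a name, as in the command earlier in the thread
    else:
        raise ValueError(f"unsupported resource type: {acl['type']}")
    return args


def acl(name, type_, operation):
    """Fill in the fields that are identical across all five entries above."""
    return {"name": name, "type": type_, "pattern": "LITERAL", "host": "*",
            "principal": "User:kafka-minion", "operation": operation,
            "permission": "ALLOW"}


ACLS = [
    acl("__consumer_offsets", "TOPIC", "READ"),
    acl("__consumer_offsets", "TOPIC", "DESCRIBE_CONFIGS"),
    acl("kafka-cluster", "CLUSTER", "DESCRIBE"),
    acl("*", "TOPIC", "DESCRIBE"),
    acl("*", "TOPIC", "DESCRIBE_CONFIGS"),
]

for entry in ACLS:
    # shlex.join quotes the "*" so the shell does not expand it.
    print(shlex.join(kafka_acls_args(
        entry, "kafka-1.test-xxx.xxx:9093", "/usr/local/kafka/config/admin.properties")))
```

Printing the commands instead of running them makes it easy to review the list before applying anything to a cluster.
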
weeco commented 3 years ago

Thanks for all your inputs regarding the ACLs. I hope that the V2 will return more descriptive errors when Kafka cluster requests fail due to lacking permissions. We replaced sarama with franz-go which seems to be a superior Kafka client.