Summary
The coordinator thread should be able to finish any event in less than the configured heartbeat period (default: 1 minute). Lately, on large clusters with roughly 500K partitions per datastream, every partition assignment event has been observed to take more than about 1.5 minutes to complete.
The issue appears to be in this code, where the thread is stuck in a removeAll call in which one of the collections is a list. Membership checks against a list are linear scans, so the call can degrade to quadratic time at this scale, and it may also result in higher CPU usage.
This has been confirmed with thread dumps and logs from a partition-heavy cluster.
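The slowdown described above can be illustrated with a minimal sketch. It is not the actual Brooklin code: the class name RemoveAllSketch, the helper removeAllHashed, and the integer partition ids are hypothetical and exist only for illustration. The sketch assumes the elements have consistent equals and hashCode implementations. In OpenJDK, AbstractSet.removeAll iterates the set and calls contains on the argument whenever the set is not larger than the argument, so passing a list there costs one linear scan per element. Copying the argument into a HashSet first is one way to avoid that.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not taken from the Brooklin code base.
final class RemoveAllSketch {
    private RemoveAllSketch() { }

    /**
     * Removes every element of toRemove from target.
     * toRemove is copied into a HashSet first, so any contains() check that
     * removeAll performs on the argument is an expected O(1) hash lookup
     * rather than a linear scan of a list.
     */
    static void removeAllHashed(Collection<?> target, Collection<?> toRemove) {
        Set<Object> lookup = new HashSet<>(toRemove);
        target.removeAll(lookup);
    }

    public static void main(String[] args) {
        int n = 500_000; // roughly the partition count mentioned in the summary

        // Hypothetical partition ids 0 .. n-1 currently assigned.
        Set<Integer> assigned = new HashSet<>();
        for (int i = 0; i < n; i++) {
            assigned.add(i);
        }

        // A list of the same size that overlaps the upper half of the set.
        // Passing this list straight to assigned.removeAll(...) would make
        // OpenJDK call List.contains once per set element.
        List<Integer> stale = new ArrayList<>();
        for (int i = n / 2; i < n / 2 + n; i++) {
            stale.add(i);
        }

        removeAllHashed(assigned, stale);

        // Ids 0 .. n/2 - 1 remain, so this prints 250000.
        System.out.println(assigned.size());
    }
}
```

The extra HashSet costs one pass over the list and some temporary memory, which is usually a reasonable trade against n linear scans. Whether this matches the fix applied in this change depends on the code in question.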