kubewharf / kelemetry

Global control plane tracing for Kubernetes

etcd is easily overloaded #195

Open u-kyou opened 12 months ago

u-kyou commented 12 months ago

Description

Kelemetry perfectly meets our needs, and I have been running it for a few days. But one thing that confuses me is that the etcd db size keeps increasing. It puts IO pressure on the etcd server and sometimes causes pending proposals. This could be due to a high number of k8s events. (We have also optimized etcd's data disk by using SSDs.)
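
For reference, db size growth like this can be tracked with etcdctl and etcd's own metrics; the endpoint addresses below are placeholders, and the metrics port depends on your --listen-metrics-urls configuration:

```sh
# Show current db size (and raft status) per endpoint
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-0:2379 endpoint status -w table

# Pending proposals are exposed as a gauge on the metrics listener
curl -s http://etcd-0:2381/metrics | grep etcd_server_proposals_pending
```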

If Kelemetry could support event filtering, it would be a great help. For example, filtering out periodic events that I don't care about. (Is redactPattern meant for this kind of filtering? I tried it, but it did not work well.)


SOF3 commented 12 months ago

Regarding "db size keeps increasing", there are some TTL values that you may want to tune them as well:

Would you provide more details on the "periodic events" that are causing problems for you? Could you try checking which types of keys have the highest traffic in your cluster? This information would help us in deciding if adding support for more robust database backends (e.g. redis, tikv, etc) is necessary.

u-kyou commented 11 months ago
  • diff-controller-redact-pattern tells the diff controller not to record the contents of an object. This is primarily used to prevent exposing secret contents directly to Kelemetry viewers.
  • filter-exclude-types tells all components (audit, diff, k8s-event) to ignore all events related to certain object types. This is primarily used to suppress noisy objects such as leases (which update several times per node every minute due to leader lease renewal) and events (we skip audit and diff for events because k8s-event tracks them anyway).
  • filter-exclude-user-agent tells the audit consumer to ignore events from certain user agents, such as those used explicitly for leader election.
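
For illustration, a minimal sketch of passing these flags to the kelemetry binary; the flag names come from the list above, but the invocation and value formats are assumptions, so check the actual option docs:

```sh
# Hypothetical invocation; exact value syntax is an assumption.
# filter-exclude-types appears to take group/resource values, e.g. the
# apisix.apache.org/apisixupstreams example later in this thread.
kelemetry \
  --diff-controller-redact-pattern='(?i)password|token' \
  --filter-exclude-types='apisix.apache.org/apisixupstreams' \
  --filter-exclude-user-agent='leader-election'
```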

Regarding "db size keeps increasing", there are some TTL values that you may want to tune them as well:

  • diff-cache-patch-ttl, diff-cache-snapshot-ttl: these control how long a cached object diff remains in the database after its create/update/delete event so that the audit consumer can read it. They can be set to smaller values if your audit webhook/producer delivers events to the consumer quickly enough.
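
As a sketch, assuming these flags accept Go-style duration strings (an assumption; check the flag docs), shortening the retention might look like:

```sh
# Illustrative values only; pick TTLs longer than your worst-case
# audit-event delivery latency, or diffs will expire before they are read.
kelemetry \
  --diff-cache-patch-ttl=10m \
  --diff-cache-snapshot-ttl=10m
```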

Would you provide more details on the "periodic events" that are causing problems for you? Could you try checking which types of keys have the highest traffic in your cluster? This information would help us decide whether adding support for more robust database backends (e.g. redis, tikv) is necessary.
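
One rough way to check this, assuming shell access to the etcd behind the cache and path-style key names (both assumptions), is to bucket the keys by prefix:

```sh
# List key names only (no values) and count them by their first two
# path segments. This can be expensive on a large keyspace.
ETCDCTL_API=3 etcdctl get "" --prefix --keys-only \
  | awk -F/ 'NF {print $2 "/" $3}' \
  | sort | uniq -c | sort -rn | head -20
```

Note this counts live keys rather than write traffic; for actual write rates you would need etcd's metrics (e.g. etcd_mvcc_put_total).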

u-kyou commented 11 months ago

Thank you very much for your reply! @SOF3

The "periodic events" I mentioned above: We use many CRDs which could produce many events periodically, for example:

(screenshot: apisixupstreams-event)

I tried filter-exclude-types to exclude apisix.apache.org/apisixupstreams, and it works well. That's exactly what I need.

I also tried setting smaller values for the diff-cache-patch-ttl and diff-cache-snapshot-ttl params, and the total number of etcd keys decreased significantly. After I ran a defrag on the etcd server, the db size shrank to a reasonable value. But from our etcd monitoring, write IOPS did not decrease and there are still a few pending proposals (maybe adding support for redis is a good idea?!)
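
For reference, the defrag mentioned above can be paired with an explicit compaction first; endpoints and the jq usage here are placeholders, and defrag blocks the member while it runs:

```sh
# Compact away old revisions first, then defragment to release the freed space
rev=$(ETCDCTL_API=3 etcdctl endpoint status -w json | jq -r '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl compaction "$rev"
ETCDCTL_API=3 etcdctl defrag --cluster
```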

(screenshot: etcd-proposals_pending)