vectordotdev / helm-charts

Helm charts for Vector.
https://vector.dev
Mozilla Public License 2.0
104 stars 90 forks source link

When running Vector in Agent mode, the Agent should tolerate all taints by default #271

Closed hawkesn closed 1 year ago

hawkesn commented 1 year ago

In Agent mode, it is designed to retrieve logs on all nodes via a Daemon Set. If there are taints on nodes, you would need to add tolerations on the Agent pod to ensure those nodes' logs are also getting grabbed.

By default, the Agent should tolerate all taints. This can be done by adding the following:

  tolerations:
    - operator: "Exists"
hawkesn commented 1 year ago

I opened up: #272 as a PR

spencergilbert commented 1 year ago

Hi @hawkesn, looking at other charts for applications in this space it seems as though they also default to no tolerations defined (one exception). Do you have any examples of this default?

  1. datadog - default tolerations: []
  2. fluentd - default tolerations: []
  3. fluent-bit - default tolerations: []
  4. promtail - default tolerations: [{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/control-plane","operator":"Exists"}]

I am trying to think of situations where you wouldn't want your telemetry collector to be on all nodes, including nodes that are being drained/cordoned - such that you can continue to ship telemetry off node for as long as possible 🤔

hawkesn commented 1 year ago

That is an excellent question!

I dug into a few other helm charts (eg. Node Termination Handler) that run as a daemon set and, you're right, those also do not have default tolerations set.

I did find this option: tolerateAllTaints as a value in aws-ebs-csi-driver: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/templates/node.yaml#L39-L46

This one also has no default tolerations set. Albeit, having that especially called-out in the values.yaml made it explicitly clear to me I should turn that on 👍.

To go against my own issue, one situation is if you're logging traffic via AWS ALB into a S3 bucket and have an "ingress" node pool receiving traffic from the ALB that runs istio/traefik/etc. pods. I would have set taints to prevent any workload pods from going onto those nodes. In this scenario, you would be duplicating logging twice and it can get especially expensive if you send all those logs to datadog as well.

For the situation above, it would also be troublesome for those already using Vector to upgrade to this change and suddenly get a massive bill because more nodes had the logging turned on.

I will be closing this issue since it does make sense to not have that on by default.