Open nitishchandrapatil opened 10 months ago
So your cluster has a mix of EC2 nodes and Fargate nodes?
The easiest solution today would be to disable Node Exporter and use Kubernetes API log gathering (which removes the DaemonSet for the Grafana Agent for Logs): https://github.com/grafana/k8s-monitoring-helm/tree/main/examples/eks-fargate
I'll look into other solutions for your situation.
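For reference, a minimal values sketch along those lines (untested; the cluster name is a placeholder and the Node Exporter toggle key is an assumption, so check the linked eks-fargate example for the exact keys):

cluster:
  name: my-eks-fargate-cluster   # placeholder

# Gather pod logs through the Kubernetes API instead of HostPath volumes,
# which removes the need for the logs DaemonSet.
logs:
  pod_logs:
    gatherMethod: api

# Assumed toggle for the Node Exporter subchart; verify against the example above.
prometheus-node-exporter:
  enabled: false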
Can you try this? We might consider applying it by default for grafana-agent-logs:
grafana-agent-logs:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/compute-type
                operator: NotIn
                values:
                  - fargate
Missing controller; it should be:
grafana-agent-logs:
  controller:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: eks.amazonaws.com/compute-type
                  operator: NotIn
                  values:
                    - fargate
Reopening. The PR I just merged sets the affinity rules for Node Exporter only. I didn't do the same for grafana-agent-logs because it makes things harder for pure-Fargate clusters: they would need to undo the default affinity in order to get a single pod scheduled, even as a Deployment.
Cool, thanks for reopening the ticket. I will test it soon once a resolution is found for this :).
Here's the trick, and why this still remains unresolved.
Node Exporter is simple. It does not go on Fargate nodes. You don't get node metrics for those nodes, but you likely don't care about them. That's AWS' problem. The PR that I merged last week sets the affinity rule to avoid fargate nodes. Done.
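For context, expressed as user-supplied values, that rule looks roughly like the sketch below (the merged PR bakes an equivalent default into the chart; prometheus-node-exporter is the assumed subchart alias here):

prometheus-node-exporter:   # assumed subchart alias in this chart's values
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/compute-type
                operator: NotIn
                values:
                  - fargate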
The other DaemonSet is the Grafana Agent for gathering logs. In DaemonSet mode with logs.pod_logs.gatherMethod=volumes, the only way we gather logs is by being on the same node as the pods. That means if we apply the same affinity rule, we lose pod logs for pods on Fargate nodes.
The workaround, especially for Fargate-only clusters, was to set the agent to be a Deployment and set logs.pod_logs.gatherMethod=api.
I was working on a "hybrid", where you could use volumes in a DaemonSet, but the instances would use the API to gather pod logs from pods on Fargate or Windows nodes. But there's a problem with memory and CPU consumption when trying to discover those pods, especially on very large clusters. It was a non-starter.
I think the ultimate solution for now will be to use logs.pod_logs.gatherMethod=api, set the affinity rule, and set the controller type to deployment.
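Putting those three pieces together as values would look something like this sketch (untested; it assumes grafana-agent-logs exposes controller.affinity as discussed below):

logs:
  pod_logs:
    gatherMethod: api          # gather pod logs via the API server

grafana-agent-logs:
  controller:
    type: deployment           # run the logs agent as a Deployment instead of a DaemonSet
    replicas: 2
    affinity:                  # keep the Deployment's pods off Fargate nodes
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: eks.amazonaws.com/compute-type
                  operator: NotIn
                  values:
                    - fargate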
@petewall, hey!
I think the ultimate solution for now will be to use logs.pod_logs.gatherMethod=api, set the affinity rule, and set the controller type to deployment.
But is nodeAffinity supported, though? The following doesn't seem to work:
# Settings related to capturing and forwarding logs
logs:
  # -- Capture and forward logs
  enabled: true
  # Settings for Kubernetes pod logs
  pod_logs:
    # -- Capture and forward logs from Kubernetes pods
    enabled: true
    # -- Controls the behavior of gathering pod logs.
    # When set to "volumes", the Grafana Agent will use HostPath volume mounts on the cluster nodes to access the pod
    # log files directly.
    # When set to "api", the Grafana Agent will access pod logs via the API server. This method may be preferable if
    # your cluster prevents DaemonSets, HostPath volume mounts, or for other reasons.
    gatherMethod: "api"

grafana-agent-logs:
  agent:
    # Enable clustering by default to make it simpler when using API-based log gathering.
    clustering: {enabled: true}
    mounts:
      # Mount /var/log from the host into the container for log collection.
      varlog: false
    controller:
      replicas: 2
      type: deployment
      # NB(khaykingleb): don't schedule the Grafana Agent on Fargate nodes
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: eks.amazonaws.com/compute-type
                    operator: NotIn
                    values:
                      - fargate
Since the agent-logs pods are still created one per node (as a DaemonSet) and the ones destined for Fargate nodes stay Pending:
$ kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
k8s-monitoring-grafana-agent-0 2/2 Running 0 3m34s
k8s-monitoring-grafana-agent-logs-295mx 0/2 Pending 0 3m34s
k8s-monitoring-grafana-agent-logs-2cwlv 2/2 Running 0 3m34s
k8s-monitoring-grafana-agent-logs-729rs 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-894j7 0/2 Pending 0 3m34s
k8s-monitoring-grafana-agent-logs-c6psl 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-jpvq9 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-kgtfq 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-klqcg 0/2 Pending 0 3m34s
k8s-monitoring-grafana-agent-logs-lqkjg 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-lttq4 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-pqn89 2/2 Running 0 3m33s
k8s-monitoring-grafana-agent-logs-qmjll 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-rxjpb 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-vw75z 0/2 Pending 0 3m34s
k8s-monitoring-grafana-agent-logs-w8kzz 0/2 Pending 0 3m33s
k8s-monitoring-grafana-agent-logs-x4hrk 2/2 Running 0 3m33s
k8s-monitoring-grafana-agent-logs-x4l7t 0/2 Pending 0 3m34s
k8s-monitoring-grafana-agent-logs-xxlxs 0/2 Pending 0 3m34s
k8s-monitoring-kube-state-metrics-556fd97bdd-g8msd 1/1 Running 0 3m34s
k8s-monitoring-prometheus-node-exporter-cb5wh 1/1 Running 0 3m34s
k8s-monitoring-prometheus-node-exporter-cstmr 1/1 Running 0 3m34s
k8s-monitoring-prometheus-node-exporter-hlknf 1/1 Running 0 3m34s
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fargate-ip-XXX.ec2.internal Ready <none> 10m v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 27h v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 17d v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 6d4h v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 10m v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 40m v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 13d v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 40m v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 27h v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 7d1h v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 40m v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 17d v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 4d21h v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 40m v1.28.5-eks-680e576
fargate-ip-XXX.ec2.internal Ready <none> 14d v1.28.5-eks-680e576
ip-XXX.ec2.internal Ready <none> 82d v1.28.5-eks-5e0fdde
ip-XXX.ec2.internal Ready <none> 41d v1.28.5-eks-5e0fdde
ip-XXX.ec2.internal Ready <none> 82d v1.28.5-eks-5e0fdde
$ kubectl describe node fargate-ip-XXX.ec2.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
eks.amazonaws.com/compute-type=fargate
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1c
kubernetes.io/arch=amd64
kubernetes.io/hostname=fargate-ip-XXX.ec2.internal
kubernetes.io/os=linux
topology.kubernetes.io/region=us-east-1
topology.kubernetes.io/zone=us-east-1c
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
Taints: eks.amazonaws.com/compute-type=fargate:NoSchedule
Unschedulable: false
@khaykingleb I think your indent of controller is incorrect.

grafana-agent-logs:
  agent:
    controller:

should be:

grafana-agent-logs:
  agent:
  controller:

Because the indent of controller in the grafana-agent helm chart is:

agent:
  ...
controller:
  ...
Oh, indeed. Thank you for pointing this out!
In the same vein, we need:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate

supported on the profiles daemonset as well. I don't see an option under https://github.com/grafana/k8s-monitoring-helm/blob/main/charts/k8s-monitoring/values.yaml#L832.
@petewall seems like the chart has changed a bit since this was originally opened, as grafana-agent-logs: doesn't seem to exist in the chart anymore.
EDIT: I changed it to alloy-logs and it worked.
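In other words, presumably something like the following, with the same controller.affinity structure moved under the renamed alloy-logs key (a sketch based on what worked above):

alloy-logs:                    # renamed from grafana-agent-logs in newer chart versions
  controller:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: eks.amazonaws.com/compute-type
                  operator: NotIn
                  values:
                    - fargate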
I tried adding it under alloy_logs: but it doesn't seem to work:
cluster:
  name: ${var.cluster_name}

alloy_logs:
  controller:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: eks.amazonaws.com/compute-type
                  operator: NotIn
                  values:
                    - fargate
Currently we are using EKS to run our workloads, with both Fargate profiles and EC2 instances. While trying to deploy the helm chart, it tries to schedule onto Fargate as well. I tried using nodeAffinity to prevent this but realised there isn't support for that. I'm not able to understand how to proceed from here. This issue is currently blocking the implementation of a monitoring solution in our environment.