Hi @connorworkman, thanks so much for the detailed report! What does your network classifications block currently look like? Are these IPs potentially having their classification overridden by those rules?
These errors just indicate that prometheus was temporarily down. You're right that this shouldn't be related to this issue, but they shouldn't be repeated too frequently...
@dwbrown2 we haven't declared any overrides for network classification, but I was partly wondering if we needed to in order to get this to work... Since we're running a multi-AZ kubernetes cluster, it's hard to tell what we'd designate as in-zone: there's no telling where any kubernetes node will be placed, and it should be relative to each pod.
Here are the helm chart values we're using:
reporting:
  logCollection: false
  productAnalytics: false
  errorReporting: false
  valuesReporting: false
kubecostProductConfigs:
  awsSpotDataRegion: us-east-1
  awsSpotDataBucket: <redacted>
  awsSpotDataPrefix: dev
networkCosts:
  enabled: true
We're running on EKS 1.18.3 with one of the most recent Amazon-provided worker node AMIs for EKS 1.18; using the Amazon CNI (amazon-k8s-cni:v1.7.5). At this point I've seen some ext/internet egress labeled appropriately... but everything within the VPC still shows as in-zone despite being cross-AZ in most cases. Let me know if you have any suggestions and I appreciate your reply!
The connection errors were likely from around a helm upgrade to apply updated values; the only frequent errors we're seeing seem to be expected, since we're not using a CSV/bucket for price lists or Athena.
I0317 20:33:14.116551 1 router.go:330] Error returned to client: MissingRegion: could not find region configuration
I0317 20:43:39.378114 1 router.go:330] Error returned to client: MissingRegion: could not find region configuration
E0317 21:19:22.859855 1 log.go:17] [Error] Asset ETL: CloudAssets[FgHEB]: QueryAssetSetRange error: AthenaTable not configured
E0317 21:19:22.860437 1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0317 22:19:22.860679 1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0317 23:19:22.860899 1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0318 00:19:22.860088 1 log.go:17] [Error] Asset ETL: CloudAssets[FgHEB]: QueryAssetSetRange error: AthenaTable not configured
E0318 00:19:22.861025 1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0318 01:19:22.861228 1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0318 02:19:22.861435 1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0318 03:19:22.860336 1 log.go:17] [Error] Asset ETL: CloudAssets[FgHEB]: QueryAssetSetRange error: AthenaTable not configured
E0318 03:19:22.861578 1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
@connorworkman I think I understand the problem here, likely related to the kubernetes version. Just so there is clarity here, we start by categorizing traffic into two immediate categories:
[1] Resolvable destinations: destination IPs that resolve to a pod or node in the cluster. For example, if the destination IP resolves to Pod-B running on Node-1, then that's considered a resolvable destination.
[2] Unresolvable destinations: everything else.
By default, every destination in [2] is categorized as internet before it's tested against the configurable filters (which override the default). Every destination in [1] is categorized by default as in-zone before the source and destination nodes' region/zone data is compared.
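For reference, those configurable filters live under networkCosts.config.destinations in the helm values and end up in the network-costs-config configmap (the same keys appear further down in this thread). A minimal sketch of the structure, with empty lists as placeholders rather than recommendations:
networkCosts:
  config:
    destinations:
      # Addresses/ranges forced to the in-zone classification.
      in-zone: []
      # Addresses/ranges forced to in-region (same region, different zone).
      in-region: []
      # Addresses/ranges forced to cross-region.
      cross-region: []
      # Explicit address/range to region/zone mappings.
      direct-classification: []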
I believe that a recent kubernetes update changed the labels that are used to set region and zone on the Node instances. We've made compatibility changes in cost-model, but I believe network costs is still using the legacy labels:
"failure-domain.beta.kubernetes.io/zone"
"failure-domain.beta.kubernetes.io/region"
If you look at the labels on one of your Node instances, I'm assuming they're using:
"topology.kubernetes.io/region"
"topology.kubernetes.io/zone"
In summary, our classifier is looking up region/zone labels from an older kubernetes version, which now return "" -- since we default to in-zone, this carries through and creates the inaccuracies. The solution is likely to add support for the new labels while continuing to support the legacy ones.
If you can confirm, patching in a fix here shouldn't be too much of an issue. Thanks for your report and great catch spotting the inaccurate classification!
@connorworkman I've released network-costs v15.2. If you update your daemonset image to point to this version, it should take care of this issue for you. Let us know if there are any further issues!
labels:
  beta.kubernetes.io/arch: amd64
  beta.kubernetes.io/instance-type: c4.xlarge
  beta.kubernetes.io/os: linux
  failure-domain.beta.kubernetes.io/region: us-east-1
  failure-domain.beta.kubernetes.io/zone: us-east-1d
  kubernetes.io/arch: amd64
  kubernetes.io/hostname: ip-10-1-248-85.<redacted>.com
  kubernetes.io/os: linux
  lifecycle: spot
  node.kubernetes.io/instance-type: c4.xlarge
  topology.kubernetes.io/region: us-east-1
  topology.kubernetes.io/zone: us-east-1d
Hmm, looks like we have both sets of labels on all nodes. I've updated the kubecost-network-costs daemonset image to 15.2 just now and bounced the cost-analyzer pod for good measure -- unfortunately still seeing everything as "in zone."
Haven't had a chance to dig much deeper yet; is there a specific container log I can post that might help?
The network-costs pods will have destination traffic logs which could possibly help. However, I'm wondering how you're arriving at your diagnosis. If you're using spot nodes, are they guaranteed to stick to a specific region/zone? When our pods classify this traffic as in-zone, the classifier specifically resolves pods to nodes and extracts the region/zone from those nodes -- the source and destination nodes' zones would have to differ for the in-zone classification to be wrong.
If you're using spot nodes, are they guaranteed to stick to a specific region/zone?
When a spot node launches in our cluster(s) it automagically assigns itself those labels depending on the AZ/region/subnet it's launched in. So (most) pods aren't guaranteed to stay in any one zone, but the nodes are guaranteed to label themselves appropriately and IPs are locked to their respective subnets/AZs.
The subnet CIDR blocks for the kube nodes in this cluster are
Here's an example from the logs of one of the kubecost-network-costs pods where the source IP is in a subnet in us-east-1b and the destinations include IPs in both us-east-1c and us-east-1d:
I0318 19:23:18.166873 1 networktrafficlogger.go:76] Source: 10.1.212.211
I0318 19:23:18.166877 1 networktrafficlogger.go:77] [nonprod-elasticsearch-master-2,redacted-locust-master-255rx]
I0318 19:23:18.166880 1 networktrafficlogger.go:80] -> Dest: 10.1.222.198, [RZ] Total Bytes: 9685483, Total GB: 0.01
I0318 19:23:18.166884 1 networktrafficlogger.go:81] [nonprod-elasticsearch-data-0]
I0318 19:23:18.166886 1 networktrafficlogger.go:80] -> Dest: 10.1.241.86, [RZ] Total Bytes: 27456, Total GB: 0.00
I0318 19:23:18.166890 1 networktrafficlogger.go:81] [nonprod-elasticsearch-master-0]
I0318 19:23:18.166894 1 networktrafficlogger.go:80] -> Dest: 10.1.231.1, [RZ] Total Bytes: 26886, Total GB: 0.00
I0318 19:23:18.166899 1 networktrafficlogger.go:81] [nonprod-elasticsearch-data-1]
and here's the routing table as shown by the netroutes.go logs from the same pod:
I0318 20:23:18.172328 1 netroutes.go:61] +----------------------- Routing Table -----------------------------
I0318 20:23:18.172340 1 netroutes.go:63] | Destination: 0.0.0.0, Route: 10.1.208.1
I0318 20:23:18.172344 1 netroutes.go:63] | Destination: 10.1.211.16, Route: 0.0.0.0
I0318 20:23:18.172347 1 netroutes.go:63] | Destination: 10.1.215.239, Route: 0.0.0.0
I0318 20:23:18.172350 1 netroutes.go:63] | Destination: 10.1.221.39, Route: 0.0.0.0
I0318 20:23:18.172354 1 netroutes.go:63] | Destination: 10.1.210.236, Route: 0.0.0.0
I0318 20:23:18.172358 1 netroutes.go:63] | Destination: 10.1.217.86, Route: 0.0.0.0
I0318 20:23:18.172362 1 netroutes.go:63] | Destination: 10.1.221.252, Route: 0.0.0.0
I0318 20:23:18.172366 1 netroutes.go:63] | Destination: 10.1.223.10, Route: 0.0.0.0
I0318 20:23:18.172370 1 netroutes.go:63] | Destination: 169.254.169.254, Route: 0.0.0.0
I0318 20:23:18.172373 1 netroutes.go:63] | Destination: 10.1.208.0, Route: 0.0.0.0
I0318 20:23:18.172375 1 netroutes.go:63] | Destination: 10.1.216.233, Route: 0.0.0.0
I0318 20:23:18.172378 1 netroutes.go:63] | Destination: 10.1.212.211, Route: 0.0.0.0
I0318 20:23:18.172382 1 netroutes.go:63] | Destination: 10.1.214.227, Route: 0.0.0.0
I0318 20:23:18.172386 1 netroutes.go:63] | Destination: 10.1.215.57, Route: 0.0.0.0
I0318 20:23:18.172391 1 netroutes.go:63] | Destination: 10.1.222.214, Route: 0.0.0.0
I0318 20:23:18.172395 1 netroutes.go:65] +-------------------------------------------------------------------
Appreciate you looking into this, by the way. Our subnet blocks are pretty static, so I wouldn't necessarily be opposed to declaring them all via the direct-classification overrides in the config if need be.
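For illustration, such a declaration would presumably look something like the following in the helm values; the CIDR blocks and zones are hypothetical placeholders, and the region/zone/ips entry schema is based on the chart's commented example, so it's worth double-checking against the chart version in use:
networkCosts:
  config:
    destinations:
      direct-classification:
      # Hypothetical example: one entry per node subnet / availability zone.
      - region: "us-east-1"
        zone: "us-east-1b"
        ips:
        - "10.1.208.0/20"
      - region: "us-east-1"
        zone: "us-east-1c"
        ips:
        - "10.1.224.0/20"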
Thanks for the direct routes logs - those are normally used when the IP address is unrecognized (since they're supplied by the CNI implementation). It sounds like the symptoms of the problem are the same, but I just missed the mark on the actual cause. For now, it's probably best to add direct-classification for those CIDR blocks, and I'll dive into the classification ASAP and try and locate the actual source of the issue. Thanks again for the information and feedback here!
Will update when I find out more.
Seems like the answer might be in the default configs for networkCosts.config.destinations, which get injected into the network-costs-config configmap:
config:
  # Configuration for traffic destinations, including specific classification
  # for IPs and CIDR blocks. This configuration will act as an override to the
  # automatic classification provided by network-costs.
  destinations:
    # In Zone contains a list of address/range that will be
    # classified as in zone.
    in-zone:
      # Loopback
      - "127.0.0.1"
      # IPv4 Link Local Address Space
      - "169.254.0.0/16"
      # Private Address Ranges in RFC-1918
      - "10.0.0.0/8"
      - "172.16.0.0/12"
      - "192.168.0.0/16"
This seems to assume that everything in the private address range is in-zone, which is definitely not the case for us. Overriding the in-zone addresses to just loopback, and then adding the rest of the subnets to the direct-classifications is working.
I can try overriding the in-zone configs alone to see if that was the only issue.
The automatic classification seems to be working with just the in-zone override:
$ kubectl get cm -n kubecost network-costs-config -o yaml
apiVersion: v1
data:
  config.yaml: |
    destinations:
      cross-region: []
      direct-classification: []
      in-region: []
      in-zone:
      - 127.0.0.1/32
Scratch that, now it's classifying some cross-zone traffic as internet traffic, so I'm going back to direct classifications (which is fine by me).
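For completeness, the override being settled on here would look roughly like this in the helm values (the actual per-subnet entries are omitted, since the cluster's CIDR blocks are redacted):
networkCosts:
  config:
    destinations:
      # Narrow the default in-zone list down to loopback only, so the
      # RFC-1918 ranges no longer default to in-zone.
      in-zone:
      - "127.0.0.1/32"
      # Add one direct-classification entry per node subnet/AZ, as sketched earlier.
      direct-classification: []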
Ok, this is actually beginning to make a bit more sense now. I had already spent some time trying to get my tests to break for specific combinations of inputs without any success, so I'm glad you were able to narrow this down. I forgot that 10.0.0.0/8 was mapped to in-zone. This actually seems quite dubious after some thought, so I believe some additional documentation should be added here. I'm hesitant to remove it completely, but I will continue to give this some thought. Thanks again for all your input here! I'm going to add more documentation so that we can put this to rest for the time being.
@mbolt35 , any chance you were able to enhance the documentation and we can close this out?
@kirbsauce Nothing too advanced: https://github.com/kubecost/cost-analyzer-helm-chart/pull/1014
This issue has been marked as stale because it has not had recent activity. It will be closed if no further action occurs.
Describe the bug
When running cost-analyzer with networkCosts.enabled=true, everything seems to be working as far as metric collection and traffic monitoring, but the cost reports show all traffic as "in zone" even when a pod shows traffic history to kubernetes nodes in different AWS availability zones. Basically, the automatic classification provided by network-costs is incorrect.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The pod traffic history should show "In Region" if traffic is destined for a kubernetes node within the same cluster but residing in a different availability zone.
Screenshots
https://imgur.com/vDf62fB
Collect logs (please complete the following information):
Run helm ls and paste the output here:
Run kubectl logs <kubecost-cost-analyzer pod name> -n kubecost -c cost-analyzer-server and paste the output here:
Run kubectl logs <kubecost-cost-analyzer pod name> -n kubecost -c cost-model and paste the output here:
I do see some connection errors on the cost-model container, but not sure they're relevant to the automatic zone/region classification issue we're having.