kubecost / cost-analyzer-helm-chart

Kubecost helm chart
http://kubecost.com/install
Apache License 2.0

Network traffic between AWS availability zones shows as "in zone" between k8s nodes in different AZs #820

Closed: connorworkman closed this issue 1 year ago

connorworkman commented 3 years ago

Describe the bug

When running cost-analyzer with networkCosts.enabled=true, everything seems to be working as far as metric collection and traffic monitoring are concerned, but the cost reports show all traffic as "in zone" even when a pod has traffic history to kubernetes nodes in different AWS availability zones. Basically, the automatic classification provided by network-costs is incorrect.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy the cost-analyzer with networkCosts.enabled=true in kubernetes within AWS
  2. Confirm network traffic metrics are being collected in prometheus.
  3. Check cost breakdown for a pod that communicates with pods in other AZs.
  4. Observe that traffic to network interfaces in different AZs shows as "in zone."

Expected behavior

The pod traffic history should show "In Region" if traffic is destined for a kubernetes node within the same cluster but residing in a different availability zone.

Screenshots

https://imgur.com/vDf62fB

Collect logs (please complete the following information):

I do see some connection errors on the cost-model container, but I'm not sure they're relevant to the automatic zone/region classification issue we're having.

E0316 21:29:49.518884       1 log.go:17] [Error] ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22false%22%2C+sameZone%3D%22false%22%2C+sameRegion%3D%22true%22%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024": dial tcp 172.20.170.248:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="true"}[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
E0316 21:29:49.518887       1 log.go:17] [Error] ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="true"}[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
E0316 21:29:49.518894       1 log.go:17] [Error] ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22false%22%2C+sameZone%3D%22false%22%2C+sameRegion%3D%22false%22%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024": dial tcp 172.20.170.248:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="false"}[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
E0316 21:29:49.518901       1 log.go:17] [Error] ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="false"}[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
E0316 21:29:49.518909       1 log.go:17] [Error] ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22true%22%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024": dial tcp 172.20.170.248:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="true"}[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
E0316 21:29:49.518913       1 log.go:17] [Error] ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="true"}[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
E0316 21:29:49.518920       1 log.go:17] [Error] ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=max%28count_over_time%28kube_pod_container_resource_requests_memory_bytes%7B%7D%5B2m%5D+%29%29": dial tcp 172.20.170.248:80: connect: connection refused' fetching query 'max(count_over_time(kube_pod_container_resource_requests_memory_bytes{}[2m] ))'
E0316 21:29:49.518924       1 log.go:17] [Error] ComputeCostData: Parsing Error: Prometheus communication error: max(count_over_time(kube_pod_container_resource_requests_memory_bytes{}[2m] ))
E0316 21:29:49.518930       1 log.go:17] [Error] Error in price recording: 8 errors occurred
E0316 21:30:03.630825       1 log.go:17] [Error] CostDataRange: Request Error: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2021-03-16T21%3A28%3A22.981901509Z&query=%0A%09%09label_replace%28label_replace%28%0A%09%09%09sum%28%0A%09%09%09%09sum_over_time%28container_memory_allocation_bytes%7Bcontainer%21%3D%22%22%2Ccontainer%21%3D%22POD%22%2C+node%21%3D%22%22%7D%5B1h%5D%29%0A%09%09%09%29+by+%28namespace%2Ccontainer%2Cpod%2Cnode%2Ccluster_id%29+%2A+60.000000+%2F+60+%2F+60%0A%09%09%2C+%22container_name%22%2C%22%241%22%2C%22container%22%2C%22%28.%2B%29%22%29%2C+%22pod_name%22%2C%22%241%22%2C%22pod%22%2C%22%28.%2B%29%22%29&start=2021-03-15T21%3A28%3A22.981901509Z&step=3600.000": dial tcp 172.20.170.248:80: connect: connection refused, Body:  Query: 
                label_replace(label_replace(
                        sum(
                                sum_over_time(container_memory_allocation_bytes{container!="",container!="POD", node!=""}[1h])
                        ) by (namespace,container,pod,node,cluster_id) * 60.000000 / 60 / 60
                , "container_name","$1","container","(.+)"), "pod_name","$1","pod","(.+)")
E0316 21:30:03.630844       1 log.go:17] [Error] CostDataRange: Parsing Error: Prometheus communication error: 
                label_replace(label_replace(
                        sum(
                                sum_over_time(container_memory_allocation_bytes{container!="",container!="POD", node!=""}[1h])
                        ) by (namespace,container,pod,node,cluster_id) * 60.000000 / 60 / 60
                , "container_name","$1","container","(.+)"), "pod_name","$1","pod","(.+)")
E0316 21:30:03.630855       1 log.go:17] [Error] CostDataRange: Request Error: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2021-03-16T21%3A28%3A22.981901509Z&query=avg%28%0A%09%09label_replace%28%0A%09%09%09label_replace%28%0A%09%09%09%09avg%28%0A%09%09%09%09%09count_over_time%28kube_pod_container_resource_requests_memory_bytes%7Bcontainer%21%3D%22%22%2Ccontainer%21%3D%22POD%22%2C+node%21%3D%22%22%7D%5B1h%5D+%29%0A%09%09%09%09%09%2A%0A%09%09%09%09%09avg_over_time%28kube_pod_container_resource_requests_memory_bytes%7Bcontainer%21%3D%22%22%2Ccontainer%21%3D%22POD%22%2C+node%21%3D%22%22%7D%5B1h%5D+%29%0A%09%09%09%09%29+by+%28namespace%2Ccontainer%2Cpod%2Cnode%2Ccluster_id%29+%2C+%22container_name%22%2C%22%241%22%2C%22container%22%2C%22%28.%2B%29%22%0A%09%09%09%29%2C+%22pod_name%22%2C%22%241%22%2C%22pod%22%2C%22%28.%2B%29%22%0A%09%09%29%0A%09%29+by+%28namespace%2Ccontainer_name%2Cpod_name%2Cnode%2Ccluster_id%29&start=2021-03-15T21%3A28%3A22.981901509Z&step=3600.000": dial tcp 172.20.170.248:80: connect: connection refused, Body:  Query: avg(
...
dwbrown2 commented 3 years ago

Hi @connorworkman, thanks so much for the detailed report! What does your network classifications block currently look like? Are these IPs potentially having their classification overridden by those rules?

These errors just indicate that prometheus was temporarily down. You're right that they shouldn't be related to this issue, but they also shouldn't be repeated too frequently...

connorworkman commented 3 years ago

@dwbrown2 we haven't declared any overrides for network classification, but I was partly wondering if we needed to in order to get this to work... Since we're running a multi-AZ kubernetes cluster, it's hard to tell what we'd designate as in-zone: there's no telling where any kubernetes node will be placed, and the classification should be relative to each pod.

Here are the helm chart values we're using:

reporting:
  logCollection: false
  productAnalytics: false
  errorReporting: false
  valuesReporting: false
kubecostProductConfigs:
  awsSpotDataRegion: us-east-1
  awsSpotDataBucket: <redacted>
  awsSpotDataPrefix: dev
networkCosts:
  enabled: true

We're running on EKS 1.18.3 with one of the most recent Amazon-provided worker node AMIs for EKS 1.18, using the Amazon CNI (amazon-k8s-cni:v1.7.5). At this point I've seen some ext/internet egress labeled appropriately... but everything within the VPC still shows as in-zone despite being cross-AZ in most cases. Let me know if you have any suggestions, and I appreciate your reply!

connorworkman commented 3 years ago

The connection errors were likely from around a helm upgrade to apply updated values; the only frequent errors we're seeing now seem to be expected, since we're not using a CSV/bucket for price lists or Athena.

I0317 20:33:14.116551       1 router.go:330] Error returned to client: MissingRegion: could not find region configuration
I0317 20:43:39.378114       1 router.go:330] Error returned to client: MissingRegion: could not find region configuration
E0317 21:19:22.859855       1 log.go:17] [Error] Asset ETL: CloudAssets[FgHEB]: QueryAssetSetRange error: AthenaTable not configured
E0317 21:19:22.860437       1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0317 22:19:22.860679       1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0317 23:19:22.860899       1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0318 00:19:22.860088       1 log.go:17] [Error] Asset ETL: CloudAssets[FgHEB]: QueryAssetSetRange error: AthenaTable not configured
E0318 00:19:22.861025       1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0318 01:19:22.861228       1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0318 02:19:22.861435       1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
E0318 03:19:22.860336       1 log.go:17] [Error] Asset ETL: CloudAssets[FgHEB]: QueryAssetSetRange error: AthenaTable not configured
E0318 03:19:22.861578       1 log.go:17] [Error] Asset ETL: Reconciliation[mcCaV]: QueryAssetSetRange error: No Athena Bucket configured
mbolt35 commented 3 years ago

@connorworkman I think I understand the problem here; it's likely related to the kubernetes version. Just so there is clarity here, we start by categorizing traffic into two immediate categories:

  1. Destinations we can resolve -- if we can resolve the destination to Pod-B running on Node-1, then that's considered a resolvable destination.
  2. Destinations we can't resolve via the Kubernetes API or node route tables.

By default, every destination in [2] is categorized as internet and is then tested against the configurable filters (which override the default).

Every destination in [1] is categorized by default as in-zone, and then the source and destination nodes' region/zone data are compared.
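
In other words, the default behavior works roughly like the sketch below (a simplified illustration only, not the actual network-costs source; the node type and classify function here are hypothetical, and the configurable override filters are omitted):

package main

import "fmt"

// node holds the region/zone metadata extracted from a Kubernetes node.
type node struct{ region, zone string }

// classify mirrors the defaults described above: an unresolvable destination
// (dst == nil) defaults to "internet" before the override filters are applied,
// while a resolvable destination defaults to "in-zone" and is then refined by
// comparing the source and destination nodes' region/zone.
func classify(src node, dst *node) string {
    if dst == nil {
        return "internet" // category [2] default
    }
    switch {
    case src.region != dst.region:
        return "cross-region"
    case src.zone != dst.zone:
        return "in-region"
    default:
        return "in-zone" // category [1] default
    }
}

func main() {
    a := node{region: "us-east-1", zone: "us-east-1b"}
    b := node{region: "us-east-1", zone: "us-east-1d"}
    fmt.Println(classify(a, &b)) // "in-region": same region, different zone
    fmt.Println(classify(a, nil)) // "internet": destination could not be resolved
}

Note that if both nodes report an empty zone, the comparison falls through to the "in-zone" default, which is the failure mode discussed below.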

I believe that a recent kubernetes update changed the labels that are used to set region and zone on the Node instances. We've made compatibility changes in cost-model, but I believe network-costs is still using the legacy labels:

"failure-domain.beta.kubernetes.io/zone"
"failure-domain.beta.kubernetes.io/region"

If you look at the labels on one of your Node instances, I'm assuming they're using:

"topology.kubernetes.io/region"
"topology.kubernetes.io/zone"

In summary, our classifier is using region/zone labels from an older kubernetes version, which now return "" -- and since an empty zone compares equal to another empty zone and we default to in-zone, everything carries through as in-zone. The solution is likely to add support for the new labels while continuing to support the legacy ones.
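
A compatibility shim along those lines could look roughly like this (a sketch only; the label keys are the standard Kubernetes well-known labels quoted above, but the function itself is hypothetical and not the actual kubecost code):

package main

import "fmt"

// regionZone prefers the current topology.kubernetes.io labels and falls
// back to the legacy failure-domain.beta.kubernetes.io labels, so nodes
// carrying either set of labels resolve to a non-empty region/zone.
func regionZone(labels map[string]string) (region, zone string) {
    if region = labels["topology.kubernetes.io/region"]; region == "" {
        region = labels["failure-domain.beta.kubernetes.io/region"]
    }
    if zone = labels["topology.kubernetes.io/zone"]; zone == "" {
        zone = labels["failure-domain.beta.kubernetes.io/zone"]
    }
    return region, zone
}

func main() {
    legacyOnly := map[string]string{
        "failure-domain.beta.kubernetes.io/region": "us-east-1",
        "failure-domain.beta.kubernetes.io/zone":   "us-east-1d",
    }
    fmt.Println(regionZone(legacyOnly)) // us-east-1 us-east-1d
}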

If you can confirm, patching in a fix here shouldn't be too much of an issue. Thanks for your report and great catch spotting the inaccurate classification!

mbolt35 commented 3 years ago

@connorworkman I've released network-costs v15.2; if you update your daemonset image to point to this version, it should take care of this issue for you. Let us know if there are any further issues!

connorworkman commented 3 years ago

    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/instance-type: c4.xlarge
      beta.kubernetes.io/os: linux
      failure-domain.beta.kubernetes.io/region: us-east-1
      failure-domain.beta.kubernetes.io/zone: us-east-1d
      kubernetes.io/arch: amd64
      kubernetes.io/hostname: ip-10-1-248-85.<redacted>.com
      kubernetes.io/os: linux
      lifecycle: spot
      node.kubernetes.io/instance-type: c4.xlarge
      topology.kubernetes.io/region: us-east-1
      topology.kubernetes.io/zone: us-east-1d

Hmm, looks like we have both sets of labels on all nodes. I've updated the kubecost-network-costs daemonset image to 15.2 just now and bounced the cost-analyzer pod for good measure -- unfortunately still seeing everything as "in zone."

Haven't had a chance to dig much deeper yet; is there a specific container log I can post that might help?

mbolt35 commented 3 years ago

The network-costs pods will have destination traffic logs, which could possibly help. However, I'm wondering how you're arriving at your diagnosis. If you're using spot nodes, are they guaranteed to stick to a specific region/zone? When our pods classify this traffic as in-zone, they specifically resolve pods to nodes and extract the region/zone from those nodes -- the source and destination nodes would have to be in different zones to refute the in-zone classification.

connorworkman commented 3 years ago

If you're using spot nodes, are they guaranteed to stick to a specific region/zone?

When a spot node launches in our cluster(s) it automagically assigns itself those labels depending on the AZ/region/subnet it's launched in. So (most) pods aren't guaranteed to stay in any one zone, but the nodes are guaranteed to label themselves appropriately and IPs are locked to their respective subnets/AZs.

The subnet CIDR blocks for the kube nodes in this cluster are

Here's an example from the logs of one of the kubecost-network-costs pods where the source IP is in a subnet in us-east-1b and the destinations include IPs in both us-east-1c and us-east-1d:

I0318 19:23:18.166873       1 networktrafficlogger.go:76] Source: 10.1.212.211
I0318 19:23:18.166877       1 networktrafficlogger.go:77] [nonprod-elasticsearch-master-2,redacted-locust-master-255rx]
I0318 19:23:18.166880       1 networktrafficlogger.go:80]   -> Dest: 10.1.222.198, [RZ] Total Bytes: 9685483, Total GB: 0.01
I0318 19:23:18.166884       1 networktrafficlogger.go:81]      [nonprod-elasticsearch-data-0]
I0318 19:23:18.166886       1 networktrafficlogger.go:80]   -> Dest: 10.1.241.86, [RZ] Total Bytes: 27456, Total GB: 0.00
I0318 19:23:18.166890       1 networktrafficlogger.go:81]      [nonprod-elasticsearch-master-0]
I0318 19:23:18.166894       1 networktrafficlogger.go:80]   -> Dest: 10.1.231.1, [RZ] Total Bytes: 26886, Total GB: 0.00
I0318 19:23:18.166899       1 networktrafficlogger.go:81]      [nonprod-elasticsearch-data-1]

and here's the routing table as shown by the netroutes.go logs from the same pod:

I0318 20:23:18.172328       1 netroutes.go:61] +----------------------- Routing Table -----------------------------
I0318 20:23:18.172340       1 netroutes.go:63] | Destination: 0.0.0.0, Route: 10.1.208.1
I0318 20:23:18.172344       1 netroutes.go:63] | Destination: 10.1.211.16, Route: 0.0.0.0
I0318 20:23:18.172347       1 netroutes.go:63] | Destination: 10.1.215.239, Route: 0.0.0.0
I0318 20:23:18.172350       1 netroutes.go:63] | Destination: 10.1.221.39, Route: 0.0.0.0
I0318 20:23:18.172354       1 netroutes.go:63] | Destination: 10.1.210.236, Route: 0.0.0.0
I0318 20:23:18.172358       1 netroutes.go:63] | Destination: 10.1.217.86, Route: 0.0.0.0
I0318 20:23:18.172362       1 netroutes.go:63] | Destination: 10.1.221.252, Route: 0.0.0.0
I0318 20:23:18.172366       1 netroutes.go:63] | Destination: 10.1.223.10, Route: 0.0.0.0
I0318 20:23:18.172370       1 netroutes.go:63] | Destination: 169.254.169.254, Route: 0.0.0.0
I0318 20:23:18.172373       1 netroutes.go:63] | Destination: 10.1.208.0, Route: 0.0.0.0
I0318 20:23:18.172375       1 netroutes.go:63] | Destination: 10.1.216.233, Route: 0.0.0.0
I0318 20:23:18.172378       1 netroutes.go:63] | Destination: 10.1.212.211, Route: 0.0.0.0
I0318 20:23:18.172382       1 netroutes.go:63] | Destination: 10.1.214.227, Route: 0.0.0.0
I0318 20:23:18.172386       1 netroutes.go:63] | Destination: 10.1.215.57, Route: 0.0.0.0
I0318 20:23:18.172391       1 netroutes.go:63] | Destination: 10.1.222.214, Route: 0.0.0.0
I0318 20:23:18.172395       1 netroutes.go:65] +-------------------------------------------------------------------

Appreciate you looking into this by the way. Our subnet blocks are pretty static, so I wouldn't necessarily be opposed to declaring them all in the configs in the direct-classification overrides if need be.

mbolt35 commented 3 years ago

Thanks for the direct routes logs - those are normally used when the IP address is unrecognized (since they're supplied by the CNI implementation). It sounds like the symptoms of the problem are the same, but I just missed the mark on the actual cause. For now, it's probably best to add direct-classification for those CIDR blocks, and I'll dive into the classification ASAP and try and locate the actual source of the issue. Thanks again for the information and feedback here!

Will update when I find out more.

connorworkman commented 3 years ago

Seems like the answer might be in the default configs for networkCosts.config.destinations which gets injected into the network-costs-config configmap:

  config:
    # Configuration for traffic destinations, including specific classification
    # for IPs and CIDR blocks. This configuration will act as an override to the
    # automatic classification provided by network-costs.
    destinations:
      # In Zone contains a list of address/range that will be
      # classified as in zone.
      in-zone:
        # Loopback
        - "127.0.0.1"
        # IPv4 Link Local Address Space
        - "169.254.0.0/16"
        # Private Address Ranges in RFC-1918
        - "10.0.0.0/8"
        - "172.16.0.0/12"
        - "192.168.0.0/16"

This seems to assume that everything in the RFC-1918 private address ranges is in-zone, which is definitely not the case for us. Overriding the in-zone addresses to just loopback and adding the rest of the subnets to the direct-classifications is working.

I can try overriding the in-zone configs alone to see if that was the only issue.
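
For reference, an in-zone-plus-direct-classification override like the one described above might look roughly like this in the helm values (the CIDR blocks and zone assignments are made-up placeholders, and the exact direct-classification field layout -- region/zone/ips -- should be double-checked against the chart's values.yaml):

networkCosts:
  enabled: true
  config:
    destinations:
      # Keep only loopback unconditionally in-zone.
      in-zone:
        - "127.0.0.1"
      # Pin each node subnet to its zone explicitly.
      # These CIDRs and zones are illustrative placeholders, not taken from this cluster.
      direct-classification:
        - region: "us-east-1"
          zone: "us-east-1b"
          ips:
            - "10.1.208.0/20"
        - region: "us-east-1"
          zone: "us-east-1c"
          ips:
            - "10.1.224.0/20"
        - region: "us-east-1"
          zone: "us-east-1d"
          ips:
            - "10.1.240.0/20"

Because these entries act as an override to the automatic classification (per the chart's own comments), traffic to the listed CIDRs is pinned to the given zone regardless of what the automatic logic would have decided.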

connorworkman commented 3 years ago

The automatic classification seems to be working with just the in-zone override:

$ kubectl get cm -n kubecost network-costs-config -o yaml
apiVersion: v1
data:
  config.yaml: |
    destinations:
      cross-region: []
      direct-classification: []
      in-region: []
      in-zone:
      - 127.0.0.1/32

Scratch that, now it's classifying some cross-zone traffic as internet traffic, so I'm going back to direct classifications (which is fine by me).

mbolt35 commented 3 years ago

Seems like the answer might be in the default configs for networkCosts.config.destinations which gets injected into the network-costs-config configmap ... Overriding the in-zone addresses to just loopback and adding the rest of the subnets to the direct-classifications is working. I can try overriding the in-zone configs alone to see if that was the only issue.

Ok, this is actually beginning to make a bit more sense now. I had already spent some time trying to get my tests to break for specific combinations of inputs without any success, so I'm glad you were able to narrow this down. I forgot that 10.0.0.0/8 was mapped as in-zone. That actually seems quite dubious after some thought, so I believe some additional documentation should be added here. I'm hesitant to remove it completely, but I will continue to give this some thought. Thanks again for all your input here! I'm going to add more documentation so that we can put this to rest for the time being.

kirbsauce commented 3 years ago

@mbolt35, any chance you were able to enhance the documentation so we can close this out?

mbolt35 commented 3 years ago

@kirbsauce Nothing too advanced: https://github.com/kubecost/cost-analyzer-helm-chart/pull/1014

Adam-Stack-PM commented 1 year ago

This issue has been marked as stale because it has not had recent activity. It will be closed if no further action occurs.