kubecost / features-bugs

A public repository for filing of Kubecost feature requests and bugs. Please read the issue guidelines before filing an issue here.

[Bug] full diagnostics cannot access (No change from "Running Diagnostics...") #99

Open: githubeto opened this issue 4 months ago

githubeto commented 4 months ago

Kubecost Version

2.2.5

Kubernetes Version

1.28

Kubernetes Platform

EKS

Description

As shown in the screenshot, I cannot access the full diagnostics page; it never changes from "Running Diagnostics...". Why is that? We are using the Athena configuration, but the CUR has not yet arrived in S3, so we are still waiting for it.

helm chart

global:
  grafana:
    enabled: false
    proxy: false
  prometheus:
    enabled: true
ingress:
  enabled: false
kubecostModel:
  etlAssetReconciliationEnabled: false
  etlCloudUsage: false
  extraEnv:
  - name: LOG_LEVEL
    value: warn
  utcOffset: "+09:00"
kubecostProductConfigs:
  athenaBucketName: s3://skystyle-mng-athena-log
  athenaDatabase: athenacurcfn_skystyle_mng_kubecost
  athenaProjectID: "xxxxxxxxxxx"
  athenaRegion: ap-northeast-1
  athenaTable: skystyle_mng_kubecost
  athenaWorkgroup: spdkube-aws-mgr-athena-workgroup
  awsSpotDataBucket: spot-instance-datafeed-subscription
  awsSpotDataRegion: ap-northeast-1
  projectID: "xxxxxxxxxxx"
kubecostToken: xxxxxxxxxxx
networkPolicy:
  enabled: false
persistentVolume:
  dbSize: 32Gi
  enabled: true
  size: 32Gi
pricingCsv:
  enabled: false
priority:
  enabled: false
prometheus:
  server:
    global:
      evaluation_interval: 1m
      external_labels:
        cluster_id: aws-mgr
      scrape_interval: 1m
      scrape_timeout: 60s
reporting:
  productAnalytics: false
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/aws-mgr-kubecost-role
  create: true
  name: kubecost

kubectl logs -f deployment.apps/kubecost-cost-analyzer -c cost-model | grep ERR

ERR Failed to query prometheus at http://kubecost-prometheus-server.kubecost. Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=up&time=1716355356": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'up' . Troubleshooting help available at: http://docs.kubecost.com/custom-prom#troubleshoot
ERR Failed to lookup reserved instance data: no reservation data available in Athena
ERR Failed to lookup savings plan data: Error fetching Savings Plan Data: QueryAthenaPaginated: query execution error: no query results available for query 5266696b-f7d7-4570-9dc5-013eb76a8690
ERR Alerts config file failed to load: open /var/configs/alerts/alerts.json: no such file or directory
ERR savings: cluster sizing: failed to get monthly cluster rates: error getting valid asset set in MonthlyNodeClusterRates: failed to query from assets for 2024-05-22 00:00:00 +0000 UTC/2024-05-23 00:00:00 +0000 UTC: boundary error: requested [2024-05-22T00:00:00+0000, 2024-05-23T00:00:00+0000); supported [2024-05-22T02:00:00+0000, 2024-05-22T05:22:51+0000): Store[1h]: store does not have coverage to perform query
ERR Asset ETL: ComputeAssets: clusterManagementQuery: Prometheus communication error: sum_over_time((avg(kubecost_cluster_management_cost{}) by (cluster_id))[60m:1m] offset 322m) * 0.016667: retrying
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-21 00:00:00 +0000 UTC End:2024-05-22 00:00:00 +0000 UTC}': building [2024-05-21 00:00:00 +0000 UTC-2024-05-21 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-21T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-21T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-20 00:00:00 +0000 UTC End:2024-05-21 00:00:00 +0000 UTC}': building [2024-05-20 00:00:00 +0000 UTC-2024-05-20 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-20T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-20T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR Asset ETL: ComputeAssets: clusterManagementQuery: Prometheus communication error: sum_over_time((avg(kubecost_cluster_management_cost{}) by (cluster_id))[60m:1m] offset 322m) * 0.016667: retrying
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-19 00:00:00 +0000 UTC End:2024-05-20 00:00:00 +0000 UTC}': building [2024-05-19 00:00:00 +0000 UTC-2024-05-19 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-19T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-19T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR CostModel.ComputeAllocation: failed to build pod map: Prometheus communication error: avg(kube_pod_container_status_running{} != 0) by (pod, namespace, cluster_id)[1h:5m]
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22false%22%2C+sameZone%3D%22false%22%2C+sameRegion%3D%22true%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=avg%28%0A%09%09label_replace%28%0A%09%09%09label_replace%28%0A%09%09%09%09label_replace%28%0A%09%09%09%09%09sum_over_time%28container_memory_working_set_bytes%7Bcontainer%21%3D%22%22%2C+container%21%3D%22POD%22%2C+instance%21%3D%22%22%2C+%7D%5B2m%5D+%29%2C+%22node%22%2C+%22%241%22%2C+%22instance%22%2C+%22%28.%2B%29%22%0A%09%09%09%09%29%2C+%22container_name%22%2C+%22%241%22%2C+%22container%22%2C+%22%28.%2B%29%22%0A%09%09%09%29%2C+%22pod_name%22%2C+%22%241%22%2C+%22pod%22%2C+%22%28.%2B%29%22%0A%09%09%29%0A%09%29+by+%28namespace%2C+container_name%2C+pod_name%2C+node%2C+cluster_id%29&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'avg(
ERR ComputeCostData: Parsing Error: Prometheus communication error: avg(
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22false%22%2C+sameZone%3D%22false%22%2C+sameRegion%3D%22false%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="false", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="false", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=avg%28%0A%09%09label_replace%28%0A%09%09%09label_replace%28%0A%09%09%09%09label_replace%28%0A%09%09%09%09%09rate%28%0A%09%09%09%09%09%09container_cpu_usage_seconds_total%7Bcontainer%21%3D%22%22%2C+container%21%3D%22POD%22%2C+instance%21%3D%22%22%2C+%7D%5B2m%5D+%0A%09%09%09%09%09%29%2C+%22node%22%2C+%22%241%22%2C+%22instance%22%2C+%22%28.%2B%29%22%0A%09%09%09%09%29%2C+%22container_name%22%2C+%22%241%22%2C+%22container%22%2C+%22%28.%2B%29%22%0A%09%09%09%29%2C+%22pod_name%22%2C+%22%241%22%2C+%22pod%22%2C+%22%28.%2B%29%22%0A%09%09%29%0A%09%29+by+%28namespace%2C+container_name%2C+pod_name%2C+node%2C+cluster_id%29&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'avg(
ERR ComputeCostData: Parsing Error: Prometheus communication error: avg(
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22true%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR CostModel.ComputeAllocation: query context error Errors:
ERR CostModel.ComputeAllocation: query context error Errors:
ERR CostModel.ComputeAllocation: query context error Errors:
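
Most of the ERR lines above repeat a handful of underlying failures. The sketch below is one way to group them by likely cause so the repeated Prometheus failures do not hide the others. The category names and regular expressions are assumptions made for this illustration and are not part of Kubecost. The sample lines are abbreviated copies of the log lines above.

```python
import re
from collections import Counter

# Hypothetical categories and patterns, chosen only for this illustration.
# The first matching pattern wins, so order matters.
PATTERNS = [
    ("prometheus_unreachable",
     re.compile(r"connection refused|Prometheus communication error")),
    ("athena_no_data", re.compile(r"Athena")),
    ("missing_config_file", re.compile(r"no such file or directory")),
    ("etl_coverage", re.compile(r"store does not have coverage")),
]


def classify(line: str) -> str:
    """Return the first category whose pattern matches the line, else 'other'."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return "other"


def summarize(lines) -> Counter:
    """Count ERR lines per category; lines that do not start with ERR are ignored."""
    return Counter(classify(line) for line in lines if line.startswith("ERR"))


# Abbreviated copies of log lines from this issue.
SAMPLE = [
    "ERR Failed to query prometheus at http://kubecost-prometheus-server.kubecost. "
    "Error: dial tcp 172.20.218.103:80: connect: connection refused",
    "ERR Failed to lookup reserved instance data: no reservation data available in Athena",
    "ERR Failed to lookup savings plan data: QueryAthenaPaginated: query execution error",
    "ERR Alerts config file failed to load: open /var/configs/alerts/alerts.json: "
    "no such file or directory",
    "ERR CostModel.ComputeAllocation: query context error Errors:",
]

if __name__ == "__main__":
    for category, count in summarize(SAMPLE).most_common():
        print(f"{count:3d}  {category}")
```

To run it against real output, the lines from the kubectl logs command above could be passed to summarize() in place of SAMPLE.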

Steps to reproduce

  1. helm install
  2. Spot datafeed setup
  3. AWS Cloud Billing Integration

Expected behavior

The full diagnostics page should be accessible.

Impact

No response

Screenshots

Logs

No response

Slack discussion

No response

Troubleshooting

dwbrown2 commented 4 months ago

@jessegoodier @AjayTripathy or others will likely be able to provide more detailed troubleshooting recommendations, but it looks like your Prometheus isn't reachable. What's the status of that pod?

githubeto commented 4 months ago

Prometheus pod is running.

jessegoodier commented 4 months ago

Do you have network policies that prevent communication between pods? Also, is anything else running in this cluster that has a networking issue? @githubeto

githubeto commented 4 months ago

@jessegoodier The cluster has Istio installed, but neither an AuthorizationPolicy nor a NetworkPolicy is applied, and no other resources control inter-Pod communication. Are there no detailed logs when the connection fails? Is it a matter of the log level? I would expect the failure to appear in the logs.

jessegoodier commented 4 months ago

You can try a curl from the frontend:

kubectl exec -i -t -n kubecost deployments/kubecost-cost-analyzer -c cost-analyzer-frontend -- curl http://kubecost-prometheus-server.kubecost

You should get: <a href="/graph">Found</a>

You can also try a curl to other pods, perhaps Grafana:

curl http://kubecost-grafana.kubecost

jessegoodier commented 4 months ago

Because Kubecost does not block traffic, I would not expect any logs, other than the communication failures you are seeing.

Do you have another cluster to test on to rule out other issues?
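
The failures in the logs are all "connection refused", which means the TCP connection attempt was actively rejected rather than silently dropped (a dropped packet would normally show up as a timeout instead). The sketch below is a minimal probe that tells those outcomes apart. It assumes it is run from some pod in the cluster that has Python available, which may not be true of the Kubecost containers themselves. The function name and return values are made up for this example.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and report how it ended.

    Returns 'open', 'refused', 'timeout', or 'error: <details>'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or something in front of it) rejected the connection.
        return "refused"
    except socket.timeout:
        # No answer within the timeout, e.g. packets dropped on the way.
        return "timeout"
    except OSError as exc:
        # DNS failures and other socket errors end up here.
        return f"error: {exc}"
```

For the service in this issue, the call would be probe("kubecost-prometheus-server.kubecost", 80). Comparing the result from different pods may help show whether the failure is specific to one container.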

githubeto commented 4 months ago

@jessegoodier The curl to kubecost-prometheus-server returned the expected response, "Found". As the Helm values show, Grafana is disabled, so I could not check it.

We have no clusters without Istio, so it is difficult to verify this on another cluster.

jessegoodier commented 4 months ago

We do not have any other reports of this.

I don't have any other ideas here. It is very strange that the test command works but the cost-model container cannot communicate.

githubeto commented 4 months ago

@jessegoodier @AjayTripathy

While closely monitoring the browser access logs, I found an interesting error. Could it be the reason the Full Diagnostics screen cannot be displayed?

The error appears to be a 403 (rate limit) returned when accessing https://api.github.com/repositories/178079595/releases or https://api.github.com/repos/kubecost/cost-model/releases.

Response from 178079595/releases:

{
    "message": "API rate limit exceeded for xx.xx.xx.xx. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
    "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"
}
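
GitHub reports rate limiting through the response status and the x-ratelimit-remaining and x-ratelimit-reset headers, in addition to the message shown above. The sketch below is one way to interpret such a response. The function name is hypothetical, and the header values used in the example are invented for illustration since the thread does not show the real headers. Whether this 403 is what keeps the diagnostics page from loading is not confirmed in this thread.

```python
from datetime import datetime, timezone


def rate_limit_info(status: int, headers: dict, body: dict):
    """Return (limited, reset_time) for a GitHub REST API response.

    limited is True when the response looks like a rate-limit rejection.
    reset_time is a UTC datetime if x-ratelimit-reset is present, else None.
    """
    # Header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}
    limited = status in (403, 429) and (
        h.get("x-ratelimit-remaining") == "0"
        or "rate limit exceeded" in body.get("message", "").lower()
    )
    reset_time = None
    if "x-ratelimit-reset" in h:
        # The header carries the reset time as UTC epoch seconds.
        reset_time = datetime.fromtimestamp(int(h["x-ratelimit-reset"]), tz=timezone.utc)
    return limited, reset_time


if __name__ == "__main__":
    # Body taken from this issue; header values are invented for the example.
    body = {"message": "API rate limit exceeded for xx.xx.xx.xx."}
    headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1716359000"}
    limited, reset_time = rate_limit_info(403, headers, body)
    print("rate limited:", limited)
    print("resets at:", reset_time.isoformat() if reset_time else "unknown")
```

If limited is True, requests from the same IP address should succeed again after the reset time. Authenticated requests get a higher limit, as the message itself notes.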