kubecost / cost-analyzer-helm-chart

Kubecost helm chart
http://kubecost.com/install
Apache License 2.0

[Bug] Kubecost failing to fetch data: error querying DB #3685

Closed ohayak closed 3 weeks ago

ohayak commented 4 weeks ago

Kubecost Helm Chart Version

2.4.0

Kubernetes Version

1.30

Kubernetes Platform

EKS

Description

After upgrading from 2.3.5 to 2.4.0, my Kubecost UI is encountering 500 errors from the backend:

{
    "code": 500,
    "data": null,
    "message": "error querying DB: error querying via querysvc query Failure: Binder Error: Values list \"t\" does not have a column named \"RatedClusterManagementCostShared\"\nLINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...\n                                                  ^"
}
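
For triage, the missing column can be pulled out of a response like the one above programmatically. This is a minimal sketch, not part of Kubecost: the helper name `missing_column` and the regular expression are assumptions derived only from the message format shown in this report, and the `LINE 1` excerpt is shortened.

```python
import json
import re

# Response body copied from this report (the "LINE 1" excerpt is shortened).
RESPONSE = r'''{
    "code": 500,
    "data": null,
    "message": "error querying DB: error querying via querysvc query Failure: Binder Error: Values list \"t\" does not have a column named \"RatedClusterManagementCostShared\"\nLINE 1: ..."
}'''

# Hypothetical pattern, based only on the message text quoted above.
BINDER_RE = re.compile(
    r'Binder Error: Values list "(?P<alias>[^"]+)" '
    r'does not have a column named "(?P<column>[^"]+)"'
)


def missing_column(body):
    """Return (alias, column) for a missing-column Binder Error, else None."""
    payload = json.loads(body)
    match = BINDER_RE.search(payload.get("message") or "")
    if match is None:
        return None
    return match.group("alias"), match.group("column")


print(missing_column(RESPONSE))
```

Running it against other failing endpoints would show whether they all complain about the same column or about several.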

Prometheus Integration: OK
Cloud Integration: OK

Steps to reproduce

  1. I deleted all previously generated result reports from S3.
  2. I upgraded my Helm chart from 2.3.5 to 2.4.0 with the following values:
    clusterController:
      enabled: true
      image:
        repository: public.ecr.aws/kubecost/cluster-controller
      kubescaler:
        resizeAllDefault: true
    extraObjects:
    - apiVersion: v1
      data:
        tls.crt: REDACTED
        tls.key: REDACTED
      kind: Secret
      metadata:
        name: webhook-server-tls
      type: kubernetes.io/tls
    forecasting:
      fullImageName: public.ecr.aws/kubecost/kubecost-modeling:v0.1.16
    global:
      grafana:
        domainName: grafana.kube-monitor
        enabled: false
        fqdn: grafana.kube-monitor
        proxy: false
        scheme: http
      mimirProxy:
        enabled: true
        mimirEndpoint: http://mimir-nginx.mimir.svc
        orgIdentifier: ""
      notifications:
        alertmanager:
          enabled: false
          fqdn: http://mimir-nginx.mimir.svc/alertmanager
      prometheus:
        enabled: false
        fqdn: http://kubecost-cost-analyzer-mimir-proxy:8085/prometheus
        insecureSkipVerify: true
    grafana:
      sidecar:
        dashboards:
          annotations:
            grafana_dashboard_folder: KubeCost
          enabled: true
        datasources:
          enabled: false
    ingress:
      annotations:
        alb.ingress.kubernetes.io/group.name: example
        alb.ingress.kubernetes.io/healthcheck-port: tcp-frontend
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80,"HTTPS":443}]'
        alb.ingress.kubernetes.io/scheme: internal
        alb.ingress.kubernetes.io/ssl-redirect: "443"
        alb.ingress.kubernetes.io/target-type: ip
        kubernetes.io/ingress.class: alb
      enabled: true
      hosts:
      - kubecost.123456789000.eu.aws.example.com
      path: /
      pathType: Prefix
    kubecostAdmissionController:
      caBundle: REDACTED
      enabled: true
    kubecostAggregator:
      extraEnv:
      - name: GRAFANA_ENABLED
        value: "true"
    kubecostFrontend:
      image: public.ecr.aws/kubecost/frontend
    kubecostModel:
      image: public.ecr.aws/kubecost/cost-model
    kubecostProductConfigs:
      athenaBucketName: s3://example-kubecost-results-20240423161914359500000003
      athenaDatabase: athenacurcfn_kubecost
      athenaProjectID: "123456789000"
      athenaRegion: eu-west-1
      athenaTable: kubecost
      carbonEstimates:
        enabled: true
      clusterName: example
      defaultIdle: true
      gpuLabel: nvidia.com/gpu
      gpuLabelValue: true
      grafanaURL: https://grafana.123456789000.eu.aws.example.com
      labelMappingConfigs:
        cluster_external_label: Application
        enabled: true
        namespace_external_label: Application/Tier
      projectID: "123456789000"
      shareTenancyCosts: true
      sharedNamespaces: default,kube-system,kube-public,kube-node-lease,crossplane-system,kubecost,grafana,mimir,loki,alloy,argocd,argowf,external-dns,cert-manager,external-secrets,api,front,redis,kube-monitor,lgtm-monitor,meta-monitor,kuik-system
      sharedOverhead: null
      spotLabel: karpenter.sh/capacity-type
      spotLabelValue: spot
    networkCosts:
      config:
        services:
          amazon-web-services: true
      enabled: true
      image:
        repository: public.ecr.aws/kubecost/kubecost-network-costs
        tag: v0.17.6
    oidc:
      authURL: https://datastudio-dev.auth.eu-west-1.amazoncognito.com/oauth2/authorize?client_id=REDACTED&response_type=code&scope=openid+aws.cognito.signin.user.admin&redirect_uri=https%3A%2F%2Fkubecost.123456789000.eu.aws.example.com/model/oidc/authorize
      clientID: REDACTED
      clientSecret: REDACTED
      discoveryURL: https://cognito-idp.eu-west-1.amazonaws.com/eu-west-REDACTED/.well-known/openid-configuration
      enabled: true
      loginRedirectURL: https://kubecost.123456789000.eu.aws.example.com/model/oidc/authorize
      rbac:
        enabled: false
        groups:
        - claimName: custom:groups
          claimValues:
          - ADMIN
          enabled: true
          name: admin
        - enabled: false
          name: readonly
      secretName: kubecost-oidc-secret
      useIDToken: false
    persistentVolume:
      annotations:
        argocd.argoproj.io/sync-options: Prune=false
        helm.sh/resource-policy: keep
    prometheus:
      configmapReload:
        prometheus:
          image:
            repository: public.ecr.aws/kubecost/prometheus-config-reloader
      server:
        global:
          external_labels:
            cluster_id: example
        image:
          repository: public.ecr.aws/kubecost/prometheus
    prometheusRule:
      enabled: true
    reporting:
      productAnalytics: false
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789000:role/example-kubecost-irsa-role
      create: true
      name: kubecost
    serviceMonitor:
      enabled: true
      networkCosts:
        enabled: true
    upgrade:
      toV2: true

Expected behavior

Kubecost returns my monitoring data.

Impact

No response

Screenshots

Screenshot 2024-09-25 at 13 50 10 Screenshot 2024-09-25 at 13 39 46 Screenshot 2024-09-25 at 13 48 51

Logs

kubecost-controller (the following error repeats):

2024-09-25T11:40:01Z ERR Run loop attempt failed consecutively. Waiting a bit longer before retrying. error="failed to create sizing patch for workload 'crossplane-system/deployment/crossplane-rbac-manager': failed to get recommendation for '{Namespace:crossplane-system ControllerKind:deployment ControllerName:crossplane-rbac-manager ContainerName:crossplane}' '{Window:2d TargetUtilizationCPU:0xc001384838 TargetUtilizationMemory:0xc001384840}': failed to get local cluster ID: non-OK status code (403), body: "

Kubecost-cost-model:

2024-09-25T11:45:00.08096136Z WRN got error could not obtain latest valid asset set for metric localDisks%%Production, not adding to cache
2024-09-25T11:45:00.098005903Z ERR unable to get most recent valid asset set: could not obtain latest valid asset set
2024-09-25T11:45:00.098074787Z WRN got error could not obtain latest valid asset set for metric localDisks%%High-Availability, not adding to cache
2024/09/25 11:45:05 http: TLS handshake error from 10.152.192.66:46770: remote error: tls: bad certificate
2024/09/25 11:45:06 http: TLS handshake error from 10.152.192.66:46782: remote error: tls: bad certificate
2024/09/25 11:45:06 http: TLS handshake error from 10.152.192.66:46792: remote error: tls: bad certificate
2024/09/25 11:45:06 http: TLS handshake error from 10.152.192.66:46802: remote error: tls: bad certificate
2024/09/25 11:45:06 http: TLS handshake error from 10.152.192.66:46806: remote error: tls: bad certificate
2024-09-25T11:45:06.733764075Z INF Unable to get X-API-KEY header
2024-09-25T11:45:06.733924809Z WRN No token cookie set
2024-09-25T11:45:06.733971569Z INF Unable to get Authorization header
2024-09-25T11:45:06.734087025Z WRN error in token validation: cannot find Authorization header
2024-09-25T11:45:06.734159106Z WRN error in token validation: cannot parse token from request: no token for request GET /clusterInfo HTTP/1.1 in cookie or header
2024/09/25 11:45:06 http: TLS handshake error from 10.152.192.66:46816: remote error: tls: bad certificate
2024/09/25 11:45:06 http: TLS handshake error from 10.152.192.66:46832: remote error: tls: bad certificate
2024/09/25 11:45:07 http: TLS handshake error from 10.152.192.66:46844: remote error: tls: bad certificate
2024-09-25T11:45:36.870523034Z INF Unable to get X-API-KEY header
2024-09-25T11:45:36.871373067Z WRN No token cookie set
2024-09-25T11:45:36.871528327Z INF Unable to get Authorization header
2024-09-25T11:45:36.87164215Z WRN error in token validation: cannot find Authorization header
2024-09-25T11:45:36.871716962Z WRN error in token validation: cannot parse token from request: no token for request GET /clusterInfo HTTP/1.1 in cookie or header
2024-09-25T11:46:07.020947872Z INF Unable to get X-API-KEY header
2024-09-25T11:46:07.021009388Z WRN No token cookie set
2024-09-25T11:46:07.021022977Z INF Unable to get Authorization header
2024-09-25T11:46:07.021033681Z WRN error in token validation: cannot find Authorization header
2024-09-25T11:46:07.02105918Z WRN error in token validation: cannot parse token from request: no token for request GET /clusterInfo HTTP/1.1 in cookie or header

Kubecost-cost-analyzer: Same 500 errors as in the UI

Aggregator logs:

LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
                                                  ^
2024-09-25T14:42:05.298045083Z INF token is unverifiable: signing method RS256 is invalid
2024-09-25T14:42:05.298230572Z WRN Share is non-nil (AST: N/A) and SharedNamespaces or SharedLabels is filled out. Waterfowl does not support arbitrary Share, so it will be ignored under the assumption that it is supposed to capture the values of SharedNamespaces ([default kube-system kube-public kube-node-lease crossplane-system kubecost grafana mimir loki alloy argocd argowf external-dns cert-manager external-secrets api front redis kube-monitor lgtm-monitor meta-monitor kuik-system]) and SharedLabels (map[]).
2024-09-25T14:42:05.315828359Z INF token is unverifiable: signing method RS256 is invalid
2024-09-25T14:42:05.316028267Z WRN Share is non-nil (AST: N/A) and SharedNamespaces or SharedLabels is filled out. Waterfowl does not support arbitrary Share, so it will be ignored under the assumption that it is supposed to capture the values of SharedNamespaces ([default kube-system kube-public kube-node-lease crossplane-system kubecost grafana mimir loki alloy argocd argowf external-dns cert-manager external-secrets api front redis kube-monitor lgtm-monitor meta-monitor kuik-system]) and SharedLabels (map[]).
2024-09-25T14:42:05.325852892Z INF Unable to get X-API-KEY header
2024-09-25T14:42:05.358721873Z ERR error querying via querysvc query Failure: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
                                                  ^
2024-09-25T14:42:05.375779896Z ERR error querying via querysvc query Failure: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
                                                  ^
2024-09-25T14:42:05.389344786Z INF Unable to get X-API-KEY header
2024-09-25T14:42:05.436108515Z INF token is unverifiable: signing method RS256 is invalid
2024-09-25T14:42:05.49596537Z ERR error querying via querysvc querying summary total failed with err: failed to execute query QuerySummaryTotal: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
                                                  ^
2024-09-25T14:42:05.504818204Z INF token is unverifiable: signing method RS256 is invalid
2024-09-25T14:42:05.505362506Z WRN Share is non-nil (AST: N/A) and SharedNamespaces or SharedLabels is filled out. Waterfowl does not support arbitrary Share, so it will be ignored under the assumption that it is supposed to capture the values of SharedNamespaces ([default kube-system kube-public kube-node-lease crossplane-system kubecost grafana mimir loki alloy argocd argowf external-dns cert-manager external-secrets api front redis kube-monitor lgtm-monitor meta-monitor kuik-system]) and SharedLabels (map[]).
2024-09-25T14:42:05.580252964Z ERR error querying via querysvc query Failure: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
                                                  ^
2024-09-25T14:42:05.718360684Z INF Unable to get X-API-KEY header
2024-09-25T14:42:05.825151745Z INF token is unverifiable: signing method RS256 is invalid
2024-09-25T14:42:05.825431853Z WRN Share is non-nil (AST: N/A) and SharedNamespaces or SharedLabels is filled out. Waterfowl does not support arbitrary Share, so it will be ignored under the assumption that it is supposed to capture the values of SharedNamespaces ([default kube-system kube-public kube-node-lease crossplane-system kubecost grafana mimir loki alloy argocd argowf external-dns cert-manager external-secrets api front redis kube-monitor lgtm-monitor meta-monitor kuik-system]) and SharedLabels (map[]).
2024-09-25T14:42:06.009143845Z ERR error querying via querysvc query Failure: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
                                                  ^
2024-09-25T14:42:06.105683279Z ERR error querying via querysvc query Failure: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
                                                  ^
2024-09-25T14:42:06.237199482Z ERR error querying via querysvc query Failure: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
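
Because the same few messages repeat many times in the aggregator log above, counting them by level and message makes the dominant failure easier to see. This is a minimal sketch: the sample is a handful of lines copied from the log above, and the `summarize` helper and its timestamp pattern are assumptions based on the line format seen here, not a Kubecost tool.

```python
import re
from collections import Counter

# A few lines copied verbatim from the aggregator log in this report.
LOG = """\
2024-09-25T14:42:05.298045083Z INF token is unverifiable: signing method RS256 is invalid
2024-09-25T14:42:05.325852892Z INF Unable to get X-API-KEY header
2024-09-25T14:42:05.358721873Z ERR error querying via querysvc query Failure: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
LINE 1: ...ficient AS RatedPVSharedCost, COALESCE(t.RatedClusterManagementCostShared / to...
2024-09-25T14:42:05.375779896Z ERR error querying via querysvc query Failure: Binder Error: Values list "t" does not have a column named "RatedClusterManagementCostShared"
2024-09-25T14:42:05.436108515Z INF token is unverifiable: signing method RS256 is invalid
"""

# Assumed line shape: "<RFC 3339 timestamp> <3-letter level> <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\S+Z) (?P<level>[A-Z]{3}) (?P<msg>.*)$"
)


def summarize(log):
    """Count log entries by (level, message), ignoring timestamps.

    Continuation lines without a timestamp (the SQL excerpt and caret
    lines) are skipped.
    """
    counts = Counter()
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("level"), match.group("msg"))] += 1
    return counts


for (level, msg), n in summarize(LOG).most_common():
    print(f"{n:3d} {level} {msg[:80]}")
```

Feeding it the full output of `kubectl logs` for the aggregator container instead of the sample would give the same kind of summary for the whole run.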

When scrolling up through the aggregator logs, I saw this error a couple of times after the first call to the UI:

2024-09-25T14:54:16.729777179Z INF token is unverifiable: signing method RS256 is invalid
2024-09-25T14:54:17.779358278Z INF Unable to get X-API-KEY header
2024/09/25 14:54:17 http: panic serving 10.152.195.91:5391: runtime error: invalid memory address or nil pointer dereference
goroutine 1673 [running]:
net/http.(*conn).serve.func1()
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:1903 +0xbe
panic({0x49d9ba0?, 0x8aaf6b0?})
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/kubecost/kubecost-cost-model/pkg/duckdb/ingest.(*IngestStats).RefreshModel(0xc0017ebbb8?, 0xc00147d288?)
    /app/kubecost-cost-model/pkg/duckdb/ingest/ingeststats.go:490 +0x23
github.com/kubecost/kubecost-cost-model/pkg/duckdb/write.(*IngestorDebug).Refresh(0xc0017ebb00)
    /app/kubecost-cost-model/pkg/duckdb/write/ingestor.go:193 +0x25
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.(*Orchestrator).DebugInfo(0xc001798700, {0x68d2e68?, 0xc0019c4000?})
    /app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:571 +0x24c
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.(*Orchestrator).OrchestratorDebugHandler(0xc001798700, {0x68c0fd0, 0xc00438d8f0}, 0xc0042fb200, {0xc001edad70?, 0xc004731cab?, 0x0?})
    /app/kubecost-cost-model/pkg/duckdb/orchestrator/handlers.go:54 +0x113
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0014f96e0, {0x68c0fd0, 0xc00438d8f0}, 0xc0042fb200)
    /root/go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0x7eb
net/http.(*ServeMux).ServeHTTP(0xc0042cdf78?, {0x68c0fd0, 0xc00438d8f0}, 0xc0042fb200)
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:2688 +0x1ad
github.com/opencost/opencost/pkg/metrics.ResponseMetricMiddleware.func1({0x68c0bb0, 0xc002c7db20}, 0xc0042fb200)
    /app/opencost/pkg/metrics/httpmetricmiddleware.go:25 +0x10f
net/http.HandlerFunc.ServeHTTP(0xc002004d80?, {0x68c0bb0?, 0xc002c7db20?}, 0xc0042fb200?)
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:2171 +0x29
github.com/kubecost/kubecost-cost-model/pkg/cmd/waterfowl.Execute.(*Cors).Handler.func133({0x68c0bb0, 0xc002c7db20}, 0xc0042fb200)
    /root/go/pkg/mod/github.com/rs/cors@v1.8.2/cors.go:231 +0x184
net/http.HandlerFunc.ServeHTTP(0xb0bdb6?, {0x68c0bb0?, 0xc002c7db20?}, 0x0?)
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:2171 +0x29
github.com/kubecost/kubecost-cost-model/pkg/cmd/waterfowl.Execute.PanicHandlerMiddleware.func136({0x68c0bb0?, 0xc002c7db20?}, 0xc003e3dd60?)
    /app/opencost/pkg/errors/panic.go:76 +0x78
net/http.HandlerFunc.ServeHTTP(0xc0042fb200?, {0x68c0bb0?, 0xc002c7db20?}, 0xc005508140?)
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:2171 +0x29
github.com/kubecost/kubecost-cost-model/pkg/cmd/waterfowl.Execute.AuthOIDCMiddleware.func147({0x68c0bb0, 0xc002c7db20}, 0xc0042fb200)
    /app/kubecost-cost-model/pkg/auth/oidcauth.go:123 +0xfb
net/http.HandlerFunc.ServeHTTP(0x4994b20?, {0x68c0bb0?, 0xc002c7db20?}, 0x13?)
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:2171 +0x29
github.com/kubecost/kubecost-cost-model/pkg/gating.ScaleGatingMiddleware.func1({0x68c0bb0, 0xc002c7db20}, 0xc0042fb200)
    /app/kubecost-cost-model/pkg/gating/middleware.go:83 +0xa5c
net/http.HandlerFunc.ServeHTTP(0xc003e7efc0?, {0x68c0bb0?, 0xc002c7db20?}, 0xc0042fb200?)
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:2171 +0x29
github.com/kubecost/kubecost-cost-model/pkg/cmd/waterfowl.Execute.(*Cors).Handler.func149({0x68c0bb0, 0xc002c7db20}, 0xc0042fb200)
    /root/go/pkg/mod/github.com/rs/cors@v1.8.2/cors.go:231 +0x184
net/http.HandlerFunc.ServeHTTP(0x6bd679?, {0x68c0bb0?, 0xc002c7db20?}, 0xc001ce4b68?)
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:2171 +0x29
net/http.serverHandler.ServeHTTP({0xc00438c3f0?}, {0x68c0bb0?, 0xc002c7db20?}, 0x6?)
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:3142 +0x8e
net/http.(*conn).serve(0xc003d71c20, {0x68d2e30, 0xc003ecb920})
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:2044 +0x5e8
created by net/http.(*Server).Serve in goroutine 1
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:3290 +0x4b4
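
In a Go panic trace like the one above, the frame listed directly below `panic(...)` is the one that was executing when the panic was raised. The sketch below extracts that frame from the first few lines of this trace; the helper name `frame_after_panic` is hypothetical and the parsing assumes the two-lines-per-frame layout seen here.

```python
import re

# Top of the stack trace from this report (frames below Refresh omitted).
TRACE = """\
goroutine 1673 [running]:
net/http.(*conn).serve.func1()
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/net/http/server.go:1903 +0xbe
panic({0x49d9ba0?, 0x8aaf6b0?})
    /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.7.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/kubecost/kubecost-cost-model/pkg/duckdb/ingest.(*IngestStats).RefreshModel(0xc0017ebbb8?, 0xc00147d288?)
    /app/kubecost-cost-model/pkg/duckdb/ingest/ingeststats.go:490 +0x23
github.com/kubecost/kubecost-cost-model/pkg/duckdb/write.(*IngestorDebug).Refresh(0xc0017ebb00)
    /app/kubecost-cost-model/pkg/duckdb/write/ingestor.go:193 +0x25
"""

# Location lines look like "<file>:<line> +0x<offset>".
LOCATION_RE = re.compile(r"^(?P<file>\S+):(?P<line>\d+)")


def frame_after_panic(trace):
    """Return (function, file, line) for the frame listed right below panic()."""
    lines = [line.strip() for line in trace.splitlines()]
    for i, line in enumerate(lines):
        if line.startswith("panic(") and i + 3 < len(lines):
            # lines[i + 1] is panic's own location; the next pair is the caller.
            function = lines[i + 2].rpartition("(")[0]  # drop the argument list
            match = LOCATION_RE.match(lines[i + 3])
            if match:
                return function, match.group("file"), int(match.group("line"))
    return None


print(frame_after_panic(TRACE))
```

For this trace, the result points at `(*IngestStats).RefreshModel` in `ingeststats.go`, line 490, reached through the orchestrator debug handler.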

Slack discussion

No response

Troubleshooting

baileydoestech commented 4 weeks ago

We are also seeing this issue after upgrading to 2.4.0

As far as I understand, the release notes state that downgrading to 2.3.5 is not possible so any advice would be appreciated here.

Currently our kubecost UI is unusable

cliffcolvin commented 4 weeks ago

@baileydoestech and @ohayak we have identified the issue here and are resolving it with a patch. I'll be cutting 2.4.1-rc.1 in just a moment and will release it once it's tested.

There is a backwards-compatible 2.3.5 image, gcr.io/kubecost1/cost-model:2.3.5-compat-with-2.4.0, that you can use as a temporary workaround, but I will make sure we test and get this 2.4.1 patch out as soon as possible.

cliffcolvin commented 3 weeks ago

@baileydoestech and @ohayak can you try 2.4.1 and let me know if this resolves your issues? I'm going to close the issue now, but if the problems continue we can reopen it or create a new issue, whichever fits.

baileydoestech commented 3 weeks ago

Working now, thanks.