kubecost / features-bugs

A public repository for filing Kubecost feature requests and bugs. Please read the issue guidelines before filing an issue here.

[Bug] With VictoriaMetrics as the Prometheus FQDN, the Pod goes into CrashLoop. #67

Open githubeto opened 5 months ago

githubeto commented 5 months ago

Kubecost Helm Chart Version

2.2.2

Kubernetes Version

1.26

Kubernetes Platform

EKS

Description

I installed Kubecost via Helm, but the cost-analyzer Pod remains in CrashLoopBackOff. Where could the cause be?

kubectl get svc -n vm

NAME                                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
vm-stack-grafana                                     ClusterIP   172.20.211.192   <none>        80/TCP                       79d
vm-stack-kube-state-metrics                          ClusterIP   172.20.237.153   <none>        8080/TCP                     79d
vm-stack-prometheus-node-exporter                    ClusterIP   172.20.50.30     <none>        9100/TCP                     79d
vm-stack-victoria-metrics-operator                   ClusterIP   172.20.142.249   <none>        8080/TCP,443/TCP             79d
vmagent-vm-stack-victoria-metrics-k8s-stack          ClusterIP   172.20.52.16     <none>        8429/TCP                     79d
vmalert-vm-stack-victoria-metrics-k8s-stack          ClusterIP   172.20.105.118   <none>        8080/TCP                     79d
vmalertmanager-vm-stack-victoria-metrics-k8s-stack   ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   79d
vminsert-vm-stack-victoria-metrics-k8s-stack         ClusterIP   172.20.33.81     <none>        8480/TCP                     79d
vmselect-vm-stack-victoria-metrics-k8s-stack         ClusterIP   None             <none>        8481/TCP                     79d
vmstorage-vm-stack-victoria-metrics-k8s-stack        ClusterIP   None             <none>        8482/TCP,8400/TCP,8401/TCP   79d

kubectl describe pod kubecost-cost-analyzer-f87bcdff6-thdfj

Name:             kubecost-cost-analyzer-f87bcdff6-thdfj
Namespace:        kubecost
Priority:         0
Service Account:  kubecost-cost-analyzer
Node:             ip-10-219-182-8.ap-northeast-1.compute.internal/10.219.182.8
Start Time:       Thu, 25 Apr 2024 12:41:37 +0900
Labels:           app=cost-analyzer
app.kubernetes.io/instance=kubecost
app.kubernetes.io/name=cost-analyzer
helm-rollout-restarter=Pn98x
pod-template-hash=f87bcdff6
Annotations:      <none>
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.219.183.82
IPs:
IP:           10.219.183.82
Controlled By:  ReplicaSet/kubecost-cost-analyzer-f87bcdff6
Containers:
cost-model:
Container ID:   containerd://0436cebbb21e77b1ab728946865bc1a7fb90231cf5d07f94ce0745591ed97935
Image:          gcr.io/kubecost1/cost-model:prod-2.2.2
Image ID:       gcr.io/kubecost1/cost-model@sha256:5f2f478de00ee6f4a331818eea0f5a9f3adafa8fdd922d0555eeacea3f3c0eee
Ports:          9003/TCP, 9090/TCP
Host Ports:     0/TCP, 0/TCP
State:          Running
Started:      Thu, 25 Apr 2024 12:41:52 +0900
Ready:          True
Restart Count:  0
Requests:
cpu:      200m
memory:   55Mi
Liveness:   http-get http://:9003/healthz delay=10s timeout=1s period=10s #success=1 #failure=200
Readiness:  http-get http://:9003/healthz delay=10s timeout=1s period=10s #success=1 #failure=200
Environment:
GRAFANA_ENABLED:                            false
HELM_VALUES:                                ---
PROMETHEUS_SERVER_ENDPOINT:                 <set to the key 'prometheus-server-endpoint' of config map 'kubecost-cost-analyzer'>  Optional: false
CLOUD_COST_ENABLED:                         false
CLOUD_PROVIDER_API_KEY:                     AIzaSyDXQPG_MHUEy9neR7stolq6l0ujXmjJlvk
CONFIG_PATH:                                /var/configs/
DB_PATH:                                    /var/db/
CLUSTER_PROFILE:                            production
EMIT_POD_ANNOTATIONS_METRIC:                false
EMIT_NAMESPACE_ANNOTATIONS_METRIC:          false
EMIT_KSM_V1_METRICS:                        true
EMIT_KSM_V1_METRICS_ONLY:                   false
LOG_COLLECTION_ENABLED:                     true
PRODUCT_ANALYTICS_ENABLED:                  true
ERROR_REPORTING_ENABLED:                    true
VALUES_REPORTING_ENABLED:                   true
SENTRY_DSN:                                 https://71964476292e4087af8d5072afe43abd@o394722.ingest.sentry.io/5245431
LEGACY_EXTERNAL_API_DISABLED:               false
OUT_OF_CLUSTER_PROM_METRICS_ENABLED:        false
CACHE_WARMING_ENABLED:                      false
SAVINGS_ENABLED:                            true
ETL_ENABLED:                                true
ETL_STORE_READ_ONLY:                        false
ETL_CLOUD_USAGE_ENABLED:                    false
CLOUD_ASSETS_EXCLUDE_PROVIDER_ID:           false
ETL_RESOLUTION_SECONDS:                     300
ETL_MAX_PROMETHEUS_QUERY_DURATION_MINUTES:  1440
ETL_DAILY_STORE_DURATION_DAYS:              91
ETL_HOURLY_STORE_DURATION_HOURS:            49
ETL_WEEKLY_STORE_DURATION_WEEKS:            53
ETL_FILE_STORE_ENABLED:                     true
ETL_ASSET_RECONCILIATION_ENABLED:           true
ETL_USE_UNBLENDED_COST:                     false
CONTAINER_STATS_ENABLED:                    true
RECONCILE_NETWORK:                          true
KUBECOST_METRICS_POD_ENABLED:               false
PV_ENABLED:                                 true
MAX_QUERY_CONCURRENCY:                      5
UTC_OFFSET:                                 +00:00
CLUSTER_ID:                                 cluster-one
COST_EVENTS_AUDIT_ENABLED:                  false
RELEASE_NAME:                               kubecost
KUBECOST_NAMESPACE:                         kubecost
POD_NAME:                                   kubecost-cost-analyzer-f87bcdff6-thdfj (v1:metadata.name)
KUBECOST_TOKEN:                             <set to the key 'kubecost-token' of config map 'kubecost-cost-analyzer'>  Optional: false
WATERFOWL_ENABLED:                          true
DIAGNOSTICS_RUN_IN_COST_MODEL:              false
Mounts:
/var/configs from persistent-configs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w6sck (ro)
cost-analyzer-frontend:
Container ID:   containerd://9eeb78b79a8451fad04701982aba82140a9cb417968ae254cd97bf5e96ecb3c5
Image:          gcr.io/kubecost1/frontend:prod-2.2.2
Image ID:       gcr.io/kubecost1/frontend@sha256:196f0a8b847af6d8d111918ce91859a1dc1fd3fb593d0ac397db453a8090af64
Port:           <none>
Host Port:      <none>
State:          Waiting
Reason:       CrashLoopBackOff
Last State:     Terminated
Reason:       Error
Exit Code:    1
Started:      Thu, 25 Apr 2024 12:44:59 +0900
Finished:     Thu, 25 Apr 2024 12:44:59 +0900
Ready:          False
Restart Count:  5
Requests:
cpu:      10m
memory:   55Mi
Liveness:   http-get http://:9003/healthz delay=1s timeout=1s period=5s #success=1 #failure=6
Readiness:  http-get http://:9003/healthz delay=1s timeout=1s period=5s #success=1 #failure=6
Environment:
GET_HOSTS_FROM:  dns
Mounts:
/etc/nginx/conf.d/ from nginx-conf (rw)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w6sck (ro)
aggregator:
Container ID:  containerd://e95ce3ff496b5875c61e205fce23a5dc9597956049263f7bf2148c9b06a6ee1c
Image:         gcr.io/kubecost1/cost-model:prod-2.2.2
Image ID:      gcr.io/kubecost1/cost-model@sha256:5f2f478de00ee6f4a331818eea0f5a9f3adafa8fdd922d0555eeacea3f3c0eee
Port:          9004/TCP
Host Port:     0/TCP
Args:
waterfowl
State:          Running
Started:      Thu, 25 Apr 2024 12:41:53 +0900
Ready:          True
Restart Count:  0
Readiness:      http-get http://:9004/healthz delay=10s timeout=1s period=10s #success=1 #failure=200
Environment:
CLUSTER_ID:                     cluster-one
NUM_DB_COPY_CHUNKS:             25
CONFIG_PATH:                    /var/configs/
ETL_ENABLED:                    false
CLOUD_PROVIDER_API_KEY:         AIzaSyDXQPG_MHUEy9neR7stolq6l0ujXmjJlvk
READ_ONLY:                      false
CUSTOM_COST_ENABLED:            false
DB_CONCURRENT_INGESTION_COUNT:  3
DB_READ_THREADS:                1
DB_WRITE_THREADS:               1
LOG_LEVEL:                      info
KUBECOST_NAMESPACE:             kubecost
Mounts:
/var/configs from persistent-configs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w6sck (ro)
cloud-cost:
Container ID:  containerd://6d5325d1a0c3d08fe7364bea82a780bff824f8b8a17b8a6849ceae350de87cc3
Image:         gcr.io/kubecost1/cost-model:prod-2.2.2
Image ID:      gcr.io/kubecost1/cost-model@sha256:5f2f478de00ee6f4a331818eea0f5a9f3adafa8fdd922d0555eeacea3f3c0eee
Port:          9005/TCP
Host Port:     0/TCP
Args:
cloud-cost
State:          Running
Started:      Thu, 25 Apr 2024 12:41:54 +0900
Ready:          True
Restart Count:  0
Readiness:      http-get http://:9005/healthz delay=10s timeout=1s period=10s #success=1 #failure=200
Environment:
CONFIG_PATH:                    /var/configs/
ETL_DAILY_STORE_DURATION_DAYS:  91
CLOUD_COST_REFRESH_RATE_HOURS:  6
CLOUD_COST_QUERY_WINDOW_DAYS:   7
CLOUD_COST_RUN_WINDOW_DAYS:     3
CUSTOM_COST_ENABLED:            false
CLOUD_COST_IS_INCLUDE_LIST:     false
CLOUD_COST_LABEL_LIST:
CLOUD_COST_TOP_N:               1000
Mounts:
/var/configs from persistent-configs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w6sck (ro)
Conditions:
Type              Status
Initialized       True
Ready             False
ContainersReady   False
PodScheduled      True
Volumes:
tmp:
Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit:  <unset>
nginx-conf:
Type:      ConfigMap (a volume populated by a ConfigMap)
Name:      nginx-conf
Optional:  false
persistent-configs:
Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName:  kubecost-cost-analyzer
ReadOnly:   false
kube-api-access-w6sck:
Type:                    Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds:  3607
ConfigMapName:           kube-root-ca.crt
ConfigMapOptional:       <nil>
DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type     Reason                  Age                    From                     Message
----     ------                  ----                   ----                     -------
Warning  FailedAttachVolume      4m3s                   attachdetach-controller  Multi-Attach error for volume "pvc-744cb938-a3eb-4d5b-a4de-7c1c277ada63" Volume is already used by pod(s) kubecost-cost-analyzer-c5f8b69f6-sw8vs
Normal   Scheduled               4m3s                   default-scheduler        Successfully assigned kubecost/kubecost-cost-analyzer-f87bcdff6-thdfj to ip-10-219-182-8.ap-northeast-1.compute.internal
Normal   SuccessfulAttachVolume  3m50s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-744cb938-a3eb-4d5b-a4de-7c1c277ada63"
Normal   Pulling                 3m49s                  kubelet                  Pulling image "gcr.io/kubecost1/cost-model:prod-2.2.2"
Normal   Pulled                  3m48s                  kubelet                  Successfully pulled image "gcr.io/kubecost1/frontend:prod-2.2.2" in 659.679635ms (659.695261ms including waiting)
Normal   Pulled                  3m48s                  kubelet                  Successfully pulled image "gcr.io/kubecost1/cost-model:prod-2.2.2" in 661.153338ms (661.168436ms including waiting)
Normal   Created                 3m48s                  kubelet                  Created container cost-model
Normal   Started                 3m48s                  kubelet                  Started container cost-model
Normal   Pulling                 3m47s                  kubelet                  Pulling image "gcr.io/kubecost1/cost-model:prod-2.2.2"
Normal   Created                 3m47s                  kubelet                  Created container aggregator
Normal   Started                 3m47s                  kubelet                  Started container aggregator
Normal   Pulling                 3m47s                  kubelet                  Pulling image "gcr.io/kubecost1/cost-model:prod-2.2.2"
Normal   Pulled                  3m47s                  kubelet                  Successfully pulled image "gcr.io/kubecost1/cost-model:prod-2.2.2" in 628.325558ms (628.334576ms including waiting)
Normal   Started                 3m46s                  kubelet                  Started container cloud-cost
Normal   Pulled                  3m46s                  kubelet                  Successfully pulled image "gcr.io/kubecost1/cost-model:prod-2.2.2" in 636.665922ms (636.681828ms including waiting)
Normal   Created                 3m46s                  kubelet                  Created container cloud-cost
Normal   Started                 3m45s (x2 over 3m47s)  kubelet                  Started container cost-analyzer-frontend
Normal   Created                 3m45s (x2 over 3m48s)  kubelet                  Created container cost-analyzer-frontend
Normal   Pulled                  3m45s                  kubelet                  Successfully pulled image "gcr.io/kubecost1/frontend:prod-2.2.2" in 579.980397ms (579.995788ms including waiting)
Warning  BackOff                 3m42s (x3 over 3m44s)  kubelet                  Back-off restarting failed container cost-analyzer-frontend in pod kubecost-cost-analyzer-f87bcdff6-thdfj_kubecost(3da88910-4745-4ca2-9410-7e12c30594f2)
Normal   Pulling                 3m29s (x3 over 3m48s)  kubelet                  Pulling image "gcr.io/kubecost1/frontend:prod-2.2.2"
Normal   Pulled                  3m29s                  kubelet                  Successfully pulled image "gcr.io/kubecost1/frontend:prod-2.2.2" in 642.180442ms (642.192646ms including waiting)

kubectl get pvc

NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
kubecost-cost-analyzer   Bound    pvc-744cb938-a3eb-4d5b-a4de-7c1c277ada63   32Gi       RWO            gp2            22m

As an additional note, we are installing Kubecost because we can already view metrics from Grafana by specifying "http://vmselect-vm-stack-victoria-metrics-k8s-stack.vm.svc:8481/select/0/prometheus" as the data source. Therefore, I am confident that there is no mistake in the FQDN I specified.
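For reference, this endpoint would be supplied to Kubecost through the chart's external-Prometheus values. A minimal sketch, assuming the standard `global.prometheus` keys (the actual values file was not posted in this issue):

```yaml
# Sketch only: point Kubecost at the existing VictoriaMetrics vmselect endpoint
# (the same URL that already works as a Grafana data source).
global:
  prometheus:
    enabled: false  # do not deploy the bundled Prometheus
    fqdn: http://vmselect-vm-stack-victoria-metrics-k8s-stack.vm.svc:8481/select/0/prometheus
```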


Steps to reproduce

  1. helm install

Expected behavior

The cost-analyzer Pod starts up successfully.

Logs

> kubectl logs deployment/kubecost-cost-analyzer -c cost-model | grep ERR

2024-04-25T03:41:53.447563687Z ERR Failed to lookup reserved instance data: No Athena Bucket configured
2024-04-25T03:41:53.447594779Z ERR Failed to lookup savings plan data: No Athena Bucket configured
2024-04-25T03:41:58.837139146Z ERR Alerts config file failed to load: open /var/configs/alerts/alerts.json: no such file or directory
2024-04-25T03:41:58.8372359Z ERR savings: cluster sizing: failed to get workload distributions: failed to query allocations: failed to query store: boundary error: requested [2024-04-24T00:00:00+0000, 2024-04-26T00:00:00+0000); supported [2024-04-25T03:41:58+0000, 2024-04-25T03:41:58+0000): AllocationStore[1d]:  store does not have coverage to perform query
2024-04-25T03:41:58.837514818Z ERR savings: finding abandoned workloads: failed to query store: boundary error: requested [2024-04-24T00:00:00+0000, 2024-04-26T00:00:00+0000); supported [2024-04-25T03:41:58+0000, 2024-04-25T03:41:58+0000): AllocationStore[1d]:  store does not have coverage to perform query
2024-04-25T03:41:58.837785169Z ERR unable to get most recent valid asset set: failed to query from assets for 2024-04-24 21:41:58.837751006 +0000 UTC/2024-04-25 03:41:58.837751006 +0000 UTC: boundary error: requested [2024-04-24T21:41:58+0000, 2024-04-25T03:41:58+0000); supported [2024-04-25T03:41:58+0000, 2024-04-25T03:41:58+0000): Store[1h]: store does not have coverage to perform query
2024-04-25T03:42:09.527104955Z ERR unable to get most recent valid asset set: could not obtain latest valid asset set


### Troubleshooting

- [X] I have read and followed the [issue guidelines](https://github.com/kubecost/cost-analyzer-helm-chart/blob/develop/ISSUE_GUIDELINES.md) and this is a bug impacting only the Helm chart.
- [X] I have searched other issues in this repository and mine is not recorded.

williameasiernetworks commented 5 months ago

@githubeto Have you reviewed this blog post?

Though it's usually easier to use the bundled prometheus we package in our helm chart.
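(For reference, the bundled Prometheus is the chart's default behavior; a minimal values sketch of that default, assuming the standard key name:)

```yaml
# Chart default: deploy and use the Prometheus bundled with the Kubecost chart,
# in which case no external fqdn needs to be set.
global:
  prometheus:
    enabled: true
```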

chipzoller commented 5 months ago

Does not appear to be an issue with the Helm chart. Transferred to the correct repository.

githubeto commented 5 months ago

@williamkubecost Thanks for your comment.

> @githubeto Have you reviewed this blog post?
>
> Though it's usually easier to use the bundled prometheus we package in our helm chart.

Yes, I have confirmed that it works with a combination of VictoriaMetrics and OpenCost HelmCharts, and we have completed operational verification. This time, the requirement is to use VictoriaMetrics, which is in the same cluster, instead of Prometheus.

nextopsvideos commented 5 months ago

@githubeto I have encountered the same issue. However, in my case, when I investigated the logs of the failing container "cost-analyzer-frontend", I found that it was unable to connect to Grafana. We need to update the Helm values file to point to the VictoriaMetrics Grafana instance.

`global:
  prometheus:
    enabled: false
    fqdn: http://vmsingle-k8s-stack.vm.svc:8429   <--- this is your vm service
  grafana:
    enabled: false
    domainName: vm-grafana.vm.svc   <--- this has to be updated accordingly`

After making these changes, the Kubecost pods started working normally.

githubeto commented 4 months ago

@nextopsvideos

> @githubeto I have encountered the same issue. However, in my case, when I investigated the logs of the failing container "cost-analyzer-frontend", I found that it was unable to connect to Grafana. We need to update the Helm values file to point to the VictoriaMetrics Grafana instance.
>
> `global:
>   prometheus:
>     enabled: false
>     fqdn: http://vmsingle-k8s-stack.vm.svc:8429   <--- this is your vm service
>   grafana:
>     enabled: false
>     domainName: vm-grafana.vm.svc   <--- this has to be updated accordingly`
>
> After making these changes, the Kubecost pods started working normally.

Oh, it started working properly once I set grafana.domainName to the Grafana instance deployed alongside VictoriaMetrics! Thank you for the advice!

However, I don't quite understand why Kubecost needs to connect to Grafana at all. Since the metrics data is already viewable, I would expect this step to be unnecessary, so it should not be required when grafana.enabled = false.
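For completeness, here is a values sketch combining the two settings discussed in this thread, using the service names from the `kubectl get svc -n vm` output above; the exact nesting of `grafana` under `global` is an assumption based on the comments, not a verified configuration:

```yaml
# Sketch of the working combination described in this thread (not a verified config).
global:
  prometheus:
    enabled: false
    fqdn: http://vmselect-vm-stack-victoria-metrics-k8s-stack.vm.svc:8481/select/0/prometheus
  grafana:
    enabled: false                       # Grafana is provided by the VictoriaMetrics stack
    domainName: vm-stack-grafana.vm.svc  # per the comments, the frontend crash-loops if it cannot reach this host
```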