kubecost / features-bugs

A public repository for filing of Kubecost feature requests and bugs. Please read the issue guidelines before filing an issue here.

[Bug] Network costs crashing on node startup #56

Closed elcomtik closed 1 month ago

elcomtik commented 9 months ago

Kubecost Version

2.0.2

Kubernetes Version

1.28

Kubernetes Platform

EKS

Description

I updated Kubecost from v1.105.1 to v1.106.7 and later to v2.0.2.

Network costs was updated from v0.6.7 to v0.7.2, and the problems started when a new k8s node was created.

I use standard AL2 EKS nodes, and this setup worked before the update. Downgrading the network-costs container seems to have fixed the issue temporarily.

Steps to reproduce

When a new k8s node is created, the DaemonSet starts a network-costs pod on the new node, and the pod logs the following errors:

ERROR kube_client::client::builder: failed with error error trying to connect: deadline has elapsed
ERROR kube_client::client::builder: failed with error error trying to connect: deadline has elapsed
ERROR kube_client::client::builder: failed with error error trying to connect: deadline has elapsed
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: HyperError(hyper::Error(Connect, Custom { kind: TimedOut, error: Elapsed(()) }))', src/kube/netmap.rs:570:44
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: HyperError(hyper::Error(Connect, Custom { kind: TimedOut, error: Elapsed(()) }))', src/kube/netmap.rs:619:45
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: HyperError(hyper::Error(Connect, Custom { kind: TimedOut, error: Elapsed(()) }))', src/kube/netmap.rs:663:47

The pod does not emit any metrics and is not restarted automatically.

If I restart it manually, it starts working as expected.
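
The panic messages show `Result::unwrap()` being called on a connection timeout. A minimal sketch of the alternative, retrying with backoff and returning the error instead of panicking, might look like the following. This is not the code from `src/kube/netmap.rs`, which is not shown in this issue; `retry_with_backoff` and `flaky_connect` are hypothetical names, and `flaky_connect` only simulates the API server call.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `op` (passing the 1-based attempt number) until it succeeds or
/// `max_attempts` attempts have been made, doubling the delay after each
/// failure. `op` is always called at least once. The last error is returned
/// to the caller instead of being unwrapped.
fn retry_with_backoff<T, E, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: F,
) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the error back rather than panicking.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay *= 2;
                attempt += 1;
            }
        }
    }
}

/// Hypothetical stand-in for the API server call: fails with a timeout-like
/// error for the first `failures` attempts, then succeeds.
fn flaky_connect(attempt: u32, failures: u32) -> Result<&'static str, String> {
    if attempt <= failures {
        Err(format!("attempt {}: deadline has elapsed", attempt))
    } else {
        Ok("connected")
    }
}
```

For example, `retry_with_backoff(5, Duration::from_millis(500), |n| flaky_connect(n, 3))` returns `Ok("connected")` on the fourth attempt, after sleeping 0.5 s, 1 s, and 2 s. If every attempt fails, the caller receives the last error and can decide to exit the process so that Kubernetes restarts the pod.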

Expected behavior

Network costs starts without panicking, or the pod is restarted by the liveness probe when it fails.

Impact

Network cost metrics are not complete.

Screenshots

No response

Slack discussion

No response

AjayTripathy commented 9 months ago

cc @mbolt35 is this a timeout contacting the k8s apiserver?

AjayTripathy commented 8 months ago

Hi @elcomtik, we're pretty sure that this is an intermittent timeout contacting the API server on pod startup, which is why the restart/rollback appears to fix it. Could you try again and see if the behavior is the same?

elcomtik commented 8 months ago

What version should I test?

AjayTripathy commented 8 months ago

v0.17.3

zumic96 commented 6 months ago

I have added a startupProbe and a livenessProbe to the DaemonSet to work around this issue. Strangely, I thought they were there by default.

livenessProbe:
  tcpSocket:
    port: 3001
  initialDelaySeconds: 5
  timeoutSeconds: 1
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 5
startupProbe:
  tcpSocket:
    port: 3001
  initialDelaySeconds: 5
  timeoutSeconds: 1
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 5
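
To make the effect of these settings concrete, here is a rough sketch of how a `tcpSocket` probe with a failure threshold behaves: each check tries to open a TCP connection within the timeout, and the container is restarted once the number of failures in a row reaches the threshold. This is a simplified model of the documented probe semantics, not kubelet code; `tcp_check` and `FailureCounter` are hypothetical names.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// One tcpSocket-style check: it succeeds if a TCP connection to `addr`
/// can be opened within `timeout` (compare `timeoutSeconds`).
fn tcp_check(addr: &SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(addr, timeout).is_ok()
}

/// Tracks consecutive probe failures (compare `failureThreshold`).
/// With `successThreshold: 1`, a single success resets the count.
struct FailureCounter {
    threshold: u32,
    consecutive: u32,
}

impl FailureCounter {
    fn new(threshold: u32) -> Self {
        FailureCounter { threshold, consecutive: 0 }
    }

    /// Records one probe result and returns true once the number of
    /// consecutive failures has reached the threshold, which is the point
    /// at which the container would be restarted.
    fn record(&mut self, success: bool) -> bool {
        if success {
            self.consecutive = 0;
        } else {
            self.consecutive += 1;
        }
        self.consecutive >= self.threshold
    }
}
```

With the values above, a container whose port 3001 never accepts connections fails its fifth consecutive check and is restarted, which is presumably what recovers the panicked network-costs pod.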

elcomtik commented 5 months ago

I just tested v0.17.3 and it works well. This issue can be closed.

Thanks a lot!

elcomtik commented 5 months ago

I have to admit that I drew a premature conclusion. The issue is still the same.

elcomtik commented 5 months ago

@zumic96 Adding the probes works for me too.

chipzoller commented 1 month ago

Hello, in an effort to consolidate our bug and feature request tracking, we are deprecating using GitHub to track tickets. If this issue is still outstanding and you have not done so already, please raise a request at https://support.kubecost.com/.