kubernetes / kubectl

Issue tracker and mirror of kubectl code
Apache License 2.0

kubectl delete leaks network connections when deleting multiple resources, causing warnings or errors #1128

Closed · amarshall closed this issue 1 year ago

amarshall commented 3 years ago

What happened:

kubectl delete leaks network connections when deleting multiple resources, causing warnings or errors

What you expected to happen:

kubectl to not leak network connections

How to reproduce it (as minimally and precisely as possible):

  1. Create 1001+ pods
  2. Perform kubectl delete pod -l mypod=foo
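
For step 1, here is one possible way to create the pods in bulk. This is a sketch added for convenience, not part of the original report; client-go, the `default` namespace, the default kubeconfig location, and the pause image are assumptions.

```go
// Sketch: create 1001 pods labelled mypod=foo so the delete below fans out
// over enough resources to trigger the warning.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	for i := 0; i < 1001; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:   fmt.Sprintf("mypod-foo-%d", i),
				Labels: map[string]string{"mypod": "foo"},
			},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{Name: "pause", Image: "registry.k8s.io/pause:3.9"}},
			},
		}
		if _, err := cs.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```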

Receive warning (Linux):

W1011 16:39:05.007295 3346 exec.go:282] constructing many client instances from the same exec auth config can cause performance problems during cert rotation and can exhaust available network connections; 1001 clients constructed calling "aws"

Receive error (macOS):

socket: too many open files

Anything else we need to know?:

Environment:

k8s-ci-robot commented 3 years ago

@amarshall: This issue is currently awaiting triage.

SIG CLI takes a lead on issue triage for this repo, but any Kubernetes member can accept issues by applying the triage/accepted label.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
eddiezane commented 2 years ago

This may be fixed in the upcoming 1.23 release or by https://github.com/kubernetes/kubernetes/pull/105723.

You might be able to work around this temporarily by raising the ulimit on your Mac.

@ash2k this might still be an issue here? https://github.com/kubernetes/kubectl/blob/a450ebd59c1e8917df23d37c6f05d8e16c3746aa/pkg/cmd/delete/delete.go#L392

eddiezane commented 2 years ago

/assign atiratree

ash2k commented 2 years ago

@eddiezane I'm not sure, sorry.

atiratree commented 2 years ago

@amarshall this seems like a duplicate of https://github.com/kubernetes/kubernetes/issues/91913. Can you please try again with the latest kubectl?

Sidenote: I have also tested this with plain Kubernetes (not aws-iam) and have not observed such an issue, as only one connection is opened.

eddiezane commented 2 years ago

Great reproduction example in https://github.com/kubernetes/kubectl/issues/1152

atiratree commented 2 years ago

Thanks, I managed to reproduce #1152 on AWS EKS (1.21) with the latest kubectl. Vanilla Kubernetes works fine.

Seeing multiple connections (`lsof` output):


kubectl 430171 user   18u     IPv4 2968415      0t0      TCP p1:49786->ec2-IP.eu-central-1.compute.amazonaws.com:https (ESTABLISHED)
kubectl 430171 user   20u     IPv4 2968418      0t0      TCP p1:40788->ec2-IP.eu-central-1.compute.amazonaws.com:https (ESTABLISHED)
kubectl 430171 user   22u     IPv4 2974513      0t0      TCP p1:49790->ec2-IP.eu-central-1.compute.amazonaws.com:https (ESTABLISHED)

After reaching the `ulimit -n` open files limit, I get connection errors:

Unable to connect to the server: dial tcp: lookup 123456789ABCDEF.gr7.eu-central-1.eks.amazonaws.com: too many open files
Unable to connect to the server: dial tcp: lookup 123456789ABCDEF.gr7.eu-central-1.eks.amazonaws.com: too many open files
Unable to connect to the server: dial tcp: lookup 123456789ABCDEF.gr7.eu-central-1.eks.amazonaws.com: too many open files

I am planning to work on a fix soon.

atiratree commented 2 years ago

This bug applies to all commands (create/apply/delete/describe) that use the exec credentials plugin.

Description of the current code flow and behaviour for kubectl delete:

when running delete

https://github.com/kubernetes/kubernetes/blob/267272efe0725e14b3c2c7bc6fa3dd64a922a6a7/staging/src/k8s.io/kubectl/pkg/cmd/delete/delete.go#L212

with client obtained here

https://github.com/kubernetes/kubernetes/blob/267272efe0725e14b3c2c7bc6fa3dd64a922a6a7/staging/src/k8s.io/cli-runtime/pkg/resource/builder.go#L905

and initialized via

for each deletion the new client stored in the info is used

https://github.com/kubernetes/kubernetes/blob/267272efe0725e14b3c2c7bc6fa3dd64a922a6a7/staging/src/k8s.io/kubectl/pkg/cmd/delete/delete.go#L326

and also for each deletion a new client is created

https://github.com/kubernetes/kubernetes/blob/267272efe0725e14b3c2c7bc6fa3dd64a922a6a7/staging/src/k8s.io/client-go/rest/config.go#L330

and a new TransportConfig is created

https://github.com/kubernetes/kubernetes/blob/267272efe0725e14b3c2c7bc6fa3dd64a922a6a7/staging/src/k8s.io/client-go/rest/transport.go#L33

which is then amended with the exec credentials provider

https://github.com/kubernetes/kubernetes/blob/267272efe0725e14b3c2c7bc6fa3dd64a922a6a7/staging/src/k8s.io/client-go/rest/transport.go#L132

The Authenticator is cached, so there are no duplicate queries to the executable until the token expires (even when used across multiple clients):

https://github.com/kubernetes/kubernetes/blob/267272efe0725e14b3c2c7bc6fa3dd64a922a6a7/staging/src/k8s.io/client-go/plugin/pkg/client/auth/exec/exec.go#L163-L168

The tlsCache is used for transports if a custom transport is not specified: https://github.com/kubernetes/kubernetes/blob/267272efe0725e14b3c2c7bc6fa3dd64a922a6a7/staging/src/k8s.io/client-go/transport/transport.go#L50
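
To make the consequence of this flow concrete, here is a small sketch of my own (not from the thread) contrasting a per-resource clientset, which is roughly what the delete flow above ends up doing, with a single reused clientset. Whether each extra clientset really gets its own transport depends on the TLS transport cache, which, per the discussion and PRs below, exec-plugin configs did not hit at the time.

```go
// Sketch. Assumptions: client-go, kubeconfig in the default location, an
// exec credential plugin (e.g. "aws eks get-token") configured in it.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// Pattern mirroring the flow described above: a fresh clientset per
	// resource. When the config cannot be served from the TLS transport
	// cache (as with exec credential plugins back then), every clientset
	// carries its own transport and, once it issues requests, its own
	// pool of TCP connections.
	for i := 0; i < 3; i++ {
		if _, err := kubernetes.NewForConfig(config); err != nil {
			panic(err)
		}
	}

	// Reusing one clientset keeps a single transport and connection pool
	// no matter how many resources are processed.
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := cs.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "mypod=foo"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d pods with a single shared client\n", len(pods.Items))
}
```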

atiratree commented 2 years ago

^ I think the fix should just be to enable TLS caching for this exec plugin use case.

atiratree commented 2 years ago

Posted a PR with a fix: https://github.com/kubernetes/kubernetes/pull/108274

Until that gets accepted, the following workaround can be used:

TOKEN="$(AWS_PROFILE=my-profile aws --region my-region eks get-token --cluster-name my-cluster | jq -r '.status.token')"

kubectl delete pod -l mypod=foo --token="$TOKEN"

atiratree commented 2 years ago

> I think the fix should just be to enable TLS caching for this exec plugin use case.

This approach was discussed in https://github.com/kubernetes/kubernetes/pull/108274#discussion_r812046486 and turned out not to be viable.

I will try to refactor kubectl generically and use a similar approach to https://github.com/kubernetes/kubernetes/pull/105490.

brianpursley commented 2 years ago

@atiratree I think you've got the right idea about kubectl needing to reuse a single transport across all requests.

It looks like anything that uses the builder will have this problem, where a new client is created for each invocation of Visit().
https://github.com/kubernetes/kubernetes/blob/0b8d725f5a04178caf09cd802305c4b8370db65e/staging/src/k8s.io/cli-runtime/pkg/resource/visitor.go#L430

It looks like getClient() will always return a new client, and there is currently no way to tell the builder to use a specific client instance. https://github.com/kubernetes/kubernetes/blob/0b8d725f5a04178caf09cd802305c4b8370db65e/staging/src/k8s.io/cli-runtime/pkg/resource/builder.go#L925-L945 (Well, the new client is actually created deep inside the call stack, but that is where it begins.)

I wonder if a good approach would be to add a WithClient(client *http.Client) function to the builder. Is this along the lines of what you were thinking? Is there any other way to tell the builder to use a specific client instead of creating its own?
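
To illustrate the idea, here is a hypothetical sketch of what such a hook could look like; the names and internals are illustrative only, not the real cli-runtime API, and the merged fix referenced in the next comment went through ConfigFlags instead.

```go
// Hypothetical sketch of a WithClient option on the resource builder.
package resource

import "net/http"

// Builder stands in for cli-runtime's resource.Builder; real fields elided.
type Builder struct {
	client *http.Client // shared HTTP client reused for every request
}

// WithClient makes the builder reuse the given client (and its transport)
// instead of constructing a new one per Visit() invocation.
func (b *Builder) WithClient(client *http.Client) *Builder {
	b.client = client
	return b
}

// getClient returns the injected client when present, falling back to a
// freshly constructed one (today's behaviour) otherwise.
func (b *Builder) getClient() *http.Client {
	if b.client != nil {
		return b.client
	}
	return &http.Client{}
}
```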

atiratree commented 2 years ago

@brianpursley Thanks. I had to go down to the ConfigFlags level to also support the DiscoveryClient. Please see https://github.com/kubernetes/kubernetes/pull/108459 for the implementation.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

atiratree commented 2 years ago

/remove-lifecycle stale

enj commented 2 years ago

kubectl should definitely stop making an infinite number of clients, but until it does, https://github.com/kubernetes/kubernetes/pull/112017 should prevent connection leaks.

enj commented 2 years ago

> kubectl should definitely stop making an infinite number of clients, but until it does, kubernetes/kubernetes#112017 should prevent connection leaks.

This fix is merged and backported. The next release of kubectl should include the fix.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/kubectl/issues/1128#issuecomment-1419045647):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.