grafeas / kritis

Deploy-time Policy Enforcer for Kubernetes applications
https://github.com/grafeas/kritis/blob/master/docs/binary-authorization.md
Apache License 2.0
699 stars 135 forks source link

Kritis leaks memory #391

Closed klauern closed 5 years ago

klauern commented 5 years ago

Expected Behavior

It runs at a stable memory footprint

Actual Behavior

it doesn't?

Steps to Reproduce the Problem

Run it for a while

Additional info

I don't know how to better debug the issue, but here's a graph: image

Each of these is an individual container on a separate Kubernetes cluster, so they are similar in nature but will have their own footprint.

I'd appreciate any guidance on what I can do to better provide information on what is going on. This is my only window into the view that I can find, and I'm scratching my head to find other ways to get this information on a running cluster.

klauern commented 5 years ago

Note that the one that looks the most stable is our least-used server, so it doesn't get any traffic, hence it doesn't really fluctuate.

ooq commented 5 years ago

Hi @klauern , can you describe the workload a little bit?

vbanthia-zz commented 5 years ago

We also faced this issue and I did some profiling and found out memory leak is because of grafeasclient connections are not closed after the request is handled and there are many zombie open connections.

I created debug docker image with netstat installed and observe new grpc connections (TCP) are getting created whenever a new pod is created. These connections do not get closed and left open.

root@kritis-validation-hook-f76fd4c75-z2cvg:/go# netstat -W | wc -l
524

Connections

tcp        0      0 kritis-validation-hook-f76fd4c75-z2cvg:55436 nrt12s23-in-f10.1e100.net:https ESTABLISHED
tcp        0      0 kritis-validation-hook-f76fd4c75-z2cvg:44904 nrt12s13-in-f10.1e100.net:https ESTABLISHED
tcp        0      0 kritis-validation-hook-f76fd4c75-z2cvg:54140 nrt12s15-in-f74.1e100.net:https ESTABLISHED

NOTE: 1e100.net is google domain for containeranalysis API.

We internally use kritis fork and temporary using this fix. https://github.com/mercari/kritis/pull/18

Ideally, Kritis should not create grafeasclient inside handler but instead should be initialized before calling request handler so that same connection can be reused between goroutines (request handlers)

(https://github.com/grafeas/kritis/blob/681e6d3e8a4675d386e580fd643304c5e0263245/pkg/kritis/admission/admission.go#L246)

aysylu commented 5 years ago

Thank you @klauern for reporting the issue and @vbanthia for contributing a potential fix! I'd be happy to review a PR with the fix, merge and release it as 0.1.1.

vbanthia-zz commented 5 years ago

Wrote a PR to fix this issue. https://github.com/grafeas/kritis/pull/394

aysylu commented 5 years ago

Thanks for the fix, @vbanthia! Let's keep this issue open until we've refactored Kritis code per your comment in #394:

Ideally, Kritis should not create upstream client inside handler. It should be created outside and passed as an argument to request handlers. This way same client can be used by all goroutines. This will require a bit of refactoring, till then we can use this solution to fix memory issues.

Happy to accept contributions that address this with refactoring.

klauern commented 5 years ago

We will be testing this out in the next week on our internal tools K8s cluster. We should be able to tell fairly quickly if this is fixed.

aysylu commented 5 years ago

Great, thanks for the update! Looking forward to hearing from you.

On Thu, Sep 26, 2019 at 11:05 AM Nick Klauer notifications@github.com wrote:

We will be testing this out in the next week on our internal tools K8s cluster. We should be able to tell fairly quickly if this is fixed.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/grafeas/kritis/issues/391?email_source=notifications&email_token=AAEXLWUK4KEQBQBR5GAZ5BTQLTFUFA5CNFSM4IOGQS22YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7V46VQ#issuecomment-535547734, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEXLWUVESMYZP5QDEIOO3DQLTFUFANCNFSM4IOGQS2Q .

aysylu commented 5 years ago

@klauern have you been able to confirm this by any chance?

aysylu commented 5 years ago

Marking this resolved due to inactivity. Please re-open if this is still an issue.