googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

CPU/Memory leak issue caused by goroutines that never complete #636

Closed ilkercelikyilmaz closed 5 years ago

ilkercelikyilmaz commented 5 years ago

I was running a load/stress test that allocates 10K gameservers every 30 minutes (each gameserver shuts down 10 minutes after allocation). Agones performance deteriorated with every run. It turns out that the 10K+ goroutines created in each run never complete. After about 7 hours, the total number of goroutines reached around 400K.


The issue is caused by the WaitForCacheSync call in gameserver creation. Under a huge number of calls, the cache never syncs and all the goroutines keep waiting indefinitely.

This may be related to #414.

Working on a fix now. Will submit a PR.
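
To illustrate the indefinite wait described above, here is a minimal sketch of a polling wait that only returns when the condition holds or the stop channel closes, plus a wrapper that bounds the wait with a timeout. This is not the Agones or client-go code: waitForSync, waitForSyncWithTimeout, ErrStopped, and the demo constants are hypothetical names chosen for illustration, and the timeout wrapper is just one possible way to keep a waiting goroutine from living forever.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrStopped is returned when the stop channel closes before the condition holds.
// (Hypothetical name, for illustration only.)
var ErrStopped = errors.New("stopped before condition was met")

// Demo values (assumptions, not taken from the issue).
const (
	demoInterval = time.Millisecond
	demoTimeout  = 20 * time.Millisecond
)

func neverSynced() bool   { return false } // simulates a cache that never syncs
func alreadySynced() bool { return true }  // simulates a cache that is already synced

// waitForSync polls synced() every interval. If synced() never returns true
// and stopCh is never closed, the calling goroutine blocks forever, which is
// the shape of the leak described in this issue.
func waitForSync(interval time.Duration, synced func() bool, stopCh <-chan struct{}) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if synced() {
			return nil
		}
		select {
		case <-stopCh:
			return ErrStopped
		case <-ticker.C:
		}
	}
}

// waitForSyncWithTimeout closes the stop channel after timeout, so the wait
// is bounded even when the cache never syncs.
func waitForSyncWithTimeout(interval, timeout time.Duration, synced func() bool) error {
	stopCh := make(chan struct{})
	timer := time.AfterFunc(timeout, func() { close(stopCh) })
	defer timer.Stop() // prevents the close if the wait finished first
	return waitForSync(interval, synced, stopCh)
}

func main() {
	fmt.Println(waitForSyncWithTimeout(demoInterval, demoTimeout, neverSynced))
	fmt.Println(waitForSyncWithTimeout(demoInterval, demoTimeout, alreadySynced))
}
```

With a stop channel that is never closed, each call to waitForSync with neverSynced would park one goroutine permanently, so 10K calls per run would add 10K stuck goroutines per run.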

aLekSer commented 5 years ago

Hello @ilkercelikyilmaz, this looks like the same old issue, which resurfaced after running dep ensure. I believe it has been fixed again, as before, in: https://github.com/GoogleCloudPlatform/agones/commit/1bdd5a5083c9f732c58f8be9521505b7fb37a764 So please update and rerun your test, at least the first part. 😄

ilkercelikyilmaz commented 5 years ago

Hi @aLekSer, you are right, this fixes the issue. I had fixed it locally by removing the WaitForCacheSync call, but I have now tested with your fix only (without removing WaitForCacheSync). There is no more leak, but GS creation with WaitForCacheSync takes longer under the load test. Do you think we can close #414? Thanks, ilker

aLekSer commented 5 years ago

Thanks @ilkercelikyilmaz for checking it with updated master. Regarding this memory leak ticket, I think you should close it and open a new ticket or pull request with a different description if you see some other problem with the WaitForCacheSync() function. WaitForCacheSync() uses WaitFor() underneath, through the following function:

func PollUntil(interval time.Duration, condition ConditionFunc, stopCh <-chan struct{}) error {
    return WaitFor(poller(interval, 0), condition, stopCh)
}

So fixing WaitFor helps here also.
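
The change in 1bdd5a5 is not reproduced here. As a rough illustration of why a wait built on a background poller can leak, the sketch below shows a ticker goroutine that exits only when told to, and a wait function that always signals it to stop before returning. All names (startPoller, waitUntil, exitedWithin, pollEvery, grace) are hypothetical and are not the apimachinery implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Demo values (assumptions for illustration).
const (
	pollEvery = time.Millisecond
	grace     = time.Second
)

// startPoller sends a tick every interval until done is closed.
// The second returned channel is closed when the goroutine has returned.
func startPoller(interval time.Duration, done <-chan struct{}) (<-chan struct{}, <-chan struct{}) {
	ticks := make(chan struct{})
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		defer close(ticks)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// Without the done case here, the goroutine would block on
				// this send forever once nobody reads ticks any more.
				select {
				case ticks <- struct{}{}:
				case <-done:
					return
				}
			case <-done:
				return
			}
		}
	}()
	return ticks, exited
}

// waitUntil checks condition on every tick and closes done on every return
// path, so the poller goroutine is always released.
func waitUntil(interval time.Duration, condition func() bool) <-chan struct{} {
	done := make(chan struct{})
	defer close(done)
	ticks, exited := startPoller(interval, done)
	for range ticks {
		if condition() {
			break
		}
	}
	return exited
}

// exitedWithin reports whether the poller goroutine finished within d.
func exitedWithin(exited <-chan struct{}, d time.Duration) bool {
	select {
	case <-exited:
		return true
	case <-time.After(d):
		return false
	}
}

func main() {
	calls := 0
	exited := waitUntil(pollEvery, func() bool { calls++; return calls >= 3 })
	fmt.Println("condition checks:", calls)
	fmt.Println("poller exited:", exitedWithin(exited, grace))
}
```

If waitUntil returned without closing done, every call would leave one poller goroutine behind, which matches the steadily growing goroutine count reported at the top of this issue.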

ilkercelikyilmaz commented 5 years ago

Since the CPU/memory leak issue is fixed with 1bdd5a5, I am closing this issue.