quintilesims / layer0

Build, Manage, and Deploy Your Applications
Apache License 2.0
44 stars 20 forks source link

Fix CPU leak caused by `time.Tick` #621

Closed tlake closed 6 years ago

tlake commented 6 years ago

What does this pull request do?

layer0/common/aws/provider/connection.go contained code for a ticker intended to provide a rate limit for the l0-api. However, this resulted in infinitely-growing CPU usage on the AWS instance because:

This PR replaces time.Tick() with time.NewTicker() in order to obtain the mechanism by which a ticker can be stopped. The ticker - henceforth referred to as the rate limiter - is initialized with a default time.Duration value, and control of the rate limiter is passed to outside packages by way of public functions. api/main.go and runner/main.go now handle the call to time.ParseDuration() so that they can return usefully (instead of causing a panic down the line if time.NewTicker() is called with nil), and they also contain defer provider.StopRateLimiter() to make sure that they don't leave dangling resources.

How should this be tested?

1. Deploy

Deploy a new new Layer0 instance with l0-setup.

2. Monitor

Use one or both methods below to monitor CPU usage of the instance. You can try hammering it with a lot of tasks (for n in {1..999} ; do echo iter ${n}: $(l0 task create ENVIRONMENT task${n} DEPLOY) ; done) or services or something. In any event, the CPU should respond accordingly, but - and here's the key - it should always return to consuming barely anything when idle.

If you were to deploy an API of v0.10.4-v0.10.8 and monitored it the same way, you would see the CPU usage floor of an idle API gradually increasing with time; it would grow faster the more AWS calls it had to make.

AWS Console

SSH and htop

Some charts

Three-day CPU usage of Layer0 API in v0.10.8:

Three-day CPU usage of Layer0 API in v0.10.8

Three-day CPU usage of Layer0 API with these changes:

Three-day CPU usage of Layer0 API with these changes

Checklist

I'm actually not sure how best to test this particular behavior, if there's a good way to test it at all.

Issues

Closes #620.

diemonster commented 6 years ago

Error I got (once) while testing: ServerError (code=21) Failed to start task: RESOURCE:MEMORY

This was after about 55 tasks have been submitted in ~3 mins

Oddly, mem usage via htop only reported about 400/2000mb used.

On subsequent runs, I wasn't able to repro this error.