quintilesims / layer0


API CPU utilization pegged #620

Closed davemeyer closed 6 years ago

davemeyer commented 6 years ago

Expected behavior

The API service should be performant and not consume all available resources.

Actual behavior

The l0-api service consumes 100% of available CPU shortly after being started, and reboots result in the same behavior. Changing the API instance type from t2.small to t2.xlarge is helping for now, but it looks like the same trend will recur.

Once CPU usage is maxed out, the API seems far more likely to return the infamous EOF error, resulting in failed environments with incorrect state.

Steps to reproduce the behavior

Let the API run for a day.

We're on v0.10.8 currently, but we've seen this since at least v0.10.5.

tlake commented 6 years ago

After much digging, I think I've finally gotten a handle on the problem.

In Layer0 v0.10.4, in order to solve some rate-limiting issues, a ticker was introduced in the layer0/common/aws/provider/connection.go::getConfig() function. This function is called several times at the top of the l0-api's execution, and then once more every time a connection to AWS is made.

According to Golang's documentation, the particular time.Tick() we used is billed as a "convenience wrapper" for time.NewTicker(), but time.Tick() provides "access to the ticking channel only." The ticker created by time.Tick() does not yield a mechanism for calling .Stop() on the ticker object, nor will Golang's garbage collector ever clean up such a ticker object. In essence, once created, there is no way to stop these time.Tick() tickers, short of stopping the Go process itself.

In summary, then, l0-api is creating a new, permanent ticker with every request to AWS, and each ticker alone is using up CPU to constantly poll the system clock.
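For illustration, here's a minimal sketch of that leaking pattern. This is not the actual connection.go code, and the interval is made up - the point is just that every call parks a new, unstoppable ticker in the runtime:

```go
package main

import "time"

// leakyThrottle mimics the pattern described above: every call creates a
// brand-new time.Tick() channel to rate-limit itself. The ticker behind that
// channel exposes no Stop() and is never cleaned up, so it keeps firing
// (and consuming CPU) for the life of the process.
func leakyThrottle() {
	tick := time.Tick(10 * time.Millisecond) // hypothetical rate limit
	<-tick                                   // wait for a tick, then do the "AWS call"
	// returning here does NOT clean up the ticker behind `tick`
}

func main() {
	// Each iteration stands in for one AWS connection made by the l0-api;
	// after N calls there are N live tickers, all polling the clock.
	for i := 0; i < 100; i++ {
		leakyThrottle()
	}
	select {} // block so the leaked tickers can be observed (pprof, top, etc.)
}
```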

It seems like the solution may be to replace time.Tick() with time.NewTicker() and make sure to call .Stop() on it at the end of each connection.
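Roughly something like this - names, package, and interval are placeholders, not the real connection.go signatures:

```go
package provider

import "time"

// throttledCall sketches the proposed fix: create a stoppable ticker with
// time.NewTicker() and guarantee it is stopped when the connection work is
// done, so nothing outlives the call.
func throttledCall(do func() error) error {
	ticker := time.NewTicker(10 * time.Millisecond) // hypothetical rate limit
	defer ticker.Stop()                             // ticker is cleaned up on return

	<-ticker.C // wait for the next tick before making the AWS call
	return do()
}
```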

tlake commented 6 years ago

Further information:

It looks like the current implementation does intend to create the ticker as a singleton, sort of.

The l0-api, at startup, makes a call to GetBackend(), which will eventually hit getConfig() once per thing-that-needs-a-backend (tagStore, s3provider, iamProvider, ec2Provider, ecsProvider, elbProvider, autoscalingProvider, cloudwatchlogsProvider, I believe - at least, these are the things that come back from GetBackend()). I've only traced the logic through the ec2Provider, but there are several other get[AWS_THING]Connection() functions.

So there is a finite number of tickers spun up for the main l0-api, and even though that number is greater than one, that alone wouldn't be a problem: tracking a half-dozen redundant tickers consumes some CPU unnecessarily, but it isn't an infinitely-growing situation.

But GetBackend() is also called once for every l0-runner that is spun up. l0-runner processes get spun up all the time, do their thing, and then go away - except for the open ticker each one leaves behind, which has no mechanism for closing it and which cannot be garbage collected.

Possible solution

I think there are two things to do:

- Replace time.Tick() with time.NewTicker() in getConfig() so the ticker can actually be stopped when a connection is finished with it.
- Make the ticker a true singleton - created once and shared - so the number of live tickers stays constant instead of growing with every call to GetBackend().

I'll also need to see how well this fix holds up for the runner processes.
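For the singleton angle, here is a sketch of what "create it once and share it" could look like - again placeholder names and interval, not the actual Layer0 code. Every caller waits on the same shared channel, so the ticker count stays at one per process no matter how many connections are made, and that one ticker deliberately lives for the life of the process:

```go
package provider

import (
	"sync"
	"time"
)

var (
	tickerOnce sync.Once
	rateLimit  <-chan time.Time
)

// throttle returns a single shared ticker channel, created exactly once per
// process. Callers receive from it before talking to AWS, which preserves the
// rate limiting without ever creating more than one ticker.
func throttle() <-chan time.Time {
	tickerOnce.Do(func() {
		rateLimit = time.NewTicker(10 * time.Millisecond).C // hypothetical interval
	})
	return rateLimit
}

// Example usage inside a hypothetical getConfig()-style function:
//
//	<-throttle() // wait your turn
//	// ... build the AWS config/session as before ...
```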