Closed davemeyer closed 6 years ago
After much digging, I think I've finally gotten a handle on the problem.
In Layer0 v0.10.4, in order to solve some rate-limiting issues, a ticker was introduced in the `layer0/common/aws/provider/connection.go::getConfig()` function. This function is called several times at the top of the `l0-api`'s execution, and then once more every time a connection to AWS is made.

According to Golang's documentation, the particular `time.Tick()` we used is billed as a "convenience wrapper" for `time.NewTicker()`, but `time.Tick()` provides "access to the ticking channel only." The ticker created by `time.Tick()` offers no way to call `.Stop()` on the ticker object, nor will Golang's garbage collector ever clean up such a ticker. In essence, once created, there is no way to stop these `time.Tick()` tickers, short of stopping the Go process itself.

In summary, then, `l0-api` is creating a new, permanent ticker with every request to AWS, and each of those tickers uses CPU to constantly poll the system clock.

It seems like the solution is to replace `time.Tick()` with `time.NewTicker()` and make sure to call `.Stop()` on it at the end of each connection.
It looks like the current implementation does intend to create the ticker as a singleton, sort of. The `l0-api`, at startup, makes a call to `GetBackend()`, which will eventually hit `getConfig()` once per thing-that-needs-a-backend (`tagStore`, `s3provider`, `iamProvider`, `ec2Provider`, `ecsProvider`, `elbProvider`, `autoscalingProvider`, `cloudwatchlogsProvider`, I believe - at least, these are the things that come back from `GetBackend()`; I've traced the logic only through the `ec2Provider`, but there are several other `get[AWS_THING]Connection()` functions).
So there is a finite number of tickers spun up for the main `l0-api`, and even though that number is greater than one, it wouldn't be a problem on its own. Tracking a half-dozen redundant tickers would consume some amount of CPU unnecessarily, but it wouldn't be an infinitely-growing situation.
But then `GetBackend()` is also called once for every `l0-runner` that is spun up. `l0-runner` processes get spun up all the time, do their thing, and then go away - except for the open ticker each one leaves behind, which has no mechanism for closing it and which cannot be garbage collected.
I think there are two things to do:

1. Move the ticker's creation into an `init()` function. This will ensure that the ticker is only created once, at the time of package import.
2. Swap `time.Tick()` for `time.NewTicker()` to gain access to `ticker.Stop()`.

I'll also need to see how well this fix holds up for the runner processes.
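A minimal sketch of that plan (hypothetical names; the real `getConfig()` wiring will differ): one package-level ticker created in `init()`, shared by all callers, with an explicit stop hook for short-lived processes.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch, not Layer0's actual code: a single package-level
// ticker created once in init(), so every connection shares it instead of
// spawning a new, unstoppable time.Tick.

var rateTicker *time.Ticker

func init() {
	// Created exactly once, at package import time.
	rateTicker = time.NewTicker(50 * time.Millisecond)
}

// waitForTick blocks until the shared ticker fires; a getConfig-style
// function would call this before each AWS request to rate-limit itself.
func waitForTick() {
	<-rateTicker.C
}

// shutdown stops the shared ticker so a short-lived process (like an
// l0-runner) doesn't leave it running when it exits.
func shutdown() {
	rateTicker.Stop()
}

func main() {
	for i := 0; i < 3; i++ {
		waitForTick()
	}
	shutdown()
	fmt.Println("shared ticker stopped")
}
```

Because the ticker is a package singleton, the per-request accumulation goes away, and the explicit `shutdown()` addresses the runner case where the process ends before the ticker would otherwise be released.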
Expected behavior
The API service should be performant and not consume all available resources.
Actual behavior
The l0 API service consumes 100% of available CPU shortly after being started. Reboots result in the same behavior. Changing the API instance type from t2.small to t2.xlarge helps, but it looks like the same trend will recur.
After we hit max CPU usage, the API seems far more likely to return the infamous `EOF` error, resulting in failed environments with incorrect state.

Steps to reproduce the behavior
Let the API run for a day.
v0.10.8 currently, but we've seen this since at least v0.10.5.