Closed - tlake closed this pull request 6 years ago
Error I got (once) while testing:

```
ServerError (code=21) Failed to start task: RESOURCE:MEMORY
```

This was after about 55 tasks had been submitted in ~3 minutes. Oddly, memory usage via htop only reported about 400/2000 MB used. On subsequent runs, I wasn't able to reproduce this error.
What does this pull request do?
`layer0/common/aws/provider/connection.go` contained code for a ticker intended to provide a rate limit for the `l0-api`. However, this resulted in infinitely-growing CPU usage on the AWS instance, because a ticker was created every time `connection.go::getConfig()` was called, and `time.Tick()` (used instead of `time.NewTicker()`) yielded only the ticker - it did not yield a mechanism by which to terminate the ticker.

This PR replaces `time.Tick()` with `time.NewTicker()` in order to obtain the mechanism by which a ticker can be stopped. The ticker - henceforth referred to as the rate limiter - is initialized with a default `time.Duration` value, and control of the rate limiter is passed to outside packages by way of public functions. `api/main.go` and `runner/main.go` now handle the call to `time.ParseDuration()` themselves so that they can return usefully (instead of causing a panic down the line when `time.NewTicker()` is called with an invalid duration), and they also contain `defer provider.StopRateLimiter()` to make sure that they don't leave dangling resources.

How should this be tested?
1. Deploy

Deploy a new Layer0 instance with `l0-setup`, using:

- module source: `github.com/quintilesims/layer0//setup/module?ref=620-api-cpu-utilization`
- instance name: `tlake620`
- key pair: `xfra-dev`, or whatever key pair you use (necessary if you want to monitor the instance with SSH and htop)

2. Monitor
Use one or both of the methods below to monitor CPU usage of the instance. You can try hammering it with a lot of tasks:

```
for n in {1..999} ; do echo iter ${n}: $(l0 task create ENVIRONMENT task${n} DEPLOY) ; done
```

(or services, or something similar). In any event, the CPU should respond accordingly, but - and here's the key - it should always return to consuming barely anything when idle.

If you were to deploy an API of v0.10.4-v0.10.8 and monitor it the same way, you would see the CPU usage floor of an idle API gradually increasing with time; it would grow faster the more AWS calls it had to make.
AWS Console
SSH and htop
```
ssh -i path/to/key-pair.pem ec2-user@${LAYER0_API_ENDPOINT} -o serveraliveinterval=30
sudo yum install -y htop
htop
```

Press `F4` and enter `l0` to filter out processes which aren't things like `l0-api` and `l0-runner`.
Some charts
Three-day CPU usage of Layer0 API in v0.10.8:
Three-day CPU usage of Layer0 API with these changes:
Checklist
I'm actually not sure how best to test this particular behavior, if there's a good way to test it at all.
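One small unit-style check that could cover the core property (a sketch under the assumption that the fix boils down to using a stoppable `time.NewTicker`; `checkTickerStops` is a hypothetical helper, not existing Layer0 code): stop a ticker and assert that its channel stays silent, which is exactly the guarantee `time.Tick` could never provide.

```go
package main

import (
	"fmt"
	"time"
)

// checkTickerStops verifies that a ticker created with time.NewTicker
// stops firing once Stop is called. (Hypothetical test helper.)
func checkTickerStops() error {
	rl := time.NewTicker(10 * time.Millisecond)
	rl.Stop() // stop before the first tick at 10ms can fire

	select {
	case <-rl.C:
		return fmt.Errorf("ticker fired after Stop")
	case <-time.After(50 * time.Millisecond):
		return nil // no tick arrived after Stop: expected
	}
}

func main() {
	if err := checkTickerStops(); err != nil {
		panic(err)
	}
	fmt.Println("ok: stopped ticker stayed silent")
}
```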
Issues
Closes #620.