jrasell / sherpa

Sherpa is a highly available, fast, and flexible horizontal job scaling tool for HashiCorp Nomad. It is capable of running in a number of different modes to suit different requirements, and can scale based on Nomad resource metrics or external sources.

Scaling not working after scaling in to 1 #120

Closed commarla closed 4 years ago

commarla commented 4 years ago

Describe the bug Sometimes when one of my task groups is scaled in to 1 allocation, Sherpa seems to forget the policies and doesn't scale anymore.

To reproduce I am using Nomad meta with the following keys:

sherpa_cooldown:"180"
sherpa_enabled:"1"
sherpa_max_count:"5"
sherpa_min_count:"1"
sherpa_scale_in_count:"1"
sherpa_scale_in_cpu_percentage_threshold:"30"
sherpa_scale_in_memory_percentage_threshold:"30"
sherpa_scale_out_count:"1"
sherpa_scale_out_cpu_percentage_threshold:"70"
sherpa_scale_out_memory_percentage_threshold:"70"

The job scales in to 1 and then never scales out.

Screenshot 2020-01-02 15 09 53

I just need to manually increase the count of the task group from 1 to 2, and Sherpa then continues to scale out normally.

Screenshot 2020-01-02 15 13 14

In this example the service was stuck for 15+ days. In 0.3.0 I have the following logs:

Jan 02 14:37:56 admin-10-32-152-182 docker[7065]: {"level":"debug","job":"my-service","time":"2020-01-02T13:37:56.421144110Z","message":"no task groups in job have scaling policies enabled"}
Jan 02 14:38:56 admin-10-32-152-182 docker[7065]: {"level":"debug","job":"my-service","time":"2020-01-02T13:38:56.422893887Z","message":"no task groups in job have scaling policies enabled"}

After upgrading to 0.4.0:

Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","time":"2020-01-02T13:57:21.298110900Z","message":"triggering autoscaling job evaluation"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"error","job":"my-service","error":"no allocations found to match task group with scaling policy","time":"2020-01-02T13:57:21.306032109Z","message":"failed to collect Nomad metrics, skipping Nomad based checks"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","group":"my-service-service-spot","time":"2020-01-02T13:57:21.306043091Z","message":"triggering autoscaling job group evaluation"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"info","job":"my-service","time":"2020-01-02T13:57:21.306154810Z","message":"scaling evaluation completed and no scaling required"}

Expected behavior A scaling out event.

Environment:

commarla commented 4 years ago

After my manual scale-out to fix the issue, I don't see this log anymore: failed to collect Nomad metrics, skipping Nomad based checks

jrasell commented 4 years ago

@commarla thanks for the report; I'll take a look this afternoon.

jrasell commented 4 years ago

I have tried to reproduce this locally without success, so I will keep trying over the next day or so. Using the Nomad meta policy engine, the example job was scaled down to its minimum allowed count of 1:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
43cd970f  3d79e0f6  cache       3        run      running   5m8s ago    2m19s ago
75fd797a  3d79e0f6  cache       2        stop     complete  6m20s ago   3m19s ago
c41c66d8  3d79e0f6  cache       1        stop     complete  16m33s ago  5m7s ago
e9d80206  3d79e0f6  cache       0        stop     complete  16m33s ago  14m46s ago

Under minimal load Sherpa wanted to scale-in, but was prevented by the scaling policy:

5:17PM INF Nomad resource utilisation calculation cpu-value-percentage=10.466054549222797 group=cache job=example2 mem-value-percentage=0
5:17PM DBG scaling evaluation completed, handling scaling request based on Nomad checks job=example2
5:17PM DBG added group scaling request group=cache job=example2 scaling-req={"count":1,"direction":"in","group":"cache"}
5:17PM DBG scaling action will break job group minimum threshold group=cache job=example2
5:17PM ERR failed to trigger scaling of job error="scaling action will break job group minimum threshold" job=example2

I then placed load on the service and triggered a successful scaling out event:

5:23PM DBG scaling evaluation completed, handling scaling request based on Nomad checks job=example2
5:23PM DBG added group scaling request group=cache job=example2 scaling-req={"count":1,"direction":"out","group":"cache"}
5:23PM DBG meta watcher last index has changed new=83 old=72
5:23PM DBG job modify index has changed is greater than last recorded job=example2 new=83 old=72
5:23PM DBG received job list update message to handle
5:23PM DBG reading job group meta stanzas job=example2
5:23PM DBG meta watcher last index has changed new=85 old=83
5:23PM INF successfully triggered autoscaling of job evaluation-id=e0155fc9-7414-11c9-67ac-39232bcd8992 id=dc7e8685-3bd0-410d-ac61-9a4b2f1765cc job=example2

From the description you gave, it is almost as if the meta engine incorrectly removed the policy during the last scaling event. Does the job contain multiple groups or just a single one? Are there any more debug logs from the times you were having issues that I could look at?
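In the meantime, a quick way to check whether the sherpa_* keys are still attached to the groups at the moment the problem occurs would be a small sketch like the one below, using the official Nomad Go API client. This is only an independent check against the Nomad API, not Sherpa's own code path, and the job ID is a placeholder:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Build a client from the standard NOMAD_ADDR/NOMAD_TOKEN environment variables.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "my-service" is a placeholder for the affected job ID.
	job, _, err := client.Jobs().Info("my-service", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Print each task group's meta stanza so the sherpa_* policy keys
	// can be checked while the problem is occurring.
	for _, tg := range job.TaskGroups {
		fmt.Printf("group %s meta: %v\n", *tg.Name, tg.Meta)
	}
}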

commarla commented 4 years ago

Thanks @jrasell. There are 2 groups and I don't see anything else in the logs.

I have also already hit this issue with other jobs but didn't take the time to file an issue. All my jobs with Sherpa meta have at least 2 task groups.

Is there a way to check if Sherpa needs more threads or some tuning?

commarla commented 4 years ago

Hi,

It is happening live again. My task group scaled in during the night, and this morning every alloc is at 100%+ CPU and nothing happens.

Screenshot 2020-01-03 08 55 17

There are 3 allocs from one task group without Sherpa meta (-app suffix) and one from the task group with Sherpa meta (-spot suffix).

and Sherpa keeps logging this:

Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","time":"2020-01-03T07:57:21.373291703Z","message":"triggering autoscaling job evaluation"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"error","job":"my-service","error":"no allocations found to match task group with scaling policy","time":"2020-01-03T07:57:21.384312224Z","message":"failed to collect Nomad metrics, skipping Nomad based checks"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","group":"my-service-service-spot","time":"2020-01-03T07:57:21.384323537Z","message":"triggering autoscaling job group evaluation"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"info","job":"my-service","time":"2020-01-03T07:57:21.384358656Z","message":"scaling evaluation completed and no scaling required"}

The CPU never goes under 1.3 GHz, which is about 80% of the allocated CPU (1600 MHz).

image

If you want, I can send you the whole log file privately without redacting it.

jrasell commented 4 years ago

@commarla if you can use james@rasell.systems that would be grand. I'll do more testing today and try and track this down.

Would you also be able to send me what the Nomad API /v1/job/<jobID>/allocations endpoint shows for the job that is currently going through this problem?
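For reference, a sketch along these lines (again using the official Nomad Go API client; the job ID is a placeholder) should return the same data as that HTTP endpoint, in case that's easier than curl:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// This calls GET /v1/job/<jobID>/allocations under the hood; "my-service"
	// is a placeholder for the affected job ID, and the second argument
	// requests all allocations, including stopped ones.
	allocs, _, err := client.Jobs().Allocations("my-service", true, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Print each allocation's task group and statuses so they can be
	// compared against the group that carries the scaling policy.
	for _, a := range allocs {
		fmt.Printf("%s  group=%s  desired=%s  client=%s\n",
			a.ID, a.TaskGroup, a.DesiredStatus, a.ClientStatus)
	}
}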