Closed commarla closed 4 years ago
After my manual scaling out to fix the issue I don't see this log anymore : failed to collect Nomad metrics, skipping Nomad based checks
@commarla thanks for the report; i'll take a look this afternoon.
I have tried to reproduce this locally without success, so will keep trying over the next day or so. Using the nomad meta policy engine the example job was scaled down to its minimum allowed count of 1:
Allocations
ID Node ID Task Group Version Desired Status Created Modified
43cd970f 3d79e0f6 cache 3 run running 5m8s ago 2m19s ago
75fd797a 3d79e0f6 cache 2 stop complete 6m20s ago 3m19s ago
c41c66d8 3d79e0f6 cache 1 stop complete 16m33s ago 5m7s ago
e9d80206 3d79e0f6 cache 0 stop complete 16m33s ago 14m46s ago
Under minimal load Sherpa wanted to scale-in, but was prevented by the scaling policy:
5:17PM INF Nomad resource utilisation calculation cpu-value-percentage=10.466054549222797 group=cache job=example2 mem-value-percentage=0
5:17PM DBG scaling evaluation completed, handling scaling request based on Nomad checks job=example2
5:17PM DBG added group scaling request group=cache job=example2 scaling-req={"count":1,"direction":"in","group":"cache"}
5:17PM DBG scaling action will break job group minimum threshold group=cache job=example2
5:17PM ERR failed to trigger scaling of job error="scaling action will break job group minimum threshold" job=example2
I then loaded the service and triggered a successful scaling out event:
5:23PM DBG scaling evaluation completed, handling scaling request based on Nomad checks job=example2
5:23PM DBG added group scaling request group=cache job=example2 scaling-req={"count":1,"direction":"out","group":"cache"}
5:23PM DBG meta watcher last index has changed new=83 old=72
5:23PM DBG job modify index has changed is greater than last recorded job=example2 new=83 old=72
5:23PM DBG received job list update message to handle
5:23PM DBG reading job group meta stanzas job=example2
5:23PM DBG meta watcher last index has changed new=85 old=83
5:23PM INF successfully triggered autoscaling of job evaluation-id=e0155fc9-7414-11c9-67ac-39232bcd8992 id=dc7e8685-3bd0-410d-ac61-9a4b2f1765cc job=example2
From the description you gave, it is almost as if the meta engine incorrectly removed the policy during the last scaling event. Does the job contain multiple groups or just a single one? Are there any more debug logs from the times you were having issues that I could look at?
Thanks @jrasell There are 2 groups and I don't see anything else in the logs.
I also already got this issue with other jobs but didn't took time to fill an issue. All my jobs with sherpa meta have at least 2 task groups.
Is there a way to check if sherpa needs more threads or some tuning ?
Hi,
It is live. My task group scales in during the night and this morning every alloc is at 100%+ CPU and nothing happens.
there are 3 allocs from one task group without sherpa meta (suffix -app
) and one with sherpa (-spot
suffix).
and sherpa keeps logging that
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","time":"2020-01-03T07:57:21.373291703Z","message":"triggering autoscaling job evaluation"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"error","job":"my-service","error":"no allocations found to match task group with scaling policy","time":"2020-01-03T07:57:21.384312224Z","message":"failed to collect Nomad metrics, skipping Nomad based checks"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","group":"my-service-service-spot","time":"2020-01-03T07:57:21.384323537Z","message":"triggering autoscaling job group evaluation"}
Jan 03 08:57:21 admin-10-32-152-182 docker[29652]: {"level":"info","job":"my-service","time":"2020-01-03T07:57:21.384358656Z","message":"scaling evaluation completed and no scaling required"}
The CPU never goes under 1,3GHz which is 80% of allocated CPU (1600MHz)
If you want I can send you privately the whole log file without redacting it.
@commarla if you can use james@rasell.systems that would be grand. I'll do more testing today and try and track this down.
Would you be also able to send me what the Nomad API for /v1/job/ + jobID + /allocations
shows in regards to the job that is currently going through this problem?
Describe the bug Sometimes when one of my task group is scaled in to 1 allocation, sherpa seems to forgot the policies and doesn't scale anymore.
To reproduce I am using nomad meta.
The job is scaling in to 1 then never scale out
I just need to increase manually the count of the task group from 1 to 2 and sherpa continue to scale out normally.
In this example the service was stuck for 15+ days. In 0.3.0 I have the following logs:
After upgrading to 0.4.0
Expected behavior A scaling out event.
Environment:
sherpa system info
):sherpa --version
):