hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.
Mozilla Public License 2.0

allocatable_memory wrong #910

Open hsmade opened 5 months ago

hsmade commented 5 months ago
internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=2924 allocated_memory=5460 allocatable_cpu=35200 allocatable_memory=15608

Meanwhile I have 3 VMs in my ASG, each with 2 GiB of memory. Nomad does report a lot of clients in the 'ready' state, but those instances no longer exist and are marked ineligible.

# nomad node-status
ID        Node Pool  DC            Name                                 Class            Drain  Eligibility  Status
54c1d969  default    eu-central-1  worker-services-i-0b39e96bb16ae1c7d  worker-services  false  ineligible   ready
99a91f0c  default    eu-central-1  worker-services-i-0d3bac19ef5aa1d1d  worker-services  false  ineligible   ready
4c394a12  default    eu-central-1  worker-services-i-0582f925f917ec73e  worker-services  false  ineligible   ready
284cc4be  default    eu-central-1  worker-services-i-0dfe7da7ebadd971e  worker-services  false  ineligible   ready
d90e705b  default    eu-central-1  worker-services-i-0a7a4b92ab8d432f7  worker-services  false  eligible     ready
e57184ee  default    eu-central-1  worker-services-i-01e9192ae978c7921  worker-services  false  ineligible   down
1e1c3d34  default    eu-central-1  worker-services-i-03cb071c0352926fe  worker-services  false  ineligible   down
db380668  default    eu-central-1  worker-services-i-007763c53c9ad360c  worker-services  false  eligible     ready
25860043  default    eu-central-1  worker-services-i-0d62b5c0d3689f418  worker-services  false  ineligible   down
d91509f1  default    eu-central-1  worker-services-i-09962f55c4c6d43d7  worker-services  false  ineligible   ready
8e9906f3  default    eu-central-1  worker-services-i-002097b734ca1e620  worker-services  false  ineligible   down
6f615722  default    eu-central-1  worker-services-i-030899c1a2dc0cada  worker-services  false  ineligible   down
24e0a7dc  default    eu-central-1  worker-services-i-0d5f1e86c467bfb34  worker-services  false  ineligible   down
f1af8cee  default    eu-central-1  worker-services-i-02f8f7ce032068ce2  worker-services  false  eligible     ready
ca0f312f  default    eu-central-1  worker-services-i-0df38a7f1dae1914a  worker-services  false  ineligible   down
# nomad node-status 54c1d969

error fetching node stats: Unexpected response code: 404 (No path to node)
ID              = 54c1d969-1409-a1af-bbe8-4af4b67deb3d
Name            = worker-services-i-0b39e96bb16ae1c7d
Node Pool       = default
Class           = worker-services
DC              = eu-central-1
Drain           = false
Eligibility     = ineligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

Node Events
Time                  Subsystem  Message
2024-05-31T14:03:38Z  Drain      Node drain complete
2024-05-31T14:03:36Z  Drain      Node drain complete
2024-05-31T14:03:36Z  Drain      Node drain strategy set
2024-05-31T14:03:19Z  Drain      Node drain complete
2024-05-31T14:03:19Z  Drain      Node drain strategy set
2024-05-31T13:59:33Z  Cluster    Node registered

Allocated Resources
CPU         Memory       Disk
0/4400 MHz  0 B/1.9 GiB  0 B/15 GiB

Allocation Resource Utilization
CPU         Memory
0/4400 MHz  0 B/1.9 GiB

error fetching node stats: actual resource usage not present

Because of this inflated allocatable figure, the autoscaler keeps trying to scale in until it hits my min limit.
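The effect can be sketched with the numbers from the log line above: allocated memory is 5460 MB, the reported allocatable is 15608 MB, while the three eligible 2 GiB VMs actually provide roughly 6144 MB. The actual allocatable figure here is an assumption for illustration; only the logged values come from the report.

```go
package main

import "fmt"

// utilization returns the allocated/allocatable ratio the APM strategy
// would base its scaling decision on.
func utilization(allocated, allocatable float64) float64 {
	return allocated / allocatable
}

func main() {
	allocated := 5460.0 // from the autoscaler log
	reported := 15608.0 // includes ineligible "ready" nodes (from the log)
	actual := 6144.0    // assumed: 3 eligible VMs x 2048 MB each
	fmt.Printf("reported utilization: %.0f%%\n", utilization(allocated, reported)*100)
	fmt.Printf("actual utilization:   %.0f%%\n", utilization(allocated, actual)*100)
}
```

With the ineligible nodes counted, utilization looks like roughly 35% instead of roughly 89%, which is why the autoscaler concludes there is room to scale in further.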

hsmade commented 2 months ago

might be related to https://github.com/hashicorp/nomad/issues/13549

jrasell commented 1 month ago

Hi @hsmade, and thanks for raising this issue. It looks like we do not check a node's eligibility when filtering the nodes the Nomad APM plugin uses to calculate resource totals. I think adding a conditional there, to ensure the node is eligible, would resolve this problem.
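A minimal sketch of the suggested conditional, assuming a stub `Node` type standing in for the Nomad API's node stub (the real struct does expose a `SchedulingEligibility` field, reported as "eligible" or "ineligible"); this is not the autoscaler's actual filtering code:

```go
package main

import "fmt"

// Node is a stub for illustration only; field names mirror the Nomad API's
// node stub but this is not the real type.
type Node struct {
	Name                  string
	SchedulingEligibility string
	Status                string
}

// filterEligible keeps only nodes that are both ready and eligible for
// scheduling, so drained-but-still-"ready" nodes do not inflate the
// allocatable resource totals.
func filterEligible(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if n.Status == "ready" && n.SchedulingEligibility == "eligible" {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{"worker-1", "eligible", "ready"},
		{"worker-2", "ineligible", "ready"}, // drained; should be excluded
		{"worker-3", "ineligible", "down"},
	}
	fmt.Println(len(filterEligible(nodes))) // prints 1
}
```

Filtering on both status and eligibility matches the symptom in the report: the drained nodes still show `Status = ready`, so a status-only check would keep counting them.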