jorgemarey / nomad-nova-autoscaler

MIT License

Unable to scale in due to mismatch between hostname and instance name #5

Closed BastienClement closed 3 months ago

BastienClement commented 3 months ago

Hi! Thank you for all the work done to build this plugin.

I'm trying to set up the autoscaler on Infomaniak Public Cloud to run batch workloads.

I got it working up to the point where jobs complete and the pool needs to scale in, and then it simply breaks. Here is a sample of the autoscaler log:

2024-07-13T11:02:41.302Z [INFO]  policy_eval.worker: scaling target: id=96c185b6-63cc-2a8d-94a8-7220e0a5027f policy_id=a71493ba-9f09-1d6b-4eea-3f7ffd226b9e queue=cluster target=os-nova from=19 to=2 reason="scaling down because factor is 0.079576" meta=map[nomad_policy_id:a71493ba-9f09-1d6b-4eea-3f7ffd226b9e]
2024-07-13T11:02:41.628Z [DEBUG] external_plugin.os-nova: performing node pool filtering: combined_identifier="node_class:batch and datacenter:infomaniak-dc3 and node_pool:batch" timestamp=2024-07-13T11:02:41.628Z
2024-07-13T11:02:41.638Z [DEBUG] external_plugin.os-nova: found node: draining=false eligibility=eligible node_pool=batch status=ready datacenter=infomaniak-dc3 node_class=batch node_id=063c0c50-b038-a041-1929-24de46c317ad timestamp=2024-07-13T11:02:41.637Z
2024-07-13T11:02:41.638Z [DEBUG] external_plugin.os-nova: found node: status=ready datacenter=infomaniak-dc3 node_class=batch node_pool=batch draining=false eligibility=eligible node_id=31bd84f4-a29c-a105-cf24-395751fd4a66 timestamp=2024-07-13T11:02:41.637Z
[...]
2024-07-13T11:02:41.639Z [DEBUG] external_plugin.os-nova: node passed filter criteria: node_id=063c0c50-b038-a041-1929-24de46c317ad timestamp=2024-07-13T11:02:41.638Z
2024-07-13T11:02:41.639Z [DEBUG] external_plugin.os-nova: node passed filter criteria: node_id=31bd84f4-a29c-a105-cf24-395751fd4a66 timestamp=2024-07-13T11:02:41.638Z
[...]
2024-07-13T11:02:41.644Z [DEBUG] external_plugin.os-nova: identified remote provider ID for node: node_id=063c0c50-b038-a041-1929-24de46c317ad remote_id=nomad-batch-ff9c8fc9-7dd4.dc3-a.pub1.infomaniak.cloud timestamp=2024-07-13T11:02:41.643Z
2024-07-13T11:02:41.649Z [DEBUG] external_plugin.os-nova: identified remote provider ID for node: node_id=31bd84f4-a29c-a105-cf24-395751fd4a66 remote_id=nomad-batch-2e25d4d4-b5ab.dc3-a.pub1.infomaniak.cloud timestamp=2024-07-13T11:02:41.649Z
[...]
2024-07-13T11:02:41.752Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=e8ef91d0-3cf6-d2bb-b706-51d958c65504 eval_token=7ce95066-412d-087c-374c-7a3b77c930fc id=96c185b6-63cc-2a8d-94a8-7220e0a5027f policy_id=a71493ba-9f09-1d6b-4eea-3f7ffd226b9e queue=cluster error="failed to scale target: rpc error: code = Unknown desc = failed to perform scaling action: failed to perform pre-scale Nomad scale in tasks: no nodes identified for scaling in action"

What seems to happen is that on Infomaniak's cloud, the unique.platform.aws.hostname attribute is a full hostname (like nomad-batch-e0a3f0ee-5418.dc3-a.pub1.infomaniak.cloud) rather than simply the instance name from OpenStack.

Then, RunPreScaleInTasksWithRemoteCheck gets confused because id.RemoteResourceId is the full hostname while remoteId is the instance name. Everything is filtered out, and nothing is left to scale in.
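Roughly, the scale-in filtering boils down to intersecting the two sets of IDs, so when one side carries full hostnames and the other bare instance names, nothing survives. A minimal sketch of the effect (not the actual plugin or scaleutils code; values taken from the log above):

package main

import "fmt"

func main() {
	// id.RemoteResourceId values resolved from the node attribute (full hostnames)
	fromNodeAttr := []string{
		"nomad-batch-ff9c8fc9-7dd4.dc3-a.pub1.infomaniak.cloud",
		"nomad-batch-2e25d4d4-b5ab.dc3-a.pub1.infomaniak.cloud",
	}
	// remoteId values derived from the Nova server list (bare instance names)
	fromNova := map[string]bool{
		"nomad-batch-ff9c8fc9-7dd4": true,
		"nomad-batch-2e25d4d4-b5ab": true,
	}

	var candidates []string
	for _, id := range fromNodeAttr {
		if fromNova[id] { // never true: full hostname != instance name
			candidates = append(candidates, id)
		}
	}
	fmt.Println(len(candidates)) // 0 -> "no nodes identified for scaling in action"
}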

I only have the following attributes to work with:

Attribute                              Value
unique.hostname                        nomad-batch-e0a3f0ee-5418
unique.network.ip-address              10.0.0.144
unique.platform.aws.hostname           nomad-batch-e0a3f0ee-5418.dc3-a.pub1.infomaniak.cloud
unique.platform.aws.instance-id        i-001787c3 (doesn't seem to match anything from openstack server show ...)
unique.platform.aws.public-hostname    nomad-batch-e0a3f0ee-5418.dc3-a.pub1.infomaniak.cloud

Notably, nothing has the instance ID.

I tried setting id_attribute = "unique.hostname", since that one is indeed the instance name. But setting any id_attribute also sets t.idMapper = true, which then expects the attribute to hold the instance ID rather than its name. 🙃

For completeness' sake, here is the plugin configuration:

# autoscaler config
target "os-nova" {
  driver = "os-nova"
  config = {
    auth_url     = "https://api.pub1.infomaniak.cloud/identity"
    username     = "..."
    password     = "..."
    domain_name  = "Default"
    project_id   = "..."
    project_name = "..."
    region_name  = "dc3-a"
    id_attribute = "unique.hostname"
  }
}

# policy
target "os-nova" {
  dry-run = false

  evenly_split_azs    = true
  stop_first          = true
  image_name          = "Fedora Core OS 40"
  flavor_name         = "a4-ram16-disk20-perf1"
  pool_name           = "nomad-autoscaler-batch"
  name_prefix         = "nomad-batch-"
  network_id          = "..."
  user_data_template  = "local/user_data"

  datacenter                    = "infomaniak-dc3"
  node_class                    = "batch"
  node_pool                     = "batch"
  node_drain_deadline           = "1h"
  node_drain_ignore_system_jobs = false
  node_purge                    = true
  node_selector_strategy        = "empty_ignore_system"
}

I don't think this issue can be solved with configuration changes alone. I'm open to submitting a pull request, but what would be the preferred way to tackle this?

Thanks

jorgemarey commented 3 months ago

Hi @BastienClement, thanks for reporting this (and for the complete explanation, it's really helpful). I'm sorry you ran into this issue. I see the problem: I assumed unique.platform.aws.hostname would always hold the name of the instance, but I guess every OpenStack installation is different. I'll make the necessary changes to fix this in the next few days.

I'll add a name_attribute option that defaults to unique.platform.aws.hostname to maintain compatibility, but that can be set in the configuration to whatever attribute you need. If id_attribute is set, it will take priority over name_attribute (also for compatibility).
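With that change, your target configuration would look roughly like this (a sketch of the proposed option, not yet released, so the exact key name may still change):

target "os-nova" {
  driver = "os-nova"
  config = {
    # ...existing auth settings...
    name_attribute = "unique.hostname"   # attribute holding the Nova instance name
  }
}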

In the meantime, you could work around this by exposing the instance ID as a client meta attribute (we do that in our Nomad cluster):

client {
  enabled = true
.....
  meta {
      instance_id = "XXXXXXXXXX"
  }
}

We get that value from the instance metadata endpoint (http://169.254.169.254/openstack/latest/meta_data.json), from the uuid field. A process that runs on the instance after cloud-init reads it and writes it into the Nomad configuration. Then you can set id_attribute = "meta.instance_id" and that should work.
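As an illustration of that post-cloud-init step, something along these lines works (a sketch, not what we run verbatim; the output path is just an example):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// query the OpenStack metadata service for this instance's metadata
	resp, err := http.Get("http://169.254.169.254/openstack/latest/meta_data.json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var md struct {
		UUID string `json:"uuid"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&md); err != nil {
		panic(err)
	}

	// render a small Nomad config fragment exposing the ID as a meta attribute
	fragment := fmt.Sprintf("client {\n  meta {\n    instance_id = %q\n  }\n}\n", md.UUID)
	if err := os.WriteFile("/etc/nomad.d/instance-id.hcl", []byte(fragment), 0o644); err != nil {
		panic(err)
	}
}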

I'll update this issue once a fix is released for this. Thanks for trying this plugin!

BastienClement commented 3 months ago

Thanks for the very quick reply. I actually came to the same idea later yesterday.

I've deployed a custom build of the plugin with the following changes, and it works like a charm (after realizing that the documentation about the autoscaler and ACLs is rather lackluster and that you obviously need node = write to scale in 😛).

diff --git a/plugin/openstack.go b/plugin/openstack.go
index 53c4fb5..53bcc7e 100644
--- a/plugin/openstack.go
+++ b/plugin/openstack.go
@@ -742,8 +742,12 @@ func (t *TargetPlugin) getInstancePortID(id string) (string, error) {

 // osNovaNodeIDMapBuilder is used to identify the Opensack Nova ID of a Nomad node using
 // the relevant attribute value.
-func osNovaNodeIDMapBuilder(property string) scaleutils.ClusterNodeIDLookupFunc {
+func osNovaNodeIDMapBuilder(config map[string]string) scaleutils.ClusterNodeIDLookupFunc {
        var isMeta bool
+       property := config[configKeyNodeIDAttr]
+       if property == "" {
+               property = config[configKeyNodeNameAttr]
+       }
        if property == "" {
                property = "unique.platform.aws.hostname"
        }
diff --git a/plugin/plugin.go b/plugin/plugin.go
index 55c8f88..c2c4b7b 100644
--- a/plugin/plugin.go
+++ b/plugin/plugin.go
@@ -29,7 +29,8 @@ const (
        configKeyCACertFile  = "cacert_file"
        configKeyInsecure    = "insecure_skip_verify"

-       configKeyNodeIDAttr = "id_attribute"
+       configKeyNodeIDAttr   = "id_attribute"
+       configKeyNodeNameAttr = "name_attribute"

        configKeyName           = "name"
        configKeyNamePrefix     = "name_prefix"
@@ -120,7 +121,7 @@ func (t *TargetPlugin) SetConfig(config map[string]string) error {

        // Store and set the remote ID callback function.
        t.clusterUtils = clusterUtils
-       t.clusterUtils.ClusterNodeIDLookupFunc = osNovaNodeIDMapBuilder(config[configKeyNodeIDAttr])
+       t.clusterUtils.ClusterNodeIDLookupFunc = osNovaNodeIDMapBuilder(config)
        t.idMapper = config[configKeyNodeIDAttr] != ""

        return nil

I'll try building and deploying from your branch instead, but I expect similar results since the code is so similar. Stay tuned.