hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

rpc error: rpc: can't find service Namespace.ListNamespaces on non authoritative clusters #14105

Closed: Alagroc closed this issue 2 years ago

Alagroc commented 2 years ago

Nomad version

Nomad v1.2.5 (06d912a20ba029c7f4d6b371cd07594cba3ae3cd)

Operating system and Environment details

Debian GNU/Linux 10 (buster)

Issue

The error rpc error: rpc: can't find service Namespace.ListNamespaces appears when authoritative_region is specified on the non-authoritative Nomad servers.

Reproduction steps

1 - Create an authoritative Nomad cluster, e.g. staging. Its nomad.hcl:

[...]
region      = "staging"
datacenter  = "aws"
server {
  enabled          = true
  authoritative_region = "staging"
[...]

2 - Create a non-authoritative Nomad cluster under the first one (i.e. a sub-environment) and use the authoritative_region parameter to point it at the authoritative environment, e.g. create staging-b1 and point it at the staging region. Its nomad.hcl:

[...]
region      = "staging-b1"
datacenter  = "aws"
server {
  enabled          = true
  authoritative_region = "staging"
[...]

3 - The cluster works, but this keeps appearing in the Nomad logs:

"failed to fetch namespaces from authoritative region: error="rpc error: rpc: can't find service Namespace.ListNamespaces"

Expected Result

No rpc errors on the logs

Actual Result

RPC errors on the logs

Job file (if appropriate)

Nomad Server logs (if appropriate)

E.g. we have the staging cluster as authoritative (formed by a couple of Nomad servers), then staging-b1 and staging-b2 as non-authoritative sub-environments. Each sub-environment consists of a single Nomad server. This is part of the output from staging-b1:

Aug 11 20:35:15 nom-srv-stagingb1-01 nomad[976]:     2022-08-11T20:35:15.978Z [INFO]  nomad.raft: election won: tally=1
Aug 11 20:35:15 nom-srv-stagingb1-01 nomad[976]:     2022-08-11T20:35:15.978Z [INFO]  nomad.raft: entering leader state: leader="Node at xxxx:xxx [Leader]"
Aug 11 20:35:15 nom-srv-stagingb1-01 nomad[976]:     2022-08-11T20:35:15.978Z [INFO]  nomad: cluster leadership acquired
Aug 11 20:35:17 nom-srv-stagingb1-01 nomad[976]:     2022-08-11T20:35:17.169Z [ERROR] nomad: skipping adding Raft peer because an existing peer is in bootstrap mode and only one server should be in bootstrap mode: existing_peer=nom-srv-stagingb2-01.[...] joining_peer=nom-srv-stagingb1-01.[...]
Aug 11 20:35:17 nom-srv-stagingb1-01 nomad[976]:     2022-08-11T20:35:17.190Z [ERROR] nomad: failed to fetch namespaces from authoritative region: error="rpc error: rpc: can't find service Namespace.ListNamespaces"
[same RPC error a few times a minute]

Nomad Client logs (if appropriate)

no related errors on clients

jrasell commented 2 years ago

Hi @Alagroc. Could you please confirm whether ACLs are enabled and bootstrapped on both the authoritative and federated clusters, with the ACL replication token set on the federated cluster servers?

Could you also please include additional logs from the servers in the authoritative region? The current snippet of logs makes it hard to identify what exactly is happening.
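
For reference, a minimal sketch of the acl stanza a federated (non-authoritative) server would typically carry, assuming ACLs have already been bootstrapped in the authoritative region; the token value below is a placeholder, not taken from this issue:

acl {
  enabled           = true
  # Replication token issued in the authoritative region (placeholder value)
  replication_token = "REPLACE-WITH-REPLICATION-TOKEN"
}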

Alagroc commented 2 years ago

Hi @jrasell, here are the Nomad config file for the authoritative server and its log. I just noticed the authoritative server is quite outdated (version 0.12.3 instead of 1.2.5; this detail slipped through).

Config:

data_dir    = "/opt/nomad"

addresses {
  http  = [redacted]
  rpc   = [redacted]
  serf  = [redacted]
}

acl {
  enabled           = true
  replication_token = [redacted]
}

ports {
  http = [redacted]
}

region      = "staging"
datacenter  = "aws"

tls {
  http = true
  rpc  = true

  ca_file   = [redacted]
  cert_file = [redacted]
  key_file  = [redacted]
}

server {
  enabled          = true
  authoritative_region = "staging"

  # minimum time in a terminal state before garbage collection
  job_gc_threshold = "48h"
  eval_gc_threshold = "48h"

  bootstrap_expect = 3
  server_join {
    retry_max       = 3
    retry_interval  = "15s"
    retry_join      = ["XXX.XXX.XXX.XXX", "XXX.XXX.XXX.XXX", "XXX.XXX.XXX.XXX"]
  }
}

Log content:

Aug 11 20:35:04 nom-srv-master-staging-01 nomad[5744]:     2022-08-11T20:35:04.352Z [ERROR] nomad.rpc: RPC error: error="rpc: can't find service Namespace.ListNamespaces" connection="&{[redacted] {{0 0 <nil>}} {{0 0 <nil>}}}"
Aug 11 20:35:15 nom-srv-master-staging-01 nomad[5744]:     2022-08-11T20:35:15.051Z [INFO]  nomad: serf: EventMemberUpdate: nom-srv-stagingb1-01.[fqdn]
Aug 11 20:35:18 nom-srv-master-staging-01 nomad[5744]:     2022-08-11T20:35:18.142Z [ERROR] http: request failed: method=GET path=/v1/namespaces?region=stagingb1 error="Nomad Enterprise only endpoint" code=501
Aug 11 20:35:19 nom-srv-master-staging-01 nomad[5744]:     2022-08-11T20:35:19.555Z [ERROR] http: request failed: method=GET path=/v1/status/leader?region=stagingb1 error="rpc error: stream closed" code=500
Aug 11 20:35:19 nom-srv-master-staging-01 nomad[5744]:     2022-08-11T20:35:19.621Z [ERROR] http: request failed: method=GET path=/v1/namespaces?region=stagingb1 error="Nomad Enterprise only endpoint" code=501
Aug 11 20:35:42 nom-srv-master-staging-01 nomad[5744]:     2022-08-11T20:35:42.374Z [ERROR] nomad.rpc: RPC error: error="rpc: can't find service Namespace.ListNamespaces" connection="&{[redacted] {{0 0 <nil>}} {{0 0 <nil>}}}"

jrasell commented 2 years ago

Hi @Alagroc and thanks for the additional information.

Nomad namespaces were originally an Enterprise feature and were open sourced in v1.0.0. This is why your setup is not currently functioning as expected, as can be seen in the following log line:

Aug 11 20:35:19 nom-srv-master-staging-01 nomad[5744]:     2022-08-11T20:35:19.621Z [ERROR] http: request failed: method=GET path=/v1/namespaces?region=stagingb1 error="Nomad Enterprise only endpoint" code=501
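
As a quick check (a sketch, assuming the default HTTP port 4646 and a placeholder hostname), querying the namespaces endpoint directly on the authoritative server should show the same behaviour: a pre-1.0 open source server answers with the 501 "Nomad Enterprise only endpoint" error, while a v1.0.0+ server returns the namespace list.

curl -s https://<authoritative-server>:4646/v1/namespaces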

I would therefore suggest upgrading your authoritative region to v1.2.5 so that it matches the federated cluster version and includes the namespace OSS code.
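
One way to confirm the versions side by side after the upgrade (a sketch; -region is a standard Nomad CLI flag and the region names are the ones used in this issue) is to compare the Build column of the server members output, which shows each server's Nomad version:

nomad server members -region=staging
nomad server members -region=staging-b1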

I will close this issue as I believe the version mismatch is the source of the problem. If you have further problems, please do not hesitate to reopen this issue, or raise a new one.

Alagroc commented 2 years ago

thanks !

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.