hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

SPIFFE ID is not in the expected format #15668

Closed · ferhatvurucu closed this issue 1 year ago

ferhatvurucu commented 1 year ago

Hi,

I am upgrading my Consul servers from 1.13.3 to 1.14.2 and facing an issue with the grpc_tls configuration. The port configuration has been changed as shown below; however, I still see error logs from agent.cache and agent.server.cert-manager. We were already running with gRPC TLS disabled, and with the new version we added the "grpc_tls": -1 setting.

I don't see any errors when I disable the connect feature in the configuration file.

Configuration

"ports": {
    "grpc": 8502,
    "grpc_tls": -1
  }

Journalctl logs

[WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="rpc error making call: SPIFFE ID is not in the expected format: spiffe://xxx.consul/agent/server/dc/dc1" index=0
[ERROR] agent.server.cert-manager: failed to handle cache update event: error="leaf cert watch returned an error: rpc error making call: SPIFFE ID is not in the expected format: spiffe://xxx.consul/agent/server/dc/dc1"

Consul info for Server

build:
    prerelease =
    revision = 0ba7a401
    version = 1.14.2
    version_metadata =
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = x.x.x.x:8300
    server = true

. . .

serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 12
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1059145
    members = 10
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 14555
    members = 4
    query_queue = 0
    query_time = 1

Operating system and Environment details

Ubuntu 22.04

jkirschner-hashicorp commented 1 year ago

Hi @ferhatvurucu,

I have a few follow-up questions that may help reveal what's happening here:

ferhatvurucu commented 1 year ago

Hi @jkirschner-hashicorp,

Thanks for the quick reply. It's just for service discovery at the moment. There is no auto-encrypt or auto-config. You may find the server agent configuration below.

{
  "advertise_addr": "x.x.x.x",
  "bind_addr": "x.x.x.x",
  "bootstrap_expect": 3,
  "client_addr": "0.0.0.0",
  "datacenter": "dc1",
  "node_name": "xxxx",
  "retry_join": [
    "provider=aws region=eu-west-1 tag_key=ServiceType tag_value=consul-server"
  ],
  "server": true,
  "encrypt": "xxxx",
  "autopilot": {
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "max_trailing_logs": 250,
    "server_stabilization_time": "10s",
    "redundancy_zone_tag": "az",
    "disable_upgrade_migration": false,
    "upgrade_version_tag": ""
  },
  "ports": {
    "grpc": 8502,
    "grpc_tls": -1
  },
  "connect": {
    "enabled": true
  },
  "ui": true
}

jkirschner-hashicorp commented 1 year ago

My understanding is that you'd only need the grpc ports for:

  1. Consul client agent / dataplane proxy communication with Envoy proxies
  2. Cluster peering communication between Consul server agents in different datacenters

If you're only using Consul for service discovery, (1) shouldn't apply to you. Do you have a multi-datacenter Consul deployment? If so, do you know if it's using WAN federation or cluster peering to connect the multiple datacenters?

It's also possible that the grpc port isn't needed at all. Was there a set of docs / tutorials you followed that suggested you might need that port? I'm wondering if there's a small docs improvement to be made here.
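If the grpc port does turn out to be unneeded, it can be disabled the same way as grpc_tls: Consul treats -1 as "disable this port". A sketch of what that ports stanza might look like (an illustration, not a recommendation for your specific setup):

  "ports": {
    "grpc": -1,
    "grpc_tls": -1
  }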

Was your ports config, as of 1.13.3, set like the snippet below?

  "ports": {
    "grpc": 8502
  },
jkirschner-hashicorp commented 1 year ago

Leaving some breadcrumbs for the future based on some initial digging into the code:

When tracking down what generates the SPIFFE ID related error message, I found that it attempts to match SPIFFE IDs against these regexes: https://github.com/hashicorp/consul/blob/c046d1a4d870639227baff629ff304a1b72deede/agent/connect/uri.go#L23-L30

Per your error message, the SPIFFE ID being matched against is: xxx.consul/agent/server/dc/dc1.

That SPIFFE ID is closest to the format of spiffeIDServerRegexp, but fails to match because it doesn't start exactly with /agent ... it instead has xxx.consul in front.


I have no particular experience with this area of the codebase, so I'm not sure what would cause a SPIFFE ID of xxx.consul/agent/server/dc/dc1 to be generated (and whether that's expected behavior). The above is just what I found digging through the code where that error message seems to be generated.

jkirschner-hashicorp commented 1 year ago

I've since seen indication that a SPIFFE ID in the form xxx.consul/agent/server/dc/dc1 is normal, so it's probable that my comments above are based on a misreading of the relevant code. I'll still leave the comments there in case they are relevant for future readers / investigation.
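
For future readers, here is a minimal standalone sketch in Go of the parse-then-match flow, with the server regex copied from the linked uri.go (treat the pattern as an assumption if the code has changed since that commit). It illustrates why an ID of this shape is normal: url.Parse puts the trust domain (xxx.consul) into the URI host, so only the path is matched against the regex.

package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// Pattern as it appears in agent/connect/uri.go at the linked commit
// (assumption: it may have changed since).
var spiffeIDServerRegexp = regexp.MustCompile(`^/agent/server/dc/([^/]+)$`)

func main() {
	// The scheme and host are parsed off before matching, so the
	// "xxx.consul in front" never reaches the regex.
	u, err := url.Parse("spiffe://xxx.consul/agent/server/dc/dc1")
	if err != nil {
		panic(err)
	}
	fmt.Println("trust domain:", u.Host) // xxx.consul
	fmt.Println("path:", u.Path)         // /agent/server/dc/dc1
	fmt.Println("matches spiffeIDServerRegexp:", spiffeIDServerRegexp.MatchString(u.Path)) // true
}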

ferhatvurucu commented 1 year ago

We are not actively using Consul Connect yet, but we plan to adopt it in the near future. Even though we disabled gRPC TLS, we still see the error message above. With these settings, how can I enable Consul Connect while keeping TLS disabled for now?

It seems this is related to https://github.com/hashicorp/nomad/issues/15360

jkirschner-hashicorp commented 1 year ago

Which Nomad version are you using? Per the Consul 1.14.x upgrade docs:

The changes to Consul service mesh in version 1.14 are incompatible with Nomad 1.4.2 and earlier. If you operate Consul service mesh using Nomad 1.4.2 or earlier, do not upgrade to Consul 1.14 until hashicorp/nomad#15266 is fixed.

ferhatvurucu commented 1 year ago

We upgraded to Nomad 1.4.3 and Consul 1.14.2 respectively.

jkirschner-hashicorp commented 1 year ago

Were you on Nomad 1.4.3 at the time you reported this issue? Or just upgraded now?

It sounds like the former, but wanted to double-check.

ferhatvurucu commented 1 year ago

We were already on Nomad 1.4.3.

0xbentang commented 1 year ago

I had the same error after upgrading. Adding this to the agent config seems to fix it:

peering {
  enabled = false
}

Prior to Consul 1.14, cluster peering and Consul Connect were disabled by default. Consul 1.14 introduced a breaking change:

Cluster Peering is enabled by default. Cluster peering and WAN federation can coexist, so there is no need to disable cluster peering to upgrade existing WAN federated datacenters. To disable cluster peering nonetheless, set peering.enabled to false.
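
For agents configured with JSON, like the server configuration earlier in this thread, the equivalent of the HCL snippet above would presumably be:

{
  "peering": {
    "enabled": false
  }
}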