linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Add dst_target* labels on outbound pod-to-pod multicluster requests #11551

Open klingerf opened 8 months ago

klingerf commented 8 months ago

What problem are you trying to solve?

This request is similar to #8134, and relates to #8003.

I'm trying to join inbound and outbound metrics for pod-to-pod cross-cluster requests, but the set of outbound labels provided by linkerd for pod-to-pod cross-cluster requests differs from the set it provides for standard (gateway-based) cross-cluster requests. Joining these metrics would be easier if both request types exported the same set of labels.

For example, if my application makes a standard cross-cluster request, I see the following labels on the response_total timeseries:

response_total{
  authority="world-west.default.svc.cluster.local:7778",
  classification="failure",
  direction="outbound",
  dst_namespace="default",
  dst_service="world-west",
  dst_target_cluster="west",
  dst_target_service="world",
  dst_target_service_namespace="default",
  error="",
  grpc_status="",
  server_id="linkerd-gateway.linkerd-multicluster.serviceaccount.identity.linkerd.cluster.local",
  status_code="500",
  target_addr="52.226.236.246:4143",
  target_ip="52.226.236.246",
  target_port="4143",
  tls="true",
}

If my application instead makes a pod-to-pod cross-cluster request, I see this set of labels:

response_total{
  authority="world-west.default.svc.cluster.local:7778",
  classification="success",
  direction="outbound",
  dst_control_plane_ns="linkerd",
  dst_deployment="world",
  dst_namespace="default",
  dst_pod="world-69dbbbf799-mw9ww",
  dst_pod_template_hash="69dbbbf799",
  dst_service="world",
  dst_serviceaccount="default",
  error="",
  grpc_status="",
  server_id="default.default.serviceaccount.identity.linkerd.cluster.local",
  status_code="200",
  target_addr="10.23.0.8:7778",
  target_ip="10.23.0.8",
  target_port="7778",
  tls="true",
}

You can see that the dst_target* labels are missing from the second timeseries. In fact, the second timeseries looks almost identical to an in-cluster request, which makes this even more confusing:

response_total{
  classification="success",
  direction="outbound",
  dst_control_plane_ns="linkerd",
  dst_deployment="hello",
  dst_namespace="default",
  dst_pod="hello-7689c99f6c-8xwnm",
  dst_pod_template_hash="7689c99f6c",
  dst_serviceaccount="default",
  error="",
  grpc_status="",
  server_id="default.default.serviceaccount.identity.linkerd.cluster.local",
  status_code="200",
  target_addr="10.244.2.55:7777",
  target_ip="10.244.2.55",
  target_port="7777",
  tls="true",
}

How should the problem be solved?

When a cross-cluster pod-to-pod request is made, it would be great for the outbound metrics to include all of the dst_target* labels that apply, or at a minimum the dst_target_cluster label.
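
For illustration only (not existing output, just the values I'd expect given the gateway-based example above), the pod-to-pod timeseries would then additionally carry something like:

  dst_target_cluster="west",
  dst_target_service="world",
  dst_target_service_namespace="default",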

Alternatively, each type of request (in-cluster, pod-to-pod cross-cluster, standard cross-cluster) could be exported in a different timeseries, with different sets of labels. That would avoid the need to inspect individual labels to know what kind of request it is.

Any alternatives you've considered?

We can derive the value of the dst_target_cluster label from the existing labels, but it seems super brittle and only works for HTTP requests, since it would rely on the authority label. From the pod-to-pod cross-cluster request labels above, we have:

  authority="world-west.default.svc.cluster.local:7778",
  dst_service="world",

The first segment of the authority is the name of the mirrored service to which the request was sent, and that differs from the value of the dst_service label. If we assume that mirrored services are always named <dst_service>-<target_cluster>, then we can strip the dst_service value (plus the joining hyphen) from that segment to recover the target cluster name ("west", in this case).
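
For reference, here is a rough PromQL sketch of that workaround. It assumes the <dst_service>-<target_cluster> naming convention and, additionally, that cluster names contain no hyphens or dots, since PromQL regexes can't reference the dst_service label directly and so this just splits on the last hyphen of the first authority segment:

  label_replace(
    response_total{direction="outbound"},
    "dst_target_cluster", "$2",
    "authority", "(.+)-([^-.]+)\\..*"
  )

If the target cluster names are known ahead of time, anchoring the regex on them (e.g. "(.+)-(east|west)\\..*") avoids mis-splitting, but either way this only covers HTTP traffic where the authority label is populated.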

How would users interact with this feature?

Via metrics
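
For example (sketch only), once a dst_target_cluster label exists, a query like the following would give per-target-cluster success/failure rates without any authority parsing:

  sum by (dst_target_cluster, classification) (
    rate(response_total{direction="outbound"}[1m])
  )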

Would you like to work on this feature?

no

stale[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.