linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Prometheus metrics federation yields HTTP 403 #11050

Open adleong opened 1 year ago

adleong commented 1 year ago

Discussed in https://github.com/linkerd/linkerd2/discussions/11044

Originally posted by **ngc4579** June 21, 2023

Using the Prometheus federation API as advertised in the [docs](https://linkerd.io/2.13/tasks/exporting-metrics/#federation) yields an HTTP 403 scrape error (`server returned HTTP status 403 Forbidden`). IIRC this used to work some time ago. Were there any (recent) changes that are possibly not reflected in the docs? What might cause the described behaviour?
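
For readers without the docs open, here is a minimal sketch of the kind of federation scrape job being described. It is written from memory of the linked docs page rather than copied from the reporter's setup, so treat the job name, the `linkerd-viz` namespace, and the `match[]` selectors as assumptions to check against your own install.

```yaml
# Sketch of a scrape job for an external Prometheus that federates from
# the Prometheus instance shipped with linkerd-viz.
# Assumptions: linkerd-viz is installed in the "linkerd-viz" namespace and
# its Prometheus container is named "prometheus".
- job_name: 'linkerd'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['linkerd-viz']

  relabel_configs:
  # Keep only the Prometheus container of the linkerd-viz install.
  - source_labels:
    - __meta_kubernetes_pod_container_name
    action: keep
    regex: ^prometheus$

  # Preserve the labels set by the source Prometheus.
  honor_labels: true
  # This is the endpoint that returns 403 when no policy authorizes the client.
  metrics_path: '/federate'

  params:
    'match[]':
      - '{job="linkerd-proxy"}'
      - '{job="linkerd-controller"}'
```

The request to `/federate` is what reaches the `prometheus-admin` Server in `linkerd-viz`, which is why an authorization policy on that Server decides whether the scrape succeeds.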

adleong commented 1 year ago

Thanks for raising this, @ngc4579. It's possible that additional AuthorizationPolicies are needed for Prometheus federation. This will require some investigation.

wmorgan commented 1 year ago

This policy was suggested by Michelle B on the Linkerd Slack (link will expire in 90 days):

apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: prometheus-admin-federate
  namespace: linkerd-viz
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: prometheus-admin
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: NetworkAuthentication
      name: kubelet

ngc4579 commented 1 year ago

Thanks so much @adleong @wmorgan for your answers. The mentioned AuthorizationPolicy actually did help; federation works as expected now. If this policy is intentionally required, I guess this should be reflected in the docs. (Or else, if it already is, it seems I wasn't able to find it. :) )

prajithp13 commented 1 year ago

We have set up linkerd-viz with an external Prometheus, and after the upgrade we are getting the following errors:

time="2023-06-26T12:34:55Z" level=error msg="queryProm failed with: Query failed: \"sum(increase(response_total{deployment=\\\"app-prod-http\\\", direction=\\\"outbound\\\", namespace=\\\"web\\\"}[1m])) by (dst_namespace, dst_deployment, classification, tls)\": Post \"https://external-endpoint/api/v1/query\": context canceled"

alpeb commented 1 year ago

Would anybody like to submit a PR with this policy included? It should be pretty straightforward.

@prajithp13 Did you apply the policy?

deepto98 commented 1 year ago

@alpeb I'd like to pick this up. I'm learning Linkerd and service meshes in general and would also like to contribute to the project; this seems like a good issue to start with.

alpeb commented 1 year ago

@deepto98 sounds great, please proceed!

alexandreliberato commented 1 year ago

@deepto98 Are you working on this? If not, I would be willing to tackle this issue :)

deepto98 commented 1 year ago

I'll pick this up this week.

jderieg commented 1 year ago

Did a PR for this issue ever get created?

ioannatheo commented 11 months ago

Hey, is there any progress on this issue?

wmorgan commented 11 months ago

@ioannatheo there is a workaround: apply the policy YAML pasted above. A PR to add that policy by default would be welcome.

francRang commented 8 months ago

I am actively working on this. I think I have a pretty good understanding of what needs to be done. You can track progress at https://github.com/francRang/linkerd2 and I should be able to get it ready for review in 1-2 days at most.

e1011215 commented 2 months ago

I am assuming @adleong used the Helm chart. I used the Helm chart and am seeing the same issue.

The `kubelet` NetworkAuthentication is meant for probes from the kubelet. The default definition provided in the Helm chart is a catch-all (every client matches it), so the policy above lets everything in. It probably works as a workaround only because it is a catch-all, not because it actually identifies the federating Prometheus.
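
One way to act on this concern is to authorize only the network the federating Prometheus scrapes from, instead of reusing the catch-all `kubelet` entry. The following is a sketch, not a tested fix: the name `federating-prometheus` and the CIDR `10.42.0.0/16` are hypothetical placeholders, and you would substitute the range your external Prometheus actually connects from.

```yaml
# Hypothetical NetworkAuthentication that matches only the scraper's network.
# The name and the CIDR below are placeholders, not values from this thread.
apiVersion: policy.linkerd.io/v1alpha1
kind: NetworkAuthentication
metadata:
  name: federating-prometheus
  namespace: linkerd-viz
spec:
  networks:
    - cidr: 10.42.0.0/16
---
# Same policy as suggested earlier in the thread, but pointing at the
# narrower NetworkAuthentication instead of the catch-all "kubelet" one.
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: prometheus-admin-federate
  namespace: linkerd-viz
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: prometheus-admin
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: NetworkAuthentication
      name: federating-prometheus
```

If the federating Prometheus is itself meshed, a MeshTLSAuthentication keyed on its identity may be a tighter choice than a CIDR; that variant is not shown here.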