google / cloudprober

[Moved to cloudprober/cloudprober] An active monitoring software to detect failures before your customers do.
https://cloudprober.org
Apache License 2.0
1.4k stars 151 forks source link

[Feature Request] Allow k8s to discover ingress resources #419

Closed networkop closed 4 years ago

networkop commented 4 years ago

Similar to how dynamic targets can be discovered from pods, endpoints and services, it'd be nice to have RDS discover ingress resources as well. The result would be an FQDN with optional set of IP (if resolved by the RDS server).

manugarg commented 4 years ago

Just for the record, ingress's status object is same as service's status object[1]. We can probably just reuse most of the code after some refactoring.

[1] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#ingress-v1beta1-networking-k8s-io

networkop commented 4 years ago

Yes, in fact, this status is copied from an underlying LB service by the ingress controller (e.g. [1]). However, I think we still need to know (and pass to the client) the spec.rules[*].host to build the correct HTTP host header.

[1] https://github.com/kubernetes/ingress-nginx/blob/3eafaa35a171e2d91d75690461ef45e721411aec/internal/ingress/status/status.go#L237

manugarg commented 4 years ago

However, I think we still need to know (and pass to the client) the spec.rules[*].host to build the correct HTTP host header.

That's an interesting detail. Thanks @networkop.

There are two interesting problems:

  1. We need to have a way to transfer the host information from ingress definition to the HTTP probe. We can probably use an internal/private label to achieve that, say something like _cloudprober_http_host.

  2. How to handle multiple hosts in ingress definition. Is it common to have multiple hosts in an ingress definition?

networkop commented 4 years ago

We can probably use an internal/private label to achieve that, say something like _cloudprober_http_host.

Yeah, I think something like that should work and it's fairly easy to add.

How to handle multiple hosts in ingress definition. Is it common to have multiple hosts in an ingress definition?

This may not be common but it's definitely possible and even mentioned in the offical k8s ingress docs.

Another questions is how to do DNS resolution for these hosts. The most straigh-forward way is to do it locally on the client, based on information provided in Resource.IP as collected from Ingress.status.LoadBalancer. However, this may result in false negatives in certain situations. If a user relies on external DNS integration and this integration fails for some reason (e.g. IAM permissions changed), then Ingress will not be accessible but cloudprober will continue reporting it as healthy.

infa-kparida commented 4 years ago

This will be really good if cloudprober can discover ingress and then provide the golden signals from those ingress.This will be one of the killer feature of cloudprober if can be implemented

manugarg commented 4 years ago

@infa-kparida thanks for the input. Yes, we are planning to add support for the "ingress" resource type.

I've not yet figured out how to name targets for multiple host rules within an ingress, for example, we can do:

_ or just I guess second option is cleaner, though I think we should provide a way to identify which ingress that particular `host` belongs to. Let me know if you have any opinions on that. Also, we recently added special handling for the target's 'fqdn' label; if present, it becomes the HTTP host header for the probe: https://github.com/google/cloudprober/blob/4bdceab3d43a5e22d1dc19649700f12d65cc5652/probes/http/request.go#L62 We can use this feature and automatically add 'fqdn' label to ingress targets if not present already. > > We can probably use an internal/private label to achieve that, say something like _cloudprober_http_host. > > Yeah, I think something like that should work and it's fairly easy to add. > > > How to handle multiple hosts in ingress definition. Is it common to have multiple hosts in an ingress definition? > > This may not be common but it's definitely possible and even mentioned in the offical k8s [ingress docs](https://kubernetes.io/docs/concepts/services-networking/ingress/#name-based-virtual-hosting). > > Another questions is how to do DNS resolution for these hosts. The most straigh-forward way is to do it locally on the client, based on information provided in `Resource.IP` as collected from `Ingress.status.LoadBalancer`. However, this may result in false negatives in certain situations. If a user relies on [external DNS integration](https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/nginx-ingress.md) and this integration fails for some reason (e.g. IAM permissions changed), then Ingress will not be accessible but cloudprober will continue reporting it as healthy. I think for DNS resolution we don't have to depend on the IP provided in the resource; we can let system resolve the IP address. There is a config option for that (though probably not that clear): `resolve_first`. If resolve_first is not configured, we use system's DNS resolver.
networkop commented 4 years ago

I think <ingress_name>_<host_fqdn> or <ingress_name>_<index> would do, as you can have multiple unique targets defined in a single ingress resource.

Also @manugarg , I was thinking about implementing this locally but then I noticed you've added it to the next milestone. I'm happy to help with the implementation or testing to help speed things up, so please let me know if you need any help.

manugarg commented 4 years ago

Thanks @networkop. I had started a change some time back. I'll try to finish it in a couple of days (it will mostly get submitted around Sep 08 though). I'll let you know how it goes.

manugarg commented 4 years ago

Thanks @networkop. I had started a change some time back. I'll try to finish it in a couple of days (it will mostly get submitted around Sep 08 though). I'll let you know how it goes.

I was over-optimistic to think that I'll get it done over the long weekend :) I've the change out for review (internally) now. It will take a few more days I think.

manugarg commented 4 years ago

Alright, #469 adds support for ingress resources.

Following example in the test shows how resources end up getting named and can be filtered using labels: https://github.com/google/cloudprober/blob/928d3def650642b3fcf1ef845ffd19316a241d7d/rds/kubernetes/ingresses_test.go#L88

Ingress resources get the following two labels:

"fqdn"         : is set to the "host" in the ingress rule.
"relative_url" : is set to the "path" in the ingress rule.

@networkop and @infa-kparida, please let me know if you have any thoughts on it.

networkop commented 4 years ago

Great, thanks @manugarg , I'll give it a spin in the next few days and let you know.

networkop commented 4 years ago

Hi @manugarg , here are a few issues I've picked up so far:

  1. RDS now needs and updated set of RBAC rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
  name: rds-reader
rules:
- apiGroups: [""]
  resources: ["*"]
  verbs: ["get", "list"]
- apiGroups:
  - extensions
  - "networking.k8s.io" # k8s 1.14+
  resources:
  - ingresses
  - ingresses/status
  verbs: ["get", "list"]
  1. URL is malformed when the client uses custom surfacer.prometheus_surfacer.metrics_prefix. For example if the prefix is set to cloudprober_, the URL becomes: http://cloudprober_google.com

  2. HTTP path gets appended to the URL incorrectly. E.g. if I have an ingress with path /download, the URL becomes: http://google.com__download

networkop commented 4 years ago

another issue is that the URL that gets built and the client tries to resolve is in the format ingressName_ingressHostname, e.g. my-ingress_google.com

networkop commented 4 years ago

seems like the above DNS resolution issues are fixed with http_proxy.resolve_first: true

manugarg commented 4 years ago

Thanks a lot for checking @networkop.

RDS now needs and updated set of RBAC rules:

Yeah, this is expected. I'll update the document.

URL is malformed when the client uses custom surfacer.prometheus_surfacer.metricsprefix. For example if the prefix is set to cloudprober, the URL becomes: http://cloudprober_google.com

I didn't get this one. This prefix is added to the metric names. Which URL is this?

HTTP path gets appended to the URL incorrectly. E.g. if I have an ingress with path /download, the URL becomes: http://google.com__download

another issue is that the URL that gets built and the client tries to resolve is in the format ingressName_ingressHostname, e.g. my-ingress_google.com

So both of these issues were supposed to be resolved by target labels. HTTP probe should use "fqdn" and "relative_url" labels, which are added by RDS automatically ("fqdn" label is used for URL host as well as host header).

https://github.com/google/cloudprober/blob/928d3def650642b3fcf1ef845ffd19316a241d7d/probes/http/request.go#L71

Just to confirm, you tried with cloudprober head, right? Not just the ingress change.

I'll investigate more. Thanks a lot for giving it a try. That's very helpful.

manugarg commented 4 years ago

@networkop I tried running an HTTP probe like this:

probe {
  name: "pod-to-ingress"
  type: HTTP

  targets {
    rds_targets {
      resource_path: "k8s://ingresses"
      filter {
        key: "namespace"
        value: "default"
      }
    }
  }

  additional_label {
    key: "fqdn"
    value: "@target.label.fqdn@"
  }

  additional_label {
    key: "probeurl"
    value: "@target.label.relative_url@"
  }

  http_probe {}
}

Where ingress is:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: rds-ingress
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /health
        backend:
          serviceName: cloudprober-test
          servicePort: 9313
  - host: prometheus.bar.com
    http:
      paths:
      - path: /
        backend:
          serviceName: cloudprober-test
          servicePort: 9313

With this I get logs like this:

2020-09-16 12:17:31.005 PDT
Target:foo.bar.com, URL:http://foo.bar.com/health, http.doHTTPRequest: Get "http://foo.bar.com/health": dial tcp: lookup foo.bar.com on 10.31.240.10:53: no such host
Expand all | Collapse all{
 insertId: "1sxfsuig1rbeq6l"  
 logName: "projects/google.com:bbmc-stackdriver/logs/cloudprober.pod-to-ingress"  
 receiveTimestamp: "2020-09-16T19:17:41.024229028Z"  
 resource: {…}  
 severity: "WARNING"  
 textPayload: "Target:foo.bar.com, URL:http://foo.bar.com/health, http.doHTTPRequest: Get "http://foo.bar.com/health": dial tcp: lookup foo.bar.com on 10.31.240.10:53: no such host"  
 timestamp: "2020-09-16T19:17:31.005359502Z"  
}

Notice that the URL is correct here.

And metrics look like this (notice additional labels here):

total{ptype="http",probe="pod-to-ingress",dst="rds-ingress_foo.bar.com__health",fqdn="foo.bar.com",probeurl="/health"} 25 1600284207212
total{ptype="http",probe="pod-to-ingress",dst="rds-ingress_prometheus.bar.com",fqdn="prometheus.bar.com",probeurl="/"} 25 1600284207212
success{ptype="http",probe="pod-to-ingress",dst="rds-ingress_foo.bar.com__health",fqdn="foo.bar.com",probeurl="/health"} 0 1600284207212
success{ptype="http",probe="pod-to-ingress",dst="rds-ingress_prometheus.bar.com",fqdn="prometheus.bar.com",probeurl="/"} 0 1600284207212
latency{ptype="http",probe="pod-to-ingress",dst="rds-ingress_foo.bar.com__health",fqdn="foo.bar.com",probeurl="/health"} 0.000 1600284207212
latency{ptype="http",probe="pod-to-ingress",dst="rds-ingress_prometheus.bar.com",fqdn="prometheus.bar.com",probeurl="/"} 0.000 1600284207212
timeouts{ptype="http",probe="pod-to-ingress",dst="rds-ingress_foo.bar.com__health",fqdn="foo.bar.com",probeurl="/health"} 0 1600284207212
timeouts{ptype="http",probe="pod-to-ingress",dst="rds-ingress_prometheus.bar.com",fqdn="prometheus.bar.com",probeurl="/"} 0 1600284207212
networkop commented 4 years ago

hm.. maybe they were false positives. Let me retest tomorrow and report back.

networkop commented 4 years ago

it looks like it was false positives. I may have had DNS not fully functioning at one point and I looked at dst and assumed that's what was getting resolved. So far I've reverted back all of the workarounds I put in place, including the metrics_prefix and everything seems to be working fine.

manugarg commented 4 years ago

Thanks so much for confirming @networkop.

I'll go ahead mark this issue closed. We can re-open this or file another one if we find a problem with the ingress resources discovery.