Remove old ingress-rules metrics for prometheus scraping

SilentEntity commented 8 months ago

What happened:

After an ingress rule is updated, the Ingress controller keeps exposing metrics for the old rules in addition to the new ones. This increases cardinality and produces useless data for the removed rules every time Prometheus scrapes the pod.

What you expected to happen:

Once rules are updated or removed, the metrics for the old rules should be removed as well. That reduces cardinality and avoids exposing useless data for removed or updated rules.
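A minimal sketch of the expected behaviour, written in Python for brevity rather than in the controller's own language. This is not ingress-nginx code: `SeriesStore`, `observe`, and `prune` are hypothetical names, and the only assumption taken from the report is that each series carries an `ingress` label identifying the rule it belongs to.

```python
class SeriesStore:
    """Toy in-memory store: one counter per (metric name, label set)."""

    def __init__(self):
        self._series = {}

    def observe(self, name, labels, value=1.0):
        # Labels are frozen into a sorted tuple so they can serve as a dict key.
        key = (name, tuple(sorted(labels.items())))
        self._series[key] = self._series.get(key, 0.0) + value

    def prune(self, active_ingresses):
        """Drop series whose `ingress` label is set but no longer active."""
        stale = []
        for key in self._series:
            ingress = dict(key[1]).get("ingress")
            if ingress is not None and ingress not in active_ingresses:
                stale.append(key)
        for key in stale:
            del self._series[key]
        return len(stale)

    def cardinality(self):
        return len(self._series)


store = SeriesStore()
for i in range(100):
    store.observe("requests", {"ingress": f"rule-{i}", "status": "200"})

# The configuration shrinks to 10 rules, so everything else is pruned.
active = {f"rule-{i}" for i in range(10)}
removed = store.prune(active)
print(removed, store.cardinality())  # 90 10
```

Pruning of this kind only changes what the endpoint exposes on the next scrape. Samples that Prometheus has already stored are not touched by it.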

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

Kubernetes version (use kubectl version): Not relevant

Environment:

Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release): not relevant
Kernel (e.g. uname -a): not relevant
Install tools: EKS, AKS and bare metal

How to reproduce this issue:

Add 100 rules, then update a rule or reduce the rules to 10. The Ingress controller will keep exposing metrics for both the old and the new rules.
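One way to produce such a rule set is to generate the manifest. The sketch below builds a `networking.k8s.io/v1` Ingress with a configurable number of host rules. The hostnames, the `demo-svc` service, the `cardinality-repro` name, and the `nginx` ingress class are placeholders chosen for illustration, not values from this report.

```python
import json


def make_ingress(name, rule_count, service="demo-svc", port=80):
    """Build a networking.k8s.io/v1 Ingress with `rule_count` host rules."""
    rules = [
        {
            "host": f"app-{i}.example.test",  # placeholder hostname
            "http": {
                "paths": [
                    {
                        "path": "/",
                        "pathType": "Prefix",
                        "backend": {
                            "service": {"name": service, "port": {"number": port}}
                        },
                    }
                ]
            },
        }
        for i in range(rule_count)
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {"ingressClassName": "nginx", "rules": rules},
    }


big = make_ingress("cardinality-repro", 100)
small = make_ingress("cardinality-repro", 10)

# kubectl accepts JSON manifests, so either payload can be applied as-is.
big_payload = json.dumps(big, indent=2)
small_payload = json.dumps(small, indent=2)
print(len(big["spec"]["rules"]), len(small["spec"]["rules"]))  # 100 10
```

Printing `big_payload` and piping it to `kubectl apply -f -`, then doing the same with `small_payload`, reproduces the "100 rules, then 10" sequence, assuming the placeholder service exists in the target namespace.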

Increase in cardinality:

cat metrics | grep -v "#" |cut -d "{" -f1  | sort | uniq -c | sort -rn | head -n40
3048 nginx_ingress_controller_request_duration_seconds_bucket
2988 nginx_ingress_controller_response_duration_seconds_bucket
2988 nginx_ingress_controller_connect_duration_seconds_bucket
2820 nginx_ingress_controller_header_duration_seconds_bucket
2794 nginx_ingress_controller_response_size_bucket
2794 nginx_ingress_controller_request_size_bucket
2032 nginx_ingress_controller_bytes_sent_bucket
 254 nginx_ingress_controller_response_size_sum
 254 nginx_ingress_controller_response_size_count
 254 nginx_ingress_controller_requests
 254 nginx_ingress_controller_request_size_sum
 254 nginx_ingress_controller_request_size_count
 254 nginx_ingress_controller_request_duration_seconds_sum
 254 nginx_ingress_controller_request_duration_seconds_count
 254 nginx_ingress_controller_bytes_sent_sum
 254 nginx_ingress_controller_bytes_sent_count
 249 nginx_ingress_controller_response_duration_seconds_sum
 249 nginx_ingress_controller_response_duration_seconds_count
 249 nginx_ingress_controller_connect_duration_seconds_sum
 249 nginx_ingress_controller_connect_duration_seconds_count
 235 nginx_ingress_controller_header_duration_seconds_sum
 235 nginx_ingress_controller_header_duration_seconds_count

After you restart the pod:

cat metrics | grep -v "#" |cut -d "{" -f1  | sort | uniq -c | sort -rn | head -n40
 288 nginx_ingress_controller_response_duration_seconds_bucket
 288 nginx_ingress_controller_request_duration_seconds_bucket
 288 nginx_ingress_controller_header_duration_seconds_bucket
 288 nginx_ingress_controller_connect_duration_seconds_bucket
 264 nginx_ingress_controller_response_size_bucket
 264 nginx_ingress_controller_request_size_bucket
 192 nginx_ingress_controller_bytes_sent_bucket
  24 nginx_ingress_controller_response_size_sum
  24 nginx_ingress_controller_response_size_count
  24 nginx_ingress_controller_response_duration_seconds_sum
  24 nginx_ingress_controller_response_duration_seconds_count
  24 nginx_ingress_controller_requests
  24 nginx_ingress_controller_request_size_sum
  24 nginx_ingress_controller_request_size_count
  24 nginx_ingress_controller_request_duration_seconds_sum
  24 nginx_ingress_controller_request_duration_seconds_count
  24 nginx_ingress_controller_header_duration_seconds_sum
  24 nginx_ingress_controller_header_duration_seconds_count
  24 nginx_ingress_controller_connect_duration_seconds_sum
  24 nginx_ingress_controller_connect_duration_seconds_count
  24 nginx_ingress_controller_bytes_sent_sum
  24 nginx_ingress_controller_bytes_sent_count
  21 nginx_ingress_controller_ingress_upstream_latency_seconds
  19 nginx_ingress_controller_orphan_ingress
   7 nginx_ingress_controller_ingress_upstream_latency_seconds_sum
   7 nginx_ingress_controller_ingress_upstream_latency_seconds_count
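The two listings above can also be compared programmatically. The following is a rough Python equivalent of the `grep | cut | sort | uniq -c` pipeline plus a small diff helper. `series_per_metric` and `shrinkage` are hypothetical helper names, and the sample exposition text is made up for the example. Unlike the shell pipeline, it skips only lines that start with `#` and strips the sample value from metrics that have no labels.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count sample lines per metric name in Prometheus text format."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE/comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' or, without labels, at whitespace.
        name = line.split("{", 1)[0].split()[0]
        counts[name] += 1
    return counts


def shrinkage(before, after):
    """Return {metric: (before, after)} for metrics whose series count dropped."""
    return {
        name: (count, after.get(name, 0))
        for name, count in before.items()
        if after.get(name, 0) < count
    }


# Made-up sample data standing in for two saved scrapes of /metrics.
before = series_per_metric(
    "# TYPE nginx_ingress_controller_requests counter\n"
    'nginx_ingress_controller_requests{ingress="old",status="200"} 5\n'
    'nginx_ingress_controller_requests{ingress="new",status="200"} 7\n'
)
after = series_per_metric(
    'nginx_ingress_controller_requests{ingress="new",status="200"} 2\n'
)
print(shrinkage(before, after))
# {'nginx_ingress_controller_requests': (2, 1)}
```

Fed the two dumps from this report, the same helper would show `nginx_ingress_controller_requests` going from 254 series to 24 after the restart, roughly a tenfold reduction.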

Anything else we need to know:

k8s-ci-robot commented 8 months ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

longwuyuan commented 8 months ago

/help

@SilentEntity thanks for reporting this.

- Yes, you are right, and this has been going on for a long time.
- Another typical example is that an expired cert continues to show up even after the related ingress is deleted.
- Personally, though, I am waiting for someone to clarify the time-series aspect. The context is that the metrics for old rules and for a deleted ingress's cert are time-series data that a user may still want to view in Grafana (or query from raw Prometheus) in the future.

So I don't think this is a bug unless we discuss it and triage it as one. Let's wait for expert comments and opinions.

/assign

k8s-ci-robot commented 8 months ago

@longwuyuan: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/11047).

longwuyuan commented 8 months ago

/remove-kind bug

github-actions[bot] commented 7 months ago

This is stale, but we won't close it automatically. Just bear in mind that the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any questions or a request to prioritize this, please reach out in #ingress-nginx-dev on Kubernetes Slack.

SilentEntity commented 7 months ago

Old or expired metrics won't be present in a new pod (created while scaling) or in a restarted pod anyway, which creates discrepancies in the metrics and in Grafana dashboards.

jakuboskera commented 6 months ago

+1
