kubernetes-sigs / metrics-server

Scalable and efficient source of container resource metrics for Kubernetes built-in autoscaling pipelines.
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
Apache License 2.0

ux: improve scrape error messages (log + wrap with additional error details) #1027

Closed Dentrax closed 1 month ago

Dentrax commented 2 years ago

What would you like to be added:

Improve the readability of the following error message:

scraper.go:140] "Failed to scrape node" err="request failed, status: \"404 Not Found\"" node="ttskublhrms11"

I don't understand why I'm getting a "request failed" error.

We propose wrapping the error with additional details and adding some logging (request path, response details, wrapping with fmt.Errorf, etc.).

Why is this needed:

To improve error readability.

cc @eminaktas

/kind feature

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

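This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed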

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

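This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed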

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

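This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage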

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/metrics-server/issues/1027#issuecomment-1272538846):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

Dentrax commented 1 year ago

/reopen
/remove-lifecycle rotten

k8s-ci-robot commented 1 year ago

@Dentrax: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/metrics-server/issues/1027#issuecomment-1398428217):

> /reopen
> /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

dgrisonnet commented 1 year ago

@Dentrax what would be your suggestion to improve the error?

scraper.go:140] "Failed to scrape node" err="request failed, status: \"404 Not Found\"" node="ttskublhrms11"

The current error means that scraping metrics from the node failed because node="ttskublhrms11" was not found.

rexagod commented 1 year ago

Adding more context to error messages may undo some of the work done in https://github.com/kubernetes-sigs/metrics-server/pull/774. Maybe the request here is for more verbose logging support (triggered by -v flags, so we can retain the current error behaviour but be verbose when the user asks for it)?

/assign

rexagod commented 1 year ago

Pinging @Dentrax.

Dentrax commented 1 year ago

Pong!

> scraping metrics from the node failed because node="ttskublhrms11" was not found.

@dgrisonnet, actually the node was already there, up and running. "404 Not Found" still seems too generic to me.

> Maybe the request here is for a more verbose logging support

Definitely! Providing some context would point the way and eventually reduce troubleshooting time.

As I already mentioned in the issue, the questions I asked were:
- Request to where?
- 404 of what?
- Why?

dgrisonnet commented 1 year ago

From my developer perspective, the existing error log already answers these questions, but maybe it is not clear enough from a user perspective, which is why I wanted to know what you were expecting.

If we take the error log you reported:

> Request to where?

Failed to scrape node ... node="ttskublhrms11" => a scrape request was sent to node ttskublhrms11. The request is made to the /metrics/resource endpoint of the kubelet running on node ttskublhrms11. The fact that the request goes to the kubelet can be considered an implementation detail that shouldn't be very useful to users when debugging.
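Including the exact target in the error would answer "Request to where?" without the user needing to know that implementation detail. A minimal sketch of building that target with `net/url`, assuming an HTTPS kubelet on its default port 10250; `kubeletResourceURL` is hypothetical, and metrics-server resolves the real host and port itself.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// kubeletResourceURL builds the address of a node's kubelet
// /metrics/resource endpoint. Host and port are illustrative inputs.
func kubeletResourceURL(host string, port int) string {
	u := url.URL{
		Scheme: "https",
		// JoinHostPort adds the brackets IPv6 literals need.
		Host: net.JoinHostPort(host, strconv.Itoa(port)),
		Path: "/metrics/resource",
	}
	return u.String()
}

func main() {
	fmt.Println(kubeletResourceURL("10.0.0.5", 10250))
	fmt.Println(kubeletResourceURL("fd00::5", 10250))
}
```

Logging this string alongside the node name would also show whether the scrape went to the address the user expects.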

> 404 of what?

\"404 Not Found\"" node="ttskublhrms11" => this means the scrape request returned a 404 because node ttskublhrms11 wasn't present in the cluster.

> Why?

The node might have been deleted while metrics-server was trying to grab metrics from it. This doesn't sound too harmful; it might just be that the list of nodes metrics-server held internally wasn't up to date yet.

dgrisonnet commented 1 year ago

/assign
/triage accepted

k8s-triage-robot commented 6 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

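This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed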

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

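This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed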

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

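This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage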

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/metrics-server/issues/1027#issuecomment-2269838560):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.