prometheus / docs

Prometheus documentation: content and static site generator
https://prometheus.io
Apache License 2.0

Naming conventions: Refine the recommendations. #2469

Open beorn7 opened 5 months ago

beorn7 commented 5 months ago

https://github.com/prometheus/prometheus/issues/8718 discusses "misnamed" metrics and comes to the conclusion that their names are actually fine and we should improve the recommendations for naming metrics to match the actually existing "fine" metric names.

So the task here is to turn the discussion in https://github.com/prometheus/prometheus/issues/8718 into changes to the metric naming best practices page.

Gopi-eng2202 commented 1 week ago

Hi @beorn7, I have a couple of questions:

  1. Is this naming issue only about failed units? (That's what I understood from issue #8718.)
  2. If so, would it be a good idea to add a point to the docs page saying that a metric name that deals with failed units must have the keyword "_failed" right after the name of the unit? Examples:

If you think my understanding is right, I can work on this and create a PR.

beorn7 commented 1 week ago

I'm not sure about the precise answers to your questions. I guess finding the answers is part of the task here. @juliusv and @SuperQ might be better suited to loop in here.

All I can say is that we want the best practices worded so that metric names that are in fact fine are covered by them. From my limited understanding, this is not at all about metrics for "failed" things or the _failed suffix (or infix) in particular. I think this is more about defining what a "unit" is in the Prometheus context. Once it is clarified that "truncations" is not seen as a unit in prometheus_tsdb_head_truncations_failed_total, maybe the problem is solved already. The aspect of sorting related metrics together by moving their difference to a position in the name as late as possible might be something to mention in the best practices, but again, this is not specific to _failed.
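To illustrate the sorting aspect, here is a minimal Python sketch. It assumes that "sorting" means plain lexicographic ordering of metric names, as in an alphabetical metric listing. The name prometheus_tsdb_head_failed_truncations_total is a hypothetical alternative invented for comparison, the two extra names are only there as neighbours, and the helper are_adjacent is made up for this example.

```python
# Two unrelated metric names that act as neighbours in a sorted listing.
others = [
    "prometheus_tsdb_head_gc_duration_seconds",
    "prometheus_tsdb_head_series",
]

base = "prometheus_tsdb_head_truncations_total"
late_variant = "prometheus_tsdb_head_truncations_failed_total"   # difference placed late
early_variant = "prometheus_tsdb_head_failed_truncations_total"  # hypothetical: difference placed early


def are_adjacent(names, a, b):
    """Return True if a and b end up next to each other after lexicographic sorting."""
    ordered = sorted(names)
    return abs(ordered.index(a) - ordered.index(b)) == 1


# With the difference late in the name, the two related metrics sort together.
late_is_adjacent = are_adjacent(others + [base, late_variant], base, late_variant)

# With the difference early in the name, unrelated metrics land in between.
early_is_adjacent = are_adjacent(others + [base, early_variant], base, early_variant)

print(late_is_adjacent, early_is_adjacent)
```

With these four names, the late variant stays next to the base metric, while the early variant is separated from it by the other head metrics.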

On a different note, I think we still want to keep "real" units going last even for metrics that have "failed" in their name. For example, if there is a metric called request_size_bytes_total, and we want the size of failed requests in a separate counter, I could see different ways of calling it:

My gut feeling right now is to not overregulate beyond "an actual unit (not "truncations") should go last" as 1st priority and "take sorting into account as it fits your use case" as 2nd priority. But as said, others will have stronger and better justified opinions.
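The first priority could be sketched as a small check like the one below. This is only an illustration under stated assumptions: BASE_UNITS is a partial, hand-picked set of base unit names, trailing_unit is a hypothetical helper, and treating _total as a suffix that comes after the unit is an assumption of this sketch, not a rule decided in this thread.

```python
# Assumption: a small, incomplete set of "actual" base units for illustration.
BASE_UNITS = {"seconds", "bytes", "meters", "ratio", "volts",
              "amperes", "joules", "grams", "celsius"}


def trailing_unit(name):
    """Return the actual unit at the end of a metric name, or None.

    A trailing "_total" is ignored, so the unit is expected right before it.
    """
    parts = name.split("_")
    if parts[-1] == "total":
        parts = parts[:-1]
    if parts and parts[-1] in BASE_UNITS:
        return parts[-1]
    return None


# "bytes" is an actual unit and sits last (before "_total").
unit_a = trailing_unit("request_size_bytes_total")

# "truncations" and "failed" are not units, so the unit rule does not constrain this name.
unit_b = trailing_unit("prometheus_tsdb_head_truncations_failed_total")

print(unit_a, unit_b)
```

Under this reading, prometheus_tsdb_head_truncations_failed_total simply has no unit, so the position of "failed" is left to the second priority, sorting.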

@Gopi-eng2202 I'll assign you to this issue, and maybe you could just draft something up in a PR and nominate @SuperQ and @juliusv as reviewers to see what they think.

Gopi-eng2202 commented 1 week ago

OK, I got it. I'll work on it. Thanks.