status-im / nim-metrics

Nim metrics client library supporting the Prometheus monitoring toolkit, StatsD and Carbon

Multiple Outputs in Collector Breaks Exposition Format #80

Open jhwbarlow opened 4 months ago

jhwbarlow commented 4 months ago

Because the Collector API takes the metric type and help text as arguments to newCollector(), a collector cannot make output() calls for more than one metric without violating the exposition format standard.

The help text and type are printed only once for the entire collector, not once per metric. The exposition standard also says that the different time series (label combinations) of the same metric should be grouped together under a single help text and type:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000
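The grouping rule can be checked mechanically. Below is a minimal sketch using only the Nim standard library; `sampleName` and `isGrouped` are hypothetical helpers written for illustration and are not part of nim-metrics. It treats a scrape body as grouped when no metric name reappears after a different one has started. It deliberately ignores histogram and summary suffixes such as `_bucket` and `_sum`, which a real checker would have to fold into their parent family.

```nim
import std/[strutils, sets]

proc sampleName(line: string): string =
  ## The metric name is everything before the first '{' or space.
  var i = 0
  while i < line.len and line[i] notin {'{', ' '}:
    inc i
  line[0 ..< i]

proc isGrouped(body: string): bool =
  ## True when all samples of each metric name are contiguous.
  var seen = initHashSet[string]()
  var current = ""
  for raw in body.splitLines():
    let line = raw.strip()
    if line.len == 0 or line.startsWith("#"):
      continue                      # skip blank lines and HELP/TYPE comments
    let name = sampleName(line)
    if name != current:
      if name in seen:
        return false                # a metric resumed after another one started
      seen.incl name
      current = name
  true
```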

Multiple output() calls tend to interleave the metrics when, for example, you loop over a set of similar resources and report several metrics about each resource in every iteration. One could argue that this is a programming error in the collector itself: it should loop over each metric and output a time series per resource, rather than loop over each resource and output each metric for that resource.

As an example, I have been playing around with an exporter to export Unix Socket metrics:

    import metrics, metrics/chronos_httpserver, posix

    const unixSocketSendQueueLenCollectorName = "unix_socket_send_queue_len"
    const unixSocketSendQueueLimitCollectorName = "unix_socket_send_queue_limit"
    const unixSocketCommonCollectorLabels = ["local_addr", "local_port", "peer_addr", "peer_port"]
    type UnixSocketSendQueueLenCollector = ref object of Collector

    method collect(self: UnixSocketSendQueueLenCollector, output: MetricHandler) =
      let timestamp = self.now()
      let mockSSDatasource = MockSocketStatsDatasource(data: DATA) # Just some mock data of `ss` output
      let ssLister = SocketStatsLister[MockSocketStatsDatasource](datasource: mockSSDatasource) # use SS for now to avoid netlink

      try:
        for socket in ssLister.list():
          output(
            name = unixSocketSendQueueLenCollectorName,
            value = float64(socket.sendQueueLen),
            labels = unixSocketCommonCollectorLabels,
            labelValues = [socket.localAddr, $socket.localPort, socket.peerAddr, $socket.peerPort],
            timestamp = timestamp
          )
          output(
            name = unixSocketSendQueueLimitCollectorName,
            value = float64(socket.maxSendQueueLen),
            labels = unixSocketCommonCollectorLabels,
            labelValues = [socket.localAddr, $socket.localPort, socket.peerAddr, $socket.peerPort],
            timestamp = timestamp
          )
      except:
        # TODO
        discard

    discard UnixSocketSendQueueLenCollector.newCollector(
      name=unixSocketSendQueueLenCollectorName, # But this should not be per-collector, it should be per-gauge
      help="UNIX Socket send queue metrics", # But this should not be per-collector, it should be per-gauge
      labels=unixSocketCommonCollectorLabels # But what if I need different labels for each gauge? I don't, but some might.
    )

    startMetricsHttpServer()
    discard pause()

I want to loop through all the UNIX sockets in the system and report a couple of metrics (current send queue length and the send queue limit) on them (ignoring label cardinality explosions for now 😄 ).

But because the name and help text are defined at the collector level, this produces output that is invalid according to the exposition standard:

# HELP unix_socket_send_queue_len UNIX Socket send queue length
# TYPE unix_socket_send_queue_len gauge
unix_socket_send_queue_len{local_addr="/run/dbus/system_bus_socket",local_port="33440",peer_addr="*",peer_port="35386"} 0.0
unix_socket_send_queue_limit{local_addr="/run/dbus/system_bus_socket",local_port="33440",peer_addr="*",peer_port="35386"} 212992.0
unix_socket_send_queue_len{local_addr="/run/systemd/journal/stdout",local_port="34516",peer_addr="*",peer_port="31117"} 0.0
unix_socket_send_queue_limit{local_addr="/run/systemd/journal/stdout",local_port="34516",peer_addr="*",peer_port="31117"} 212992.0
unix_socket_send_queue_len{local_addr="@/tmp/dbus-iLrUs0Z7H5",local_port="87027",peer_addr="*",peer_port="94610"} 0.0
unix_socket_send_queue_limit{local_addr="@/tmp/dbus-iLrUs0Z7H5",local_port="87027",peer_addr="*",peer_port="94610"} 212992.0
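For comparison, here is a rough sketch of how the samples could be buffered per metric and then rendered in grouped form, with one HELP/TYPE header per metric. It uses only the Nim standard library; `Sample`, `MetricFamily`, `addSample` and `render` are hypothetical names for illustration, not the nim-metrics API, and the help strings are made up. Label value escaping is omitted for brevity.

```nim
import std/[tables, strutils]

type
  Sample = object
    labels: seq[(string, string)]   # label name/value pairs
    value: float64
  MetricFamily = object
    help: string
    kind: string                    # e.g. "gauge" or "counter"
    samples: seq[Sample]

proc addSample(families: var OrderedTable[string, MetricFamily],
               name, help, kind: string,
               labels: seq[(string, string)], value: float64) =
  ## Append a sample to its family, creating the family on first use.
  if name notin families:
    families[name] = MetricFamily(help: help, kind: kind)
  families[name].samples.add Sample(labels: labels, value: value)

proc render(families: OrderedTable[string, MetricFamily]): string =
  ## One HELP/TYPE header per family, followed by all of its samples.
  for name, family in families:
    result.add "# HELP " & name & " " & family.help & "\n"
    result.add "# TYPE " & name & " " & family.kind & "\n"
    for sample in family.samples:
      var pairs: seq[string]
      for pair in sample.labels:
        pairs.add pair[0] & "=\"" & pair[1] & "\""
      result.add name & "{" & pairs.join(",") & "} " & $sample.value & "\n"

when isMainModule:
  var families = initOrderedTable[string, MetricFamily]()
  # Resource-major loop, as in the collector above; grouping happens at render time.
  for socket in [("/run/dbus/system_bus_socket", 0.0, 212992.0),
                 ("/run/systemd/journal/stdout", 0.0, 212992.0)]:
    let labels = @[("local_addr", socket[0])]
    families.addSample("unix_socket_send_queue_len",
                       "UNIX socket send queue length", "gauge", labels, socket[1])
    families.addSample("unix_socket_send_queue_limit",
                       "UNIX socket send queue limit", "gauge", labels, socket[2])
  stdout.write families.render()
```

Buffering by metric name lets the collector keep its natural per-resource loop while the output stays grouped, at the cost of holding all samples in memory until the scrape is rendered.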

Looking at the built-in metrics, they appear to suffer from the same issue: the help text and name are defined generically for the whole collector instead of specifically for each metric, which is incorrect as far as I read the spec.

Thanks!

jhwbarlow commented 4 months ago

Of course, multiple collectors (one per metric) could be used to work around this, but if the call to gather the data is expensive (for example, the gathering of all the UNIX sockets), that gathering would have to happen multiple times redundantly. This also means the metrics will not reflect the exact same snapshot in time, but that is minor.
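One way to soften the cost of that workaround is to let the per-metric collectors share a short-lived cached snapshot, so the expensive call runs at most once per scrape interval. The following is only a sketch under assumptions: `SocketInfo`, `SnapshotCache`, `gatherSockets` and `snapshot` are hypothetical names, the gather proc returns placeholder data, and the cache is not thread-safe, so it would need a lock if collectors can run on different threads.

```nim
import std/[monotimes, times]

type
  SocketInfo = object              # hypothetical stand-in for one row of `ss` output
    localAddr: string
    sendQueueLen: int
    maxSendQueueLen: int

  SnapshotCache = object
    maxAge: Duration               # how long a snapshot may be reused
    takenAt: MonoTime
    valid: bool
    sockets: seq[SocketInfo]
    gathers: int                   # how many times the expensive call ran

proc gatherSockets(): seq[SocketInfo] =
  ## Placeholder for the expensive enumeration of all UNIX sockets.
  @[SocketInfo(localAddr: "/run/dbus/system_bus_socket",
               sendQueueLen: 0, maxSendQueueLen: 212992)]

proc snapshot(cache: var SnapshotCache): seq[SocketInfo] =
  ## Return the cached data, refreshing it only when it is missing or stale.
  let current = getMonoTime()
  if not cache.valid or current - cache.takenAt > cache.maxAge:
    cache.sockets = gatherSockets()
    cache.takenAt = current
    cache.valid = true
    inc cache.gathers
  cache.sockets

when isMainModule:
  var cache = SnapshotCache(maxAge: initDuration(seconds = 5))
  # Two collectors asking in quick succession would see the same snapshot.
  let forLen = cache.snapshot()
  let forLimit = cache.snapshot()
  echo "gathered ", cache.gathers, " time(s) for ", forLen.len + forLimit.len, " reads"
```

With a shared cache, both collectors would also report from the same snapshot in time, as long as they run within `maxAge` of each other.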