vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

New metric to expose the number of content groups #28983

Open pfrybar opened 10 months ago

pfrybar commented 10 months ago

Is your feature request related to a problem? Please describe.

When creating monitoring dashboards, it is difficult to tell from metric data alone how many documents exist when using a grouped distribution. We can use the current metrics to find the total number of documents across the entire cluster, but there is no way to find the "correct" (average) number of documents within a content group.

Describe the solution you'd like

Expose a metric for the number of content groups currently in use.
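To illustrate the request: if a group-count metric existed, the per-group figure could be derived with a single division. This is only a minimal sketch; the function name and the sample numbers are hypothetical and not part of Vespa.

```python
def average_docs_per_group(total_docs: float, group_count: int) -> float:
    """Average number of documents held by one content group."""
    if group_count <= 0:
        raise ValueError("group_count must be positive")
    return total_docs / group_count


# Hypothetical numbers: the cluster-wide total jumps when a group is added,
# while the per-group average stays flat.
print(average_docs_per_group(30_000_000, 3))  # 10000000.0
print(average_docs_per_group(40_000_000, 4))  # 10000000.0
```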

Describe alternatives you've considered

We have graphs of total document count which show a rough picture, but they get noisy when the number of content groups changes. See the screenshot below as an example:

[Image: Screenshot 2023-10-17 at 14 20 26]

yngveaasheim commented 10 months ago

There are two ways you can get a better view of the number of documents in your setup:

  1. Split or filter on the "groupId" tag/dimension when aggregating the number of documents per content group with the "searchnode.content.proton.documentdb.documents.ready" metric or similar. Note that you will also need to take "searchable-copies" into account here, or "redundancy" for some of the related metrics.
  2. Aggregate on the "distributor.vds.distributor.docsstored" metric instead, to get the number of unique documents per content cluster.
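As a rough sketch of option 1, the per-node values could be summed by "groupId" and then divided by the number of searchable copies. The assumptions here are mine: that samples arrive as (dimensions, value) pairs, one per node, and that dividing by "searchable-copies" is the right correction for the ready-documents metric. The function name, group ids, and numbers are hypothetical.

```python
from collections import defaultdict


def ready_docs_per_group(samples, searchable_copies=1):
    """Sum a per-node document metric by groupId, then divide by searchable copies.

    `samples` is an iterable of (dimensions, value) pairs, one per node
    (an assumed input shape, not a Vespa API).
    """
    if searchable_copies <= 0:
        raise ValueError("searchable_copies must be positive")
    totals = defaultdict(float)
    for dimensions, value in samples:
        totals[dimensions["groupId"]] += value
    # Each document is ready on `searchable_copies` nodes within a group (assumption).
    return {group: total / searchable_copies for group, total in totals.items()}


# Hypothetical samples: two groups with two nodes each.
samples = [
    ({"groupId": "0"}, 100.0),
    ({"groupId": "0"}, 120.0),
    ({"groupId": "1"}, 110.0),
    ({"groupId": "1"}, 110.0),
]
print(ready_docs_per_group(samples, searchable_copies=2))
```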

Please let me know if this helps you accomplish what you want.

Best, -Yngve

pfrybar commented 10 months ago

Thanks for the quick response. I wasn't able to find a "groupId" dimension on any of the metrics, whether using the aggregated v2 metrics, the node-level v1 metrics, or the Prometheus endpoint.

I could try to do some aggregations on "distributor.vds.distributor.docsstored", but since we are using multiple content groups and I can't find a dimension to group by, I'm not sure how to do it in a generic way.

For example, here is /state/v1/metrics on a content node:

      {
        "name": "content.proton.documentdb.documents.total",
        "description": "The total number of documents in this documents db (ready + not-ready)",
        "values": {
          "average": 9743208.0,
          "sum": 116918496.0,
          "count": 12,
          "rate": 0.2,
          "min": 9743208,
          "max": 9743208,
          "last": 9743208
        },
        "dimensions": {
          "documenttype": "mydocument"
        }
      },
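A small sketch of how such an entry can be checked for the dimensions it carries. It parses only the single entry shown above (without the trailing comma) and makes no assumption about the rest of the response; the helper name `find_metric_values` is hypothetical.

```python
import json

# The entry shown above, copied without the trailing comma.
ENTRY_JSON = """
{
  "name": "content.proton.documentdb.documents.total",
  "description": "The total number of documents in this documents db (ready + not-ready)",
  "values": {
    "average": 9743208.0,
    "sum": 116918496.0,
    "count": 12,
    "rate": 0.2,
    "min": 9743208,
    "max": 9743208,
    "last": 9743208
  },
  "dimensions": {
    "documenttype": "mydocument"
  }
}
"""


def find_metric_values(entries, name, **dimensions):
    """Return the "values" dicts of entries matching the name and all given dimensions."""
    return [
        entry["values"]
        for entry in entries
        if entry.get("name") == name
        and all(entry.get("dimensions", {}).get(k) == v for k, v in dimensions.items())
    ]


entries = [json.loads(ENTRY_JSON)]
matches = find_metric_values(
    entries, "content.proton.documentdb.documents.total", documenttype="mydocument"
)
print(matches[0]["last"])                     # 9743208
print("groupId" in entries[0]["dimensions"])  # False
```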

yngveaasheim commented 10 months ago

For the distributor metric you should not split on the groupId dimension; instead, sum the metric across all groups. If you have multiple content clusters, you will need to aggregate per cluster using the clusterid dimension.
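A minimal sketch of that aggregation, assuming the samples are available as (dimensions, value) pairs that carry a "clusterid" dimension. The function name, cluster names, and values are hypothetical.

```python
from collections import defaultdict


def unique_docs_per_cluster(samples):
    """Sum a distributor metric over all samples, keyed by clusterid only."""
    totals = defaultdict(float)
    for dimensions, value in samples:
        # Any group dimension is deliberately ignored; only clusterid is used.
        totals[dimensions["clusterid"]] += value
    return dict(totals)


# Hypothetical samples from two content clusters.
docsstored = [
    ({"clusterid": "music"}, 500.0),
    ({"clusterid": "music"}, 700.0),
    ({"clusterid": "books"}, 300.0),
]
print(unique_docs_per_cluster(docsstored))
```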

Unfortunately, it seems metrics are currently only decorated with those dimensions in Vespa Cloud.

yngveaasheim commented 10 months ago

I will clear the assignee so that this can be discussed during our upcoming ticket scrub.