serverless / components

The Serverless Framework's new infrastructure provisioning technology — Build, compose, & deploy serverless apps in seconds...
https://www.serverless.com
Apache License 2.0

Proposal for a common metrics format #604

Open chrismcleod opened 4 years ago

chrismcleod commented 4 years ago

Component Metrics Format

@ac360 @eahefnawy @hkbarton @medikoo

Motivation

Each component needs the ability to report metrics in terms of the use case for that component. For example, a website component needs to see a request count and is less concerned with memory consumption. We want to build tools that can consume these metrics from any component going forward, without refactoring the tools or adding unique per-component metrics handling. To facilitate this, we need a common metrics format.

Proposal

Based on a mock provided by ac360, this is the format needed for each metric. There are essentially two data formats. First, a "stacked" format, where there is one set of x values and multiple series of y values. Second, a "basic" format that is a simple set of x, y data points. This structure borrows heavily from Highcharts, as they have hardened their definitions over quite some time.

This is an example response for a 7-day range for the express component. We want to be sure that we have all timestamps for the chart range, even if the values are 0. If ALL values are 0, return the empty chart.

// 7 day range example
export const response = {
  // the selected range for all the metrics.  This will be sent in the request
  // for the metrics and should be returned in the response
  rangeStart: '2020-04-15T23:49:03.240Z',
  rangeEnd: '2020-04-22T23:49:03.241Z',

  metrics: [
    {
      type: 'stacked-bar', // constant
      title: 'api requests & errors', // constant
      x: {
        type: 'timestamp', // constant
        values: [
          1587080943241,
          1587167343241,
          1587253743241,
          1587340143241,
          1587426543241,
          1587512943241,
          1587599343241,
        ], // This is an example of a 7 day range, so there are 7 UTC timestamps with the most recent last
      },
      y: [
        {
          name: 'requests', // constant
          type: 'count', // constant
          total: 8306, // total of all values for this series
          values: [1497, 1022, 1010, 1002, 1186, 1331, 1258], // count of normal requests over the time range.  Does not include errored requests
        },
        {
          name: 'errors', // constant
          type: 'count', // constant
          total: 1783, // total of all values for this series
          color: 'error', // constant
          values: [325, 356, 464, 222, 230, 70, 116], // count of errored requests over the time range.  Does not include normal requests
        },
      ],
    },
    {
      type: 'list-details-bar', // constant
      title: 'api path requests', // constant
      x: {
        type: 'string', // constant
        values: [
          'GET - /messages/:id',
          'GET - /messages/long/long/long/long/long/path',
          'POST - /messages',
        ], // all paths with > 0 requests during the time range
      },
      y: [
        {
          name: 'requests', // constant
          type: 'count', // constant
          total: 6000, // total number of requests across all paths for the time range
          values: [3000, 2000, 1000], // total number of requests per path over the time range
        },
        {
          name: 'avg latency', // constant
          type: 'duration', // constant
          total: 500, // maximum latency across all paths for the time range
          values: [120, 200, 500], // average latency in ms per path over the time range
        },
      ],
    },
    {
      type: 'multiline', // constant
      title: 'api latency', // constant
      x: {
        type: 'timestamp', // constant
        values: [
          1587080943241,
          1587167343241,
          1587253743241,
          1587340143241,
          1587426543241,
          1587512943241,
          1587599343241,
        ], // This is an example of a 7 day range, so there are 7 UTC timestamps with the most recent last
      },
      y: [
        {
          name: 'p50 latency', // constant
          type: 'duration', // constant
          total: 1800, // 50th percentile latency across all requests for the time range
          values: [100, 300, 200, 100, 300, 200, 600], // 50th percentile latency across all requests for the time period
        },
        {
          name: 'p95 latency', // constant
          type: 'duration', // constant
          total: 1700, // 95th percentile latency across all requests for the time range
          values: [200, 100, 200, 300, 300, 100, 500], // 95th percentile latency across all requests for the time period
        },
      ],
    },
    {
      type: 'list-flat-bar', // constant
      title: 'api errors', // constant
      color: 'error', // constant
      x: {
        type: 'string', // constant
        values: [
          'GET - /messages/:id',
          'GET - /messages',
          'POST - /messages',
          'GET /messages',
        ], // all paths with > 0 http errors during the time range
      },
      y: [
        {
          name: '400 Bad Request', // the http error code
          type: 'count', // constant
          total: 25, // total number of requests with this error across all paths for the time range
          values: [10, 15, 0, 0], // total number of requests with this error per path over the time range
        },
        {
          name: '401 Unauthorized',
          type: 'count',
          total: 25,
          values: [10, 15, 0, 1],
        },
        {
          name: '500 Internal Server Error',
          type: 'count',
          total: 45,
          values: [20, 15, 10, 20],
        },
      ],
    },
    {
      type: 'stacked-bar', // constant
      title: 'api 5xx errors', // constant
      x: {
        type: 'timestamp', // constant
        values: [
          1587080943241,
          1587167343241,
          1587253743241,
          1587340143241,
          1587426543241,
          1587512943241,
          1587599343241,
        ], // This is an example of a 7 day range, so there are 7 UTC timestamps with the most recent last
      },
      y: [
        {
          name: 'errors', // constant
          type: 'count', // constant
          total: 1783, // total number of 5xx errors over the time range
          color: 'error', // constant
          values: [325, 356, 464, 222, 230, 70, 116], // count of 500 errors per time period
        },
      ],
    },
    {
      type: 'stacked-bar', // constant
      title: 'function invocations & errors', // constant
      x: {
        type: 'timestamp', // constant
        values: [
          1587080943241,
          1587167343241,
          1587253743241,
          1587340143241,
          1587426543241,
          1587512943241,
          1587599343241,
        ], // This is an example of a 7 day range, so there are 7 UTC timestamps with the most recent last
      },
      y: [
        {
          name: 'requests', // constant
          type: 'count', // constant
          total: 8306, // total count of normal invocations for the whole time range
          values: [1497, 1022, 1010, 1002, 1186, 1331, 1258], // count of normal invocations per time period
        },
        {
          name: 'errors', // constant
          type: 'count', // constant
          total: 1783, // total count of errored invocations over the whole time range
          color: 'error',
          values: [325, 356, 464, 222, 230, 70, 116], // total count of errored invocations per time period
        },
      ],
    },
    {
      type: 'stacked-bar', // constant
      title: 'api 4xx errors', // constant
      x: {
        type: 'timestamp', // constant
        values: [
          1587080943241,
          1587167343241,
          1587253743241,
          1587340143241,
          1587426543241,
          1587512943241,
          1587599343241,
        ], // This is an example of a 7 day range, so there are 7 UTC timestamps with the most recent last
      },
      y: [
        {
          name: 'errors', // constant
          type: 'count', // constant
          total: 1783, // total count of all 4xx errors over the time range
          color: 'error',
          values: [325, 356, 464, 222, 230, 70, 116], // count of 4xx errors per time period
        },
      ],
    },
    {
      type: 'multiline', // constant
      title: 'function latency', // constant
      x: {
        type: 'timestamp', // constant
        values: [
          1587080943241,
          1587167343241,
          1587253743241,
          1587340143241,
          1587426543241,
          1587512943241,
          1587599343241,
        ], // This is an example of a 7 day range, so there are 7 UTC timestamps with the most recent last
      },
      y: [
        {
          name: 'p50 latency', // constant
          type: 'duration', // constant
          total: 1800, // 50th percentile latency across all invocations for the whole time range
          values: [100, 300, 200, 100, 300, 200, 600], // 50th percentile latency across all requests for the time period
        },
        {
          name: 'p95 latency', // constant
          type: 'duration', // constant
          total: 1700, // 95th percentile latency across all requests for the time range
          values: [200, 100, 200, 300, 300, 100, 500], // 95th percentile latency across all requests for the time period
        },
      ],
    },
    {
      type: 'list-default-bar', // constant
      title: 'top function errors', // constant
      color: 'error', // constant
      x: {
        type: 'string', // constant
        values: ['Error type a', 'Error type b', 'Error type c'], // the error messages that occurred the most during the whole time range
      },
      y: {
        type: 'count', // constant
        total: 135, // the total number of times the top error messages were seen over the whole time range
        values: [100, 25, 10], // the total number of times each error message was seen over the whole time range
      },
    },
  ],
}
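Because `type` is a constant per chart, a consuming client can stay free of per-component logic by dispatching on it. A minimal consumer sketch (the renderer bodies are hypothetical placeholders; only the `type` strings come from the proposal above):

```javascript
// Minimal consumer sketch: pick a renderer by the metric's `type`
// field. Renderer implementations are hypothetical placeholders;
// the `type` strings are the ones defined in the proposal above.
const renderers = {
  'stacked-bar': (m) => `bar chart: ${m.title}`,
  'multiline': (m) => `line chart: ${m.title}`,
  'list-details-bar': (m) => `list: ${m.title}`,
  'list-flat-bar': (m) => `list: ${m.title}`,
  'list-default-bar': (m) => `list: ${m.title}`,
  'empty-chart': (m) => `no data: ${m.title}`,
};

function renderMetrics(response) {
  return response.metrics.map((metric) => {
    const render = renderers[metric.type];
    if (!render) throw new Error(`unknown metric type: ${metric.type}`);
    return render(metric);
  });
}
```

Unknown types throw rather than render silently, so a new chart type added to the format surfaces immediately in older clients.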

Empty chart

{
  "title": "Some chart with no data",
  "type": "empty-chart"
}
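The "if ALL values are 0, return the empty chart" rule from the proposal could be applied by the producer with a small helper like this (the function name is illustrative; it handles both the array-of-series `y` and the single-object `y` used by `list-default-bar`):

```javascript
// If every y series is all zeros, collapse the metric into the
// empty-chart shape shown above; otherwise return it unchanged.
// Handles both y-as-array and y-as-object variants of the format.
function toChartOrEmpty(metric) {
  const series = Array.isArray(metric.y) ? metric.y : [metric.y];
  const allZero = series.every((s) => s.values.every((v) => v === 0));
  return allZero ? { title: metric.title, type: 'empty-chart' } : metric;
}
```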

Common time buckets

15 minutes

60 minutes

24 hrs

7 days
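Producing the zero-filled x-axis for one of these buckets might look like the sketch below (an assumption about how producers would do it, not part of the proposal). With day-sized buckets it reproduces the timestamps in the 7-day example above, most recent last:

```javascript
// Generate `count` UTC millisecond timestamps ending at `rangeEnd`,
// spaced `bucketMs` apart, most recent last. Emitting one timestamp
// per bucket guarantees the chart has every timestamp in the range,
// even when a bucket's value is 0.
function bucketTimestamps(rangeEnd, bucketMs, count) {
  const end = new Date(rangeEnd).getTime();
  return Array.from({ length: count }, (_, i) => end - (count - 1 - i) * bucketMs);
}
```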

chrismcleod commented 4 years ago

This is one example of how these metrics could be rendered:

[screenshot: instance - overview - data]

hkbarton commented 4 years ago

looks great! I'll cc tencent team

medikoo commented 4 years ago

@chrismcleod great job, looks very promising! Few comments:

  1. What will trigger generation of those metrics? Will our backend ask the component for them periodically? Or will components generate them on their own, on request, directly for the dashboard?

  2. Isn't total redundant below?

{
  name: 'requests', // constant
  type: 'count', // constant
  total: 8306, // total of all values for this series
  values: [1497, 1022, 1010, 1002, 1186, 1331, 1258], // count of normal requests over the time range.  Does not include errored requests
}
chrismcleod commented 4 years ago

@medikoo the metrics are available as a run action on each component. Each component can generate/cache its metrics however it wants.

A goal of this format is to remove as much logic from any consuming client as possible. The total here might, at this time, be a simple sum of the array; but I would rather the total be explicit than have every consuming client re-calculate it using that logic (which might change). What do you think?
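One way to get both properties (explicit totals for clients, no silent drift) would be a shared check run by producers or tests. A sketch, with an illustrative function name; note it only applies to `count` series, since for `duration` series the format defines `total` as a percentile or maximum rather than a sum:

```javascript
// Returns true when a count series' explicit `total` matches the sum
// of its `values`. A producer's test suite could run this to catch a
// stale or mistaken total before clients ever see it.
function totalMatchesValues(series) {
  const sum = series.values.reduce((acc, v) => acc + v, 0);
  return series.total === sum;
}
```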

medikoo commented 4 years ago

@medikoo the metrics are available as a run action on each component.

Ok, so you mean that a component is expected to respond with such metrics when sls metrics (or sls run) is run against it (?)

I have trouble understanding what triggers the generation of metrics and how often they'll be generated

The total here might, at this time, be a simple sum of the array; but I would rather the total be explicit rather than every consuming client need to re-calculate it using that logic (which might change). What do you think?

As long as it's redundant, I would remove it. Redundant data, in my view, is (1) confusing (why is it provided? maybe it doesn't necessarily reflect the sum of the array), (2) error-prone (if it doesn't match the sum of the array, which should be treated as the source of truth?), and (3) there's no real cost for a client to compute it on its own.

hkbarton commented 4 years ago

@medikoo the data behind the metrics is generated whenever there is an invocation of the related resources, e.g. whenever a cloud function is invoked, a log is generated. sls metrics will trigger the metrics function on the component, and the component will query whatever the cloud infrastructure provides for querying those invocation logs, e.g. AWS CloudWatch, Tencent metrics queries...
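The component side of that flow could be as thin as the sketch below: fetch provider datapoints and fold them into one metric in the common format. `fetchDatapoints` is a hypothetical stand-in for the CloudWatch / Tencent query, not a real API:

```javascript
// Rough component-side sketch. `fetchDatapoints` is a hypothetical,
// caller-supplied function standing in for the provider query; it is
// assumed to resolve to [{ timestamp, count }, ...], one per bucket.
async function buildInvocationsMetric(fetchDatapoints, rangeStart, rangeEnd) {
  const points = await fetchDatapoints(rangeStart, rangeEnd);
  return {
    type: 'stacked-bar',
    title: 'function invocations & errors',
    x: { type: 'timestamp', values: points.map((p) => p.timestamp) },
    y: [
      {
        name: 'requests',
        type: 'count',
        total: points.reduce((acc, p) => acc + p.count, 0),
        values: points.map((p) => p.count),
      },
    ],
  };
}
```

Only the mapping into the common shape is shared; each component keeps its own provider-specific query behind `fetchDatapoints`.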

eahefnawy commented 4 years ago

@medikoo To put it in simpler terms (if I understood it correctly), this is just a standard output format/structure for the metrics method on any component.

Each component would have different logic for actually gathering the data, but at the end of the day, they need to return the data in that format so that the frontend could consume it regardless of which component it came from.

Did I understand that correctly @chrismcleod ?

medikoo commented 4 years ago

@hkbarton @eahefnawy thanks for clarification.

So components will write logs to CloudWatch in the course of typical commands (deploy etc.)

Then there'll be a dedicated sls metrics command, on which the component should query the CloudWatch logs it generated and prepare metrics data for the given query. Right?

Do we have some strategy planned for scaling that? e.g. I can imagine that an active component can produce a significant amount of logs in a short time. Given that, it may be near impossible for the component to retrieve all the needed logs and generate metrics within a typical sls metrics call.

Usually generation of such metrics is backed by quite sophisticated tools, which are equipped to handle large amounts of input and derive the needed answers in a short time.

chrismcleod commented 4 years ago

@eahefnawy that is correct. Consumers could be "any" client. Including perhaps some basic CLI charts in a galaxy far far away ;)

This format will also eventually be a sub-property of a full custom dashboard description.

eahefnawy commented 4 years ago

@medikoo I think we already solved that problem in our current Framework Pro dashboard, no? 🤔

medikoo commented 4 years ago

@medikoo I think we already solved that problem in our current Framework Pro dashboard

If I understand the proposal correctly, the component will provide result chart data to the dashboard, and it's the dashboard front-end that will draw the charts as received.

I was wondering how that will work for intensive apps. e.g. I remember working for a client with thousands of users, which generated millions of CloudWatch logs every day. How can a component produce reliable chart data for such a case, e.g. an overview of a month?

@hkbarton settled me down a bit: on the Tencent side, the provider offers an elastic search mechanism over logs, so the component would query an already-reduced result set and wouldn't have to inspect logs one by one in its own capacity.

I wonder what the plan is for AWS; afaik CloudWatch doesn't provide such a feature out of the box (but I also have limited experience with CloudWatch)