RFC: Support Report Web Vitals as Opt-In CloudWatch Metrics

simonireilly commented 3 years ago

Is your feature request related to a problem? Please describe.

With reportWebVitals in a custom _app.js we can support recording Web Vitals as per docs: https://nextjs.org/docs/advanced-features/measuring-performance#web-vitals

Describe the solution you'd like

Support an /analytics endpoint.
Use the same site domain to avoid updating CSP or CORS.
Upload minimal data with async POST requests to an edge Lambda.
Configure the analytics Lambda to record custom metrics in CloudWatch, asynchronously transforming the minimal client data and formatting metrics and dimensions.
Metrics should support per-page analysis.
The metrics package should be decoupled, and its deployment should be optional configuration.

Describe alternatives you've considered

Two popular managed services for this:

These are good services, but they are paid services.

Additional context

Further support for alarms and regressions could be added by open extensibility.
Further support for regional dimensions might be wanted.

Background reading: https://developer.mozilla.org/en-US/docs/Learn/Performance/Measuring_performance

danielcondemarin commented 3 years ago

Thanks for raising @simonireilly.

I can see the value in having this. Could you provide a bit more detail about what you'd expect to have in CloudWatch? Ideally we would provide some level of feature parity with the newly announced Next.js Analytics.

My only concern is whether CloudWatch is geared well enough for this (especially in terms of Dashboards / Visualisations).

Adding a bit more detail about exactly which features you'd like to see, and how they would be implemented in CloudWatch, would help move things along more quickly.

simonireilly commented 3 years ago

@danielcondemarin, sure, I can elaborate.

Data Collection

It is probably best to begin with the implementation in NextJS source code.

https://www.github.com/vercel/next.js/tree/canary/packages%2Fnext%2Fclient%2Fperformance-relayer.ts

This performance-relayer.ts module is required in client/index. It is mounted in a useEffect hook to run after DOMContentLoaded has occurred.

Line 49 has the hardcoded endpoint for vercel-analytics.com. I would expect this to become a process.env value in the future, as hard-coding a proprietary endpoint is not OSS in my opinion.

This module observes the following metrics from the web-vitals package:

Cumulative Layout Shift (CLS)
First Contentful Paint (FCP)
First Input Delay (FID)
Largest Contentful Paint (LCP)
Time to First Byte (TTFB)

Custom CloudWatch Metrics

Capturing these as Custom CloudWatch Metrics would be the proposed solution. There could be a unique metric name for each of these five Web Vitals per deployment:

MyNextServerlessCLS
MyNextServerlessFCP
MyNextServerlessFID
MyNextServerlessLCP
MyNextServerlessTTFB
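
As a rough sketch of the naming scheme above: the helper below builds a per-deployment metric name from a prefix and a Web Vital name. The function names and the idea of passing the prefix as an argument are assumptions for illustration, not part of any existing package.

```typescript
// The five Web Vitals listed earlier in this comment.
const WEB_VITALS = ['CLS', 'FCP', 'FID', 'LCP', 'TTFB'] as const
type WebVitalName = typeof WEB_VITALS[number]

// Type guard: narrows an arbitrary string to one of the supported vitals.
function isWebVitalName(name: string): name is WebVitalName {
  return (WEB_VITALS as readonly string[]).indexOf(name) !== -1
}

// Hypothetical helper: `prefix` is a per-deployment name such as "MyNextServerless".
function metricNameFor(prefix: string, vital: string): string {
  if (!isWebVitalName(vital)) {
    throw new Error(`Unsupported web vital: ${vital}`)
  }
  return `${prefix}${vital}`
}

// One metric name per vital for a single deployment.
const allMetricNames = WEB_VITALS.map((v) => metricNameFor('MyNextServerless', v))
```

Rejecting unknown names keeps a misbehaving client from creating arbitrary custom metrics, each of which is billed separately.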

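The MetricDatum value can be taken from the value field of the body sent by the Next.js performance-relayer.ts. Note that web-vitals reports timings in milliseconds (CLS is unitless), so the unit should be milliseconds rather than seconds.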

These can be visualised at whatever percentiles the user wants, but a default of P75 (the recommended benchmark, see https://web.dev/vitals/) would be ideal for any dashboard.
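
To make the P75 default concrete, here is a minimal sketch that builds a CloudWatch dashboard body with one single-value widget per metric, each using the p75 statistic. The namespace, metric names, region and widget sizes are assumptions, and the metrics are assumed to be published without dimensions.

```typescript
// Subset of the CloudWatch dashboard widget structure used in this sketch.
type MetricWidget = {
  type: 'metric'
  width: number
  height: number
  properties: {
    view: 'singleValue' | 'timeSeries'
    metrics: string[][]
    stat: string
    period: number
    region: string
    title: string
  }
}

// Builds a number widget showing the 75th percentile over one day.
function p75Widget(namespace: string, metricName: string, region: string): MetricWidget {
  return {
    type: 'metric',
    width: 6,
    height: 3,
    properties: {
      view: 'singleValue',
      metrics: [[namespace, metricName]], // no dimensions: the global metric
      stat: 'p75',
      period: 86400, // seconds in a day
      region,
      title: `${metricName} p75`,
    },
  }
}

// Assumed metric names, following the per-deployment naming idea above.
const vitalMetricNames = ['CLS', 'FCP', 'FID', 'LCP', 'TTFB'].map((v) => `MyNextServerless${v}`)

// JSON string suitable for use as a dashboard body.
const dashboardBody = JSON.stringify({
  widgets: vitalMetricNames.map((name) => p75Widget('NextJsApplication', name, 'us-east-1')),
})
```

Other percentiles only require changing the stat string (for example 'p90'), so the default is easy to override.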

To enhance these metrics it would be possible to add some dimensions.

Device='mobile' | 'tablet' | 'desktop'
Page=${body.page} (Next page, e.g. /[slug].js)
Region=${process.env.AWS_REGION} (default lambda env)

It is worth noting that you cannot aggregate custom metrics across multiple dimensions. CloudWatch treats each unique combination of dimensions as a separate metric, so being overly specific means you cannot re-aggregate to get the global metric.
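
One possible way around the aggregation limit, sketched below, is to publish each observation more than once: one datum with no dimensions (the global metric) plus one datum per single dimension. The types and the toData helper are hypothetical and only mirror the shape of a CloudWatch MetricDatum. Every extra datum is a separate custom metric, so this trades cost for flexibility.

```typescript
type Dimension = { Name: string; Value: string }
type Datum = { MetricName: string; Value: number; Dimensions?: Dimension[] }

// Hypothetical helper: fan one observation out into a global datum
// plus one datum for each individual dimension.
function toData(metricName: string, value: number, dims: Record<string, string>): Datum[] {
  const data: Datum[] = [{ MetricName: metricName, Value: value }]
  for (const name of Object.keys(dims)) {
    data.push({
      MetricName: metricName,
      Value: value,
      Dimensions: [{ Name: name, Value: dims[name] }],
    })
  }
  return data
}

// Example: an LCP reading from a mobile visit to the /[slug] page.
const lcpData = toData('MyNextServerlessLCP', 1200, { Page: '/[slug]', Device: 'mobile' })
```

With this layout the global P75 comes from the dimensionless metric, while per-page and per-device views each read their own single-dimension metric.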

Presentation (CloudWatch Dashboard)

These dashboards are somewhat limited, but the main components can be achieved:

A number-type widget with the real user index (1-100). It must be possible to determine this from the metrics sent, as Vercel receives nothing other than what Next sends. This will just involve a little algebra, which can be set up as a custom formula for a CloudWatch trace using the other metrics.
CloudWatch graphs for each Web Vital metric with configured percentiles.
Potentially there will be either a column or a row layout to separate mobile/tablet/desktop.

The last piece is the pages section. Having this dimension will be helpful I think, but visualizing so many dimensions might be overload.

Some trial and error might be required.

Hope that makes sense. The architecture would be as described previously.

API endpoint for receiving metrics
Extract Transform web vital to CloudWatch metric with dimensions in lambda

danielcondemarin commented 3 years ago

Hey @simonireilly Thanks for the great level of detail!

Line 49 has the hardcoded endpoint for vercel-analytics.com. I would expect this to be a process.env in the future as this proprietary endpoint hard coding is not OSS in my opinion.

Surprised to see this! Sounds like we need to raise an issue in Next.js first to sort that out.

Also might be worth thinking if the distributed nature of Lambda@Edge CloudWatch Logs affects anything!

simonireilly commented 3 years ago

Sounds good, I have opened an issue, we will see if there is any desire for the framework to make the change.

https://github.com/vercel/next.js/issues/18907

Implementation

Writing to CloudWatch from Lambda@Edge will be fast over the AWS optimized network.
Ensure keep-alive HTTPS connections where possible in Lambda@Edge with aws-sdk.

This is a simple implementation in pages/api/v1/vitals:

import { NextApiRequest, NextApiResponse } from 'next'
import CloudWatch, { MetricDatum } from 'aws-sdk/clients/cloudwatch';

const cloudWatchClient = new CloudWatch({ apiVersion: '2010-08-01' })

const params = (webVital: WebVital): MetricDatum => ({
  MetricName: webVital.event_name,
  Dimensions: [
    {
      Name: 'NextJSPage',
      Value: webVital.page
    },
  ],
  Unit: 'Milliseconds',
  Value: parseFloat(webVital.value)
});

const handler = async (req: NextApiRequest, res: NextApiResponse) => {
  try {
    const webVital: WebVital = req.body
    const metricData = params(webVital)

    // Publish the single datum to the application's namespace.
    await cloudWatchClient.putMetricData({
      MetricData: [metricData],
      Namespace: 'NextJsApplication'
    }).promise()

    // Set the status explicitly; res.send(200) would send 200 as the response body.
    return res.status(200).end()
  } catch (err) {
    console.error(err)
  }
  // Only reached when publishing the metric failed.
  return res.status(422).json({
    error: 'Failed to send Metrics, check server/lambda logs for details'
  })
}

export default handler

type WebVital = {
  dsn: string
  id: string
  page: string
  href: string
  event_name: string
  value: string
  speed: string
}
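
The handler above trusts req.body as-is. A minimal sketch of validating the body before building the MetricDatum might look like the following. The parseWebVital helper is hypothetical; it assumes only the page, event_name and value fields from the WebVital type are needed.

```typescript
type ParsedVital = { name: string; page: string; value: number }

// Hypothetical helper: returns null when the body cannot be turned into a metric,
// so the caller can respond with 422 without calling CloudWatch.
function parseWebVital(body: unknown): ParsedVital | null {
  if (typeof body !== 'object' || body === null) return null
  const { page, event_name, value } = body as Record<string, unknown>
  if (typeof page !== 'string' || typeof event_name !== 'string') return null
  // The relayer sends the value as a string; accept numbers too.
  const parsed = typeof value === 'number' ? value : parseFloat(String(value))
  if (!Number.isFinite(parsed)) return null
  return { name: event_name, page, value: parsed }
}

const okVital = parseWebVital({ page: '/[slug]', event_name: 'LCP', value: '1234.5' })
const badVital = parseWebVital({ page: '/[slug]', event_name: 'LCP', value: 'abc' })
```

Without a check like this, parseFloat on a malformed value yields NaN, which would be passed straight into the PutMetricData request.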

Outputs

You receive each Web Vital, for each page, as a metric that can be aggregated using any of the CloudWatch functions. A simple number board below gives the p75 of all metrics for all pages over a day in all regions.

dphang commented 3 years ago

Looks good, I would say to use the AWS SDK v3 for this though, since it is modular it has very low single-digit ms cold start times. We are using that for S3 calls within the handler.

For the CloudWatch API call, will you be sending to a single region, or is it distributed to the closest region from where the Lambda@Edge was invoked?

simonireilly commented 3 years ago

Looks good, I would say to use the AWS SDK v3 for this though, since it is modular it has very low single-digit ms cold start times. We are using that for S3 calls within the handler.

👍

For the CloudWatch API call, will you be sending to a single region, or is it distributed to the closest region from where the Lambda@Edge was invoked?

That is up for discussion. It would be good to know whether cross-region metrics are desirable before committing to an architecture and its cost.

Options I would say are:

Store metrics in each calling region.
Store metrics in us-east-1 where the edge lambda source lives.
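
The two options could be expressed as a small configuration switch, sketched below. The strategy names and the resolveMetricsRegion helper are made up for illustration; AWS_REGION is the default Lambda environment variable mentioned earlier in the thread.

```typescript
// 'calling-region': store metrics wherever the edge function ran.
// 'home-region': always store metrics in us-east-1.
type RegionStrategy = 'calling-region' | 'home-region'

const HOME_REGION = 'us-east-1'

// Hypothetical helper: picks the region for the CloudWatch client.
function resolveMetricsRegion(
  strategy: RegionStrategy,
  env: Record<string, string | undefined>
): string {
  if (strategy === 'home-region') return HOME_REGION
  // Fall back to the home region if AWS_REGION is missing or empty.
  return env.AWS_REGION || HOME_REGION
}

const callingRegion = resolveMetricsRegion('calling-region', { AWS_REGION: 'eu-west-2' })
const homeRegion = resolveMetricsRegion('home-region', { AWS_REGION: 'eu-west-2' })
```

In a Lambda the second argument would be process.env, and the result would be passed as the region option when constructing the CloudWatch client.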

If latency is to be truly minimised then firing a lambda asynchronously would be the best bet. This means we just fire the body to the API, we don't wait for the cold boot, or HTTPS handshake between Lambda and CloudWatch API, or anything really. Lambda will handle queuing this up and retrying.
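
For reference, an asynchronous invocation uses the Lambda Invoke API with InvocationType set to 'Event'. The sketch below only builds the request parameters; the function name is a placeholder, and whether this call is practical from Lambda@Edge is exactly the open question here.

```typescript
// Parameters for an asynchronous ("fire and forget") Lambda invocation.
type AsyncInvokeParams = {
  FunctionName: string
  InvocationType: 'Event'
  Payload: string
}

// Hypothetical helper: wraps the raw request body for a metrics-processing function.
function asyncInvokeParams(functionName: string, body: unknown): AsyncInvokeParams {
  return {
    FunctionName: functionName,
    InvocationType: 'Event', // Lambda queues the event and returns immediately
    Payload: JSON.stringify(body),
  }
}

// 'web-vitals-processor' is a placeholder name, not an existing function.
const invokeParams = asyncInvokeParams('web-vitals-processor', {
  page: '/[slug]',
  event_name: 'LCP',
  value: '1234.5',
})
```

The API endpoint would pass these parameters to the Lambda client's invoke call and respond to the browser as soon as the event is accepted.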

@dphang Is this possible on the edge? I am not sure it is.

Final thing, it's a no from NextJS for making this extensible https://github.com/vercel/next.js/issues/18907#issuecomment-723300381

With that being said this is potentially a non-starter as it would require custom implementation. I don't believe that is an attractive proposition but if there is still an interest in the feature then it can be done 🤷‍♂️

dphang commented 3 years ago

I think Lambda@Edge is pretty similar to Lambda right now; there aren't many limitations anymore (for origin handlers) save for the lack of environment variables and provisioned concurrency.

I think you can make an async call and not wait for the response. But I thought this is a reporting API from the client side anyway, i.e. it doesn't block rendering of the page itself? I haven't used this new feature yet so I'm not as familiar with it.

I did see from here that you can send the metrics to any endpoint, e.g. the example they gave was for Google Analytics. I guess you want to build this into the Lambda@Edge itself to send data to CloudWatch instead?

danielcondemarin commented 3 years ago

With that being said this is potentially a non-starter as it would require custom implementation. I don't believe that is an attractive proposition but if there is still an interest in the feature then it can be done 🤷‍♂️

What do you think about introducing our own performance-relayer client implementation? To start with, it could be similar to, or the same as, the Next.js Vercel one.

So if users opt-in to the Analytics functionality we'd bootstrap the backend Analytics endpoint in Lambda@Edge for them.

component: @sls-next/serverless-component
inputs:
  analytics: true

Client side they could install an NPM module, e.g.

// pages/_app.js
export { default as reportWebVitals } from '@sls-next/analytics-client';

Later on, we could provide some way to allow for extensibility.

I think you can make an async call and not wait for the response. But I thought this is a reporting API from the client side anyway, i.e. it doesn't block rendering of the page itself? I haven't used this new feature yet so I'm not as familiar with it.

That's right @dphang, it doesn't block rendering. Ideally we would support the Beacon API, which is generally more efficient than using fetch directly and still delivers metrics that are sent while the page is being unloaded (e.g. on an external link click).
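
A minimal sketch of what such a client could do: prefer navigator.sendBeacon and fall back to fetch with keepalive. The sendMetric function, the BrowserLike type and the /analytics path are assumptions; the environment is passed in as a parameter so the logic can be exercised outside a browser.

```typescript
type Metric = { id: string; name: string; value: number }

// The small slice of the browser globals this sketch relies on.
type BrowserLike = {
  navigator?: { sendBeacon?: (url: string, data: string) => boolean }
  fetch?: (
    url: string,
    init: { method: string; body: string; keepalive: boolean }
  ) => Promise<unknown>
}

// Returns true if the metric was handed to the browser for delivery.
function sendMetric(
  metric: Metric,
  endpoint: string,
  env: BrowserLike = globalThis as unknown as BrowserLike
): boolean {
  const body = JSON.stringify(metric)
  const nav = env.navigator
  if (nav && typeof nav.sendBeacon === 'function') {
    // The browser queues the request and sends it even during page unload.
    return nav.sendBeacon(endpoint, body)
  }
  if (typeof env.fetch === 'function') {
    // keepalive lets the request outlive the page when sendBeacon is unavailable.
    env.fetch(endpoint, { method: 'POST', body, keepalive: true }).catch(() => undefined)
    return true
  }
  return false
}
```

In pages/_app.js this could be wired up as a reportWebVitals function that calls sendMetric(metric, '/analytics'), keeping the request on the same site domain as proposed at the top of the thread.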

benjaminkay93 commented 3 years ago

just wanting to drop in and say this would be mega cool, thanks for all the hard work ^

serverless-nextjs / serverless-next.js