scoutapp / roadmap

The public roadmap for Scout application monitoring.
https://scoutapp.com

Scaling transaction trace storage with the number of endpoints #70

Open itsderek23 opened 6 years ago

itsderek23 commented 6 years ago

Currently we store a max of 10 transaction traces per app, per-minute.

This has issues in the following scenarios:

  1. An app has a large number of web endpoints. For example, if an app has 1k unique endpoints and we collect 10 traces, one per endpoint, we'd cover at most 1% of endpoints in a given minute. If an app has 100 endpoints, we cover at most 10%.

  2. Zooming into a small time slice (e.g., 5 minutes) to examine an outlier. There are fewer traces to examine in a small period of time. This is more obvious when an app has a large number of endpoints.

Look at increasing the number of transaction traces we store, basing it on a percentage of the number of uniquely named endpoints in an app, with a reasonable max. Initially this can just be a server-side change.
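A rough sketch of what that scaling could look like, in Python for illustration only. The 5% figure, the cap of 50, and the function names `trace_limit` and `coverage` are assumptions made up for this example, not values or code from Scout; only the floor of 10 traces per minute comes from the description above.

```python
BASE_TRACES_PER_MINUTE = 10  # the current fixed per-app, per-minute cap


def trace_limit(endpoint_count: int, percent: int = 5, max_traces: int = 50) -> int:
    """Per-minute trace cap scaled to the number of uniquely named endpoints.

    `percent` and `max_traces` are hypothetical tuning knobs.
    """
    # Integer ceiling of endpoint_count * percent / 100 (avoids float rounding).
    scaled = (endpoint_count * percent + 99) // 100
    # Never go below today's cap, never above the "reasonable max".
    return max(BASE_TRACES_PER_MINUTE, min(scaled, max_traces))


def coverage(endpoint_count: int, traces: int) -> float:
    """Best-case share of endpoints covered in one minute (one trace per endpoint)."""
    if endpoint_count <= 0:
        return 0.0
    return min(1.0, traces / endpoint_count)


if __name__ == "__main__":
    for endpoints in (100, 300, 1000):
        limit = trace_limit(endpoints)
        print(endpoints, limit, f"{coverage(endpoints, limit):.1%}")
```

With these assumed numbers, an app with 1k endpoints would go from 10 to 50 traces per minute, while an app with 100 endpoints would stay at the existing 10.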

dlanderson commented 6 years ago

Do we have any data about the number of endpoints across apps?

cschneid commented 6 years ago

The histogram peak is at ~300 endpoints, trailing off pretty fast after that with only a very small handful of customers having more than 1000 endpoints.

itsderek23 commented 6 years ago

This change has been deployed for a couple of accounts. Additionally, we're tracking analytics on how often we return zero traces at key interactions.

From spot-checking data, I'm not seeing a significant improvement, esp. on the database query list. This may be caused by not including a "% time consumed" dimension in our trace scoring algorithm (we include the response time). In one app, for example, I found a couple of cases where zero traces were collected over a 1-hour period for the top 8 most time-consuming endpoints (likely the endpoints you would most want to access).

Expensive queries are more likely to be called from expensive endpoints.

Two thoughts:

  1. Incorporate a "time consumed" dimension in our algorithm
  2. If zero traces are found, fetch over a longer period with a warning to the user. Return something if we can. It's common for behavior to repeat itself.
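The second idea could be sketched roughly as follows, in Python for illustration only. The `fetch` callable, the `fetch_with_fallback` name, and the widening factors are all hypothetical; this is one possible shape for the fallback, not a description of Scout's implementation.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Sequence, Tuple

# Hypothetical fetcher: given (start, end), returns the stored traces in that window.
FetchTraces = Callable[[datetime, datetime], List[dict]]


def fetch_with_fallback(
    fetch: FetchTraces,
    start: datetime,
    end: datetime,
    widen_factors: Sequence[int] = (1, 3, 12),
) -> Tuple[List[dict], bool]:
    """Try the selected window first, then wider windows centered on it.

    Returns (traces, widened). `widened` is True when the traces came from a
    window larger than the one the user selected, so the UI can show a warning.
    """
    half = (end - start) / 2
    center = start + half
    for factor in widen_factors:
        traces = fetch(center - half * factor, center + half * factor)
        if traces:
            return traces, factor != 1
    # Nothing found at any width; there is nothing to warn about.
    return [], False
```

The caller would display the warning only when the second element of the result is true.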

Generally: when a transaction is collected from a low-volume endpoint and the response time is fast / moderate, it's less likely to be acted upon. It's just not that significant. Very slow requests are still interesting (and we account for that).
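To make the "time consumed" idea concrete, here is a minimal Python sketch of a score that blends response time with the endpoint's share of total app time. Everything here is an assumption for illustration: the `Candidate` fields, the 2-second `slow_ms` threshold, and the 50/50 weighting are invented, and this is not the existing scoring algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    endpoint: str
    response_time_ms: float        # duration of this particular request
    endpoint_total_time_ms: float  # total time spent in this endpoint over the period


def score(
    c: Candidate,
    app_total_time_ms: float,
    slow_ms: float = 2000.0,            # hypothetical "very slow" threshold
    weight_time_consumed: float = 0.5,  # hypothetical weighting
) -> float:
    """Blend request slowness with the endpoint's share of total app time."""
    slowness = min(c.response_time_ms / slow_ms, 1.0)
    share = c.endpoint_total_time_ms / app_total_time_ms if app_total_time_ms > 0 else 0.0
    return (1 - weight_time_consumed) * slowness + weight_time_consumed * share


def pick_traces(candidates: List[Candidate], app_total_time_ms: float, limit: int) -> List[Candidate]:
    """Keep the `limit` highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda c: score(c, app_total_time_ms), reverse=True)
    return ranked[:limit]
```

Under this weighting, a very slow request on a rarely hit endpoint still ranks highly, a moderate request on a time-dominant endpoint ranks next, and a fast request on a low-volume endpoint ranks last.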

itsderek23 commented 6 years ago

We've deployed an update to address:

Zooming into a small time slice (e.g., 5 minutes) to examine an outlier. There are fewer traces to examine in a small period of time. This is more obvious when an app has a large number of endpoints.

2 areas:

  1. When clicking on a db query, this increases the timeframe if no traces are found in the selected timeframe:

image

  2. When viewing traces on an endpoint or background job without zooming, the timeframe is also increased if no traces are found:

image