Trace Explorer - Filter traces by response time, custom context, endpoint, etc.

lmansur commented 6 years ago

I have a custom context to differentiate web requests from API requests. It would be very useful if I could filter my traces by that specific context, since most of the time I want to focus on improving web-related issues first.

This would also be very useful when trying to find a performance issue for a specific User, since I also have a context with their information.

itsderek23 commented 6 years ago

@lmansur - agreed. Many performance issues are directly related to the size of the data being operated on. This is often correlated with the current user in the session, the current account, etc.

We've started working on a POC for this. No ETA yet, but it's a focus for us.

itsderek23 commented 6 years ago

We're moving along well on this.

A couple of tools we're excited about for filtering traces, which are a multi-dimensional dataset:

These let you filter data in realtime, making exploration significantly faster than constructing queries that require server-side execution.
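To illustrate why filtering an already-loaded dataset is fast, here is a minimal Python sketch that filters traces in memory along several dimensions at once. The `Trace` fields, the `filter_traces` helper, and the sample values are hypothetical; this is not Scout's implementation, nor the API of the tools mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    # Hypothetical trace record; the field names are assumptions for illustration.
    endpoint: str
    response_ms: float
    context: dict = field(default_factory=dict)


def filter_traces(traces, endpoint=None, min_ms=None, max_ms=None, context=None):
    """Return the traces that match every filter that is set (filters are ANDed)."""
    context = context or {}
    result = []
    for t in traces:
        if endpoint is not None and t.endpoint != endpoint:
            continue
        if min_ms is not None and t.response_ms < min_ms:
            continue
        if max_ms is not None and t.response_ms > max_ms:
            continue
        # Every requested context key must be present with the requested value.
        if any(t.context.get(k) != v for k, v in context.items()):
            continue
        result.append(t)
    return result


traces = [
    Trace("UsersController#show", 120.0, {"request_type": "web", "user_id": 1}),
    Trace("UsersController#show", 2400.0, {"request_type": "web", "user_id": 2}),
    Trace("Api::ItemsController#index", 900.0, {"request_type": "api", "user_id": 2}),
]

# Slow web requests only: combines a response-time filter with a custom-context filter.
slow_web = filter_traces(traces, min_ms=500, context={"request_type": "web"})
```

Because the whole dataset is already in memory, changing any one filter only requires another pass over the list rather than a round trip to the server.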

lmansur commented 6 years ago

Very interesting tools, thank you for sharing and keeping the issue up to date!

itsderek23 commented 6 years ago

Here's a video of the current state - this isn't yet exposed in the UI, but you can get a flavor for the interaction:

[Screen recording]

lmansur commented 6 years ago

Thank you, Derek!

itsderek23 commented 6 years ago

@lmansur @qrush @pjuanda @justinstern @nathansamson @jonzlin95 - this is now available under our Tech Preview Program. You'll see a new "Traces" link at the top of the app nav. Click this to access Trace Explorer:

There are numerous rough edges, but we've been getting a lot of value from Trace Explorer internally and figure others will too. We'll work through these issues as usual.

In other words 👇

Share your initial feedback via this issue or by emailing support@scoutapp.com!

nathansamson commented 6 years ago

My 2 cents:

A) Context filters: you only see the top 5, and there is no easy way to discover the other values.
B) Context filters: you can't select default context (e.g. node).
C) There is a weird bug when clicking on the "By Response Time" diagram (to select only certain values): it actually "detects" the click around 50% further along in the diagram. Once you have the estimated zone, you can drag it to the right location. (Firefox, not tested in other browsers.)
D) Selecting all values except one (or a few) does not seem to be possible. I would expect this to work with ctrl-click on the selected value (which would deselect it).

But it's a great feature, and it has helped us tremendously in spotting a few pages that act up sometimes (but not often enough to impact the averages too much).

itsderek23 commented 6 years ago

> A) Context filters: you only see the top 5, and no easy way to discover the other values

See #53. Makes sense.

> B) Context filters: You can't select default context (eg node)

Anything you are missing besides the node name?

> C) There is a weird bug when clicking on the "By Response Time" diagram (to select only certain values) it actually "detects" the click around 50% further in the diagram. Once you have the estimated zone, you can drag it to the right location. (Firefox, not tested in other browsers)

That is weird. See #52.

> D) Selecting all values except one (or a few) does not seem to be possible. I would expect this to work with ctrl-click on the selected value (and it would deselect it)

Makes sense. See https://github.com/scoutapp/roadmap/issues/54.

Please follow these specific issues (and 👍 if you're interested) for updates.

nathansamson commented 6 years ago

Regarding B): I think node is the most important one. Others include:

URI (but that is handled mostly by the endpoint)
Time since startup, but that does not seem too helpful either
User IP, which we already have
Git revision, which might be helpful to see if a given deploy has all the slow requests

itsderek23 commented 6 years ago

Thx @nathansamson.

nathansamson commented 6 years ago

Did this change recently? Today or so?

It is now displaying the user IDs as if they have an ordering. Previously they were displayed like any other field, which made more sense to me.

itsderek23 commented 6 years ago

> Did this change recently? Today or so?

Yes.

See #63 for a proposed fix.

nathansamson commented 6 years ago

Another issue I noticed today.

When I view traces for the past 12h, the trace with the max duration is 27s. When I view traces for the past 6h, I get a trace with a max duration of 70s (and it did happen a few hours ago, not in the last minute).

How is that possible? The only logical explanation I can imagine is the "1,000 selected out of 1,000 traces" limit: since we have way more than 1,000 traces in the 12h view, it only takes the first 1,000, which did not include the later (very slow) ones...

What I would like to be able to do is check, every day, ALL traces from the past day that take longer than a given threshold, so I can optimize them. Right now they are lost if the first few traces of the day are quick...

I can understand that, from a perf POV, it is not possible to load 10k traces in the view, but then you should let me pre-filter them (everything with a response time greater than X, for example).

PS: If you prefer that I create new issues directly instead of commenting on this one, I am happy to do so.
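The pre-filter idea above can be sketched in a few lines of Python. This is only an illustration on synthetic data: the dictionary keys, the `first_n` and `prefilter_then_cap` helpers, and the 10-second threshold are assumptions, and the "first N" cap is the hypothesis from this comment, not a confirmed description of how Trace Explorer selects traces.

```python
def first_n(traces, limit):
    """Naive cap: keep the first `limit` traces in arrival order."""
    return traces[:limit]


def prefilter_then_cap(traces, min_ms, limit):
    """Keep only traces at or above `min_ms`, then apply the cap."""
    slow = [t for t in traces if t["response_ms"] >= min_ms]
    return slow[:limit]


# Synthetic data: 5,000 fast traces followed by one very slow trace.
traces = [{"id": i, "response_ms": 100.0} for i in range(5000)]
traces.append({"id": 5000, "response_ms": 70000.0})

capped = first_n(traces, 1000)
filtered = prefilter_then_cap(traces, min_ms=10000, limit=1000)

# With the naive cap, the slow trace never reaches the view.
max_capped = max(t["response_ms"] for t in capped)
# With the pre-filter, it is the only trace left.
max_filtered = max(t["response_ms"] for t in filtered)
```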

itsderek23 commented 6 years ago

> How is that possible?

We select a 1k sampling of all traces over the time period. Over a 6-hour period, up to 3.6k traces could be collected.

Mind creating a separate issue? I can see two modes to start:

Random sample (best for diversity)
Slowest
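As a rough sketch of how these two modes differ, the Python below selects up to 1,000 traces from a larger set either uniformly at random or by highest response time. The function names, the `response_ms` key, and the synthetic data are hypothetical and do not reflect how Scout implements sampling.

```python
import heapq
import random


def random_sample(traces, limit, seed=None):
    """Uniform random sample of at most `limit` traces (best for diversity)."""
    rng = random.Random(seed)
    if len(traces) <= limit:
        return list(traces)
    return rng.sample(traces, limit)


def slowest(traces, limit):
    """The `limit` traces with the highest response time, slowest first."""
    return heapq.nlargest(limit, traces, key=lambda t: t["response_ms"])


# Synthetic data: 3,600 traces whose response time equals their id.
traces = [{"id": i, "response_ms": float(i)} for i in range(3600)]

sample = random_sample(traces, 1000, seed=42)
worst = slowest(traces, 1000)
```

A random sample can miss a rare outlier entirely, while the slowest mode always includes it at the cost of hiding what typical requests look like.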

nathansamson commented 6 years ago

Created issue #64 for this.

Just out of curiosity. Why the 3.6k limit? Where is that being controlled? Is the client only sending one detailed trace per 10 seconds?

itsderek23 commented 6 years ago

> Created issue #64 for this.

Thanks!

> Just out of curiosity. Why the 3.6k limit? Where is that being controlled? Is the client only sending one detailed trace per 10 seconds?

The agent sends up to 10 per minute. We have an algorithm to determine interesting traces (both on the client and on the server); the server then takes traces across all hosts and selects up to 10 per minute per app.

Collecting detailed traces adds more overhead, so this is sampled.
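The 3.6k figure follows directly from the per-minute cap: 10 traces per minute over 6 hours (360 minutes) is 3,600. A small Python check, where `max_traces` and its parameter names are made up for illustration:

```python
def max_traces(hours, per_minute=10):
    """Upper bound on detailed traces per app over a window, given a per-minute cap."""
    return int(hours * 60 * per_minute)


six_hours = max_traces(6)
twelve_hours = max_traces(12)
```

Here `six_hours` is 3600 and `twelve_hours` is 7200, so a 1k sample covers at most roughly 28% of the available traces in a 6-hour window and 14% in a 12-hour window.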

itsderek23 commented 6 years ago

We've added a chart for memory allocations.

This needs #64 so you can view the worst performers.

doutatsu commented 3 years ago

No updates since 2018? Is this still tech preview? Shouldn't this be closed now 🤔

scoutapp / roadmap

Trace Explorer - Filter traces by response time, custom context, endpoint, etc. #33