google / trace-viewer

https://github.com/catapult-project/catapult/tree/master/tracing#readme
487 stars 84 forks source link

Slice events v2: Place slices into the slice hierarchy correctly even when ts doesn't advance #1131

Closed yuhao closed 9 years ago

yuhao commented 9 years ago

By tiny slices, I mean instant events or X events with duration of 0.

Currently, there are cases where tiny slices are not correctly built into the hierarchy.

As an example, in this trace (https://github.com/natduca/measurmt-traces/blob/master/mobile/googlemaps_ipc_newbinding.html), there is an 'X' event TaskGraphRunner::CollectCompletedTasks (starting at 1,935.145 ms) with duration of 0, and an 'I' event DidUpdateVisibleTiles right (starting at 1,935.175 ms). Both tiny slices are misplaced with respect to the entire slice hierarchy.

yuhao commented 9 years ago

The reason this bug is important to fix is that with the new flow API, we will inevitably create some tiny slices where flows bind. Incorrect placement of tiny slices is important to reason about precise causality.

@natduca @dj2 @vmpstr

natduca commented 9 years ago

Do we have a reduced test case?

philippebeaudoin commented 9 years ago

Let me try to build one.

philippebeaudoin commented 9 years ago

Here's the reduced trace. Looking at the code I find no issue with "TaskGraphRunner::CollectCompletedTasks", both on the original and reduced trace it appears below "ZeroCopyTileTaskWorkerPool::CheckForCompletedTasks" as expected. However, looking at the code I would expect "DidUpdateVisibleTiles" to appear under "TileManager::UpdateVisibleTiles" although at the moment it's under "ZeroCopyTileTaskWorkerPool::CheckForCompletedTasks".

My proposal to solve this problem with be to have events include a "depth" parameter indicating how many slices are open above them in their current thread. When present this parameter would help inform subslice creation to ensure the causality is correctly reflected.

{"traceEvents":[ {"pid":27369,"tid":27369,"ts":326459134461,"ph":"X","cat":"cc","name":"TileManager::UpdateVisibleTiles","args":{},"dur":31,"tdur":30,"tts":1964871}, {"pid":27369,"tid":27369,"ts":326459134461,"ph":"X","cat":"cc","name":"ZeroCopyTileTaskWorkerPool::CheckForCompletedTasks","args":{},"dur":31,"tdur":30,"tts":1964871}, {"pid":27369,"tid":27369,"ts":326459134461,"ph":"X","cat":"cc","name":"TaskGraphRunner::CollectCompletedTasks","args":{},"dur":0,"tdur":0,"tts":1964871}, {"pid":27369,"tid":27369,"ts":326459134492,"ph":"I","cat":"cc","name":"DidUpdateVisibleTiles","args":{"stats":{"canceled_count":0,"completed_count":0}},"tts":1964901,"s":"t"} ]}

natduca commented 9 years ago

Can you fully reduce this? shift the ts to zero, simplify the pids, etc, and remove the chrome specific stuff, remove the tts since its not important for the repro, etc. Making it easy to read is good.

We should seek solutions that do not adjust the way tracing c++ side works. For instance, the position in a trace file seems useful.

yuhao commented 9 years ago

{"traceEvents":[ {"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"A","dur":31,"tdur":30}, {"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"B","dur":31,"tdur":30}, {"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"C","dur":0,"tdur":0}, {"pid":0,"tid":1,"ts":31,"ph":"I","cat":"cc","name":"D","s":"t"} ]}

yuhao commented 9 years ago

Actually by just looking at the reduced trace, I am not sure how it should look like....

Why should D be under A but not next to A?

philippebeaudoin commented 9 years ago

This is exactly the problem. If you look at the c++ code you see it's below. But there is no info in the trace to capture that. For example: { TRACE_EVENT0("a", "b"); ... TRACE_EVENT_INSTANT0("a", "c"); }

Yields the exact same trace as: { TRACE_EVENT0("a", "b"); ... } TRACE_EVENT_INSTANT0("a", "c");

The problem is that we do TRACE_EVENT0 with a single "X" event. If it was done with a "B"/"E" pair then event order would allow us to know whether event "c" is inside or outside the brackets. But with a single "X" event there is just no way to know.

I was looking at ways to add as little information as possible to the C++ trace to resolve this ambiguity in a way where we could be backwards compatible. The best idea I've add is a "depth" field. In this case the two pieces of code would yield different traces. The first one: {"pid":0,"tid":1,"ts":0,"ph":"X","cat":"a","name":"b","dur":10,"tdur":10,"depth":0}, {"pid":0,"tid":1,"ts":10,"ph":"I","cat":"a","name":"c","s":"t","depth":1} The second one: {"pid":0,"tid":1,"ts":0,"ph":"X","cat":"a","name":"b","dur":10,"tdur":10,"depth":0}, {"pid":0,"tid":1,"ts":10,"ph":"I","cat":"a","name":"c","s":"t","depth":0}

philippebeaudoin commented 9 years ago

First sentence should have been: If you look at the C++ you see "DidUpdateVisibleTiles" is called by "TileManager::UpdateVisibleTiles" not "ZeroCopyTileTaskWorkerPool::CheckForCompletedTasks".

yuhao commented 9 years ago

How do we pass the depth info around? Seems like then each API is stateful (depth is the state) whereas in our current design each API is stateless.

yuhao commented 9 years ago

This is very annoying.... If we really think about it, it's because Chrome timer resolution is not high enough. I remember discussing with Vlad about this, and sounds like there is a plan to move to a better timer implementation? Did we ever consider using something like RDTSC, which gives us machine cycle counts? @vmpstr

natduca commented 9 years ago

Darn, it looks like we really do have a problem here.

Lets avoid assuming that ts precision can be changed and focus on being resilient.

I think that implies that depth is needed, right? The minimal case that proves this is X, followed by X. We don't know if thats two things that are siblings, or whether the second X is inside.

Is that a correct restating of the problem?

@xianzhu can advise on email if it would affect our performance to track depth at the trace event level, but we should coordinate with @fmeawad in case we change the embedding api to blink and v8 and skia.

Also, how important is this compared to all the other things we have floating around? Is this blocking us getting good related events all the time, or does it only crop up 1% of the time? I don't mind either way: more the thing I'm struggling with is that we have a lot flying around here to get associated events, and it'd be good for us to have a shared sense of which needs to move first.

Sounds like in an ideal world, we would get going on this sooner than later... @vmpstr, maybe you can help here?

yuhao commented 9 years ago

The minimal case that proves this is X, followed by X. We don't know if thats two things that are siblings, or whether the second X is inside. Is that a correct restating of the problem?

Yes this is a correct restating of the problem. It is generally not a problem because we can look at the start and end timestamps and build the hierarchy correctly. The problem happens when the second X's duration is 0, and it's start timestamp is the same as the end timestamp of the first X.

I think it's important to get this right, but 1) its urgency is lower than #1161, 2) seems like adding a depth attribute to the trace is non-trivial work.

catapult-bot commented 9 years ago

Migrated to https://github.com/catapult-project/catapult/issues/1131