The reason this bug is important to fix is that with the new flow API we will inevitably create some tiny slices where flows bind, and incorrect placement of those tiny slices makes it hard to reason about precise causality.
@natduca @dj2 @vmpstr
Do we have a reduced test case?
Let me try to build one.
Here's the reduced trace. Looking at the code I find no issue with "TaskGraphRunner::CollectCompletedTasks": in both the original and the reduced trace it appears below "ZeroCopyTileTaskWorkerPool::CheckForCompletedTasks" as expected. However, looking at the code I would expect "DidUpdateVisibleTiles" to appear under "TileManager::UpdateVisibleTiles", although at the moment it's under "ZeroCopyTileTaskWorkerPool::CheckForCompletedTasks".
My proposal to solve this problem would be to have events include a "depth" parameter indicating how many slices are open above them on their current thread. When present, this parameter would inform subslice creation and ensure that causality is correctly reflected.
{"traceEvents":[ {"pid":27369,"tid":27369,"ts":326459134461,"ph":"X","cat":"cc","name":"TileManager::UpdateVisibleTiles","args":{},"dur":31,"tdur":30,"tts":1964871}, {"pid":27369,"tid":27369,"ts":326459134461,"ph":"X","cat":"cc","name":"ZeroCopyTileTaskWorkerPool::CheckForCompletedTasks","args":{},"dur":31,"tdur":30,"tts":1964871}, {"pid":27369,"tid":27369,"ts":326459134461,"ph":"X","cat":"cc","name":"TaskGraphRunner::CollectCompletedTasks","args":{},"dur":0,"tdur":0,"tts":1964871}, {"pid":27369,"tid":27369,"ts":326459134492,"ph":"I","cat":"cc","name":"DidUpdateVisibleTiles","args":{"stats":{"canceled_count":0,"completed_count":0}},"tts":1964901,"s":"t"} ]}
Can you fully reduce this? Shift the ts to zero, simplify the pids, remove the Chrome-specific stuff, remove the tts since it's not important for the repro, etc. Making it easy to read is good.
We should seek solutions that do not change how the C++ tracing side works. For instance, an event's position in the trace file seems useful.
{"traceEvents":[ {"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"A","dur":31,"tdur":30}, {"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"B","dur":31,"tdur":30}, {"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"C","dur":0,"tdur":0}, {"pid":0,"tid":1,"ts":31,"ph":"I","cat":"cc","name":"D","s":"t"} ]}
Actually, just looking at the reduced trace, I am not sure what it should look like...
Why should D be under A but not next to A?
This is exactly the problem. If you look at the C++ code you see it's nested below, but there is no info in the trace to capture that. For example:
{
  TRACE_EVENT0("a", "b");
  ...
  TRACE_EVENT_INSTANT0("a", "c");
}
yields exactly the same trace as:
{
  TRACE_EVENT0("a", "b");
  ...
}
TRACE_EVENT_INSTANT0("a", "c");
The problem is that TRACE_EVENT0 is recorded as a single "X" event. If it were recorded as a "B"/"E" pair, then event order would tell us whether event "c" is inside or outside the brackets. But with a single "X" event there is just no way to know.
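For comparison, here is roughly what those two snippets would produce if TRACE_EVENT0 were recorded as a "B"/"E" pair (timestamps and field values here are illustrative, not actual Chrome output). The instant-inside case:
{"pid":0,"tid":1,"ts":0,"ph":"B","cat":"a","name":"b"},
{"pid":0,"tid":1,"ts":10,"ph":"I","cat":"a","name":"c","s":"t"},
{"pid":0,"tid":1,"ts":10,"ph":"E","cat":"a","name":"b"}
The instant-outside case:
{"pid":0,"tid":1,"ts":0,"ph":"B","cat":"a","name":"b"},
{"pid":0,"tid":1,"ts":10,"ph":"E","cat":"a","name":"b"},
{"pid":0,"tid":1,"ts":10,"ph":"I","cat":"a","name":"c","s":"t"}
Even with identical timestamps, the position of the "E" relative to the "I" tells us on which side of the closing brace "c" was emitted; collapsing the pair into one "X" throws that ordering away.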
I was looking at ways to add as little information as possible to the C++ trace to resolve this ambiguity in a way that could be backwards compatible. The best idea I've had is a "depth" field. In this case the two pieces of code would yield different traces. The first one:
{"pid":0,"tid":1,"ts":0,"ph":"X","cat":"a","name":"b","dur":10,"tdur":10,"depth":0},
{"pid":0,"tid":1,"ts":10,"ph":"I","cat":"a","name":"c","s":"t","depth":1}
The second one:
{"pid":0,"tid":1,"ts":0,"ph":"X","cat":"a","name":"b","dur":10,"tdur":10,"depth":0},
{"pid":0,"tid":1,"ts":10,"ph":"I","cat":"a","name":"c","s":"t","depth":0}
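Applied to the reduced trace above, the proposal would look something like this (a sketch; the depth values assume the call structure described earlier, i.e. D is emitted from inside A but not inside B):
{"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"A","dur":31,"tdur":30,"depth":0},
{"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"B","dur":31,"tdur":30,"depth":1},
{"pid":0,"tid":1,"ts":0,"ph":"X","cat":"cc","name":"C","dur":0,"tdur":0,"depth":2},
{"pid":0,"tid":1,"ts":31,"ph":"I","cat":"cc","name":"D","s":"t","depth":1}
With depth present, the importer could place D under A rather than under B, even though D's timestamp coincides with the end of both.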
First sentence should have been: If you look at the C++ you see "DidUpdateVisibleTiles" is called by "TileManager::UpdateVisibleTiles" not "ZeroCopyTileTaskWorkerPool::CheckForCompletedTasks".
How do we pass the depth info around? It seems like each API would then be stateful (depth is the state), whereas in our current design each API is stateless.
This is very annoying... If we really think about it, it's because Chrome's timer resolution is not high enough. I remember discussing this with Vlad, and it sounds like there is a plan to move to a better timer implementation? Did we ever consider using something like RDTSC, which gives us machine cycle counts? @vmpstr
Darn, it looks like we really do have a problem here.
Let's avoid assuming that ts precision can be changed and focus on being resilient.
I think that implies that depth is needed, right? The minimal case that proves this is an X followed by an X: we don't know whether those are two siblings, or whether the second X is inside the first.
Is that a correct restating of the problem?
@xianzhu can advise over email on whether tracking depth at the trace-event level would affect our performance, but we should coordinate with @fmeawad in case we change the embedding API for Blink, V8, and Skia.
Also, how important is this compared to all the other things we have floating around? Is this blocking us from getting good related events all the time, or does it only crop up 1% of the time? I don't mind either way; the thing I'm struggling with is more that we have a lot in flight here around associated events, and it'd be good for us to have a shared sense of what needs to move first.
Sounds like in an ideal world we would get going on this sooner rather than later... @vmpstr, maybe you can help here?
The minimal case that proves this is an X followed by an X. We don't know if that's two things that are siblings, or whether the second X is inside. Is that a correct restating of the problem?
Yes, this is a correct restating of the problem. It is generally not a problem because we can look at the start and end timestamps and build the hierarchy correctly. The problem happens when the second X's duration is 0 and its start timestamp is the same as the end timestamp of the first X.
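For reference, a minimal trace that hits this case (names and values are placeholders):
{"pid":0,"tid":1,"ts":0,"ph":"X","cat":"a","name":"outer","dur":10},
{"pid":0,"tid":1,"ts":10,"ph":"X","cat":"a","name":"tiny","dur":0}
From the timestamps alone we cannot tell whether "tiny" was emitted just before "outer" closed (a child) or just after it returned (a sibling).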
I think it's important to get this right, but 1) its urgency is lower than #1161, and 2) adding a depth attribute to the trace seems like non-trivial work.
By tiny slices, I mean instant events or X events with a duration of 0.
Currently, there are cases where tiny slices are not correctly built into the hierarchy.
As an example, in this trace (https://github.com/natduca/measurmt-traces/blob/master/mobile/googlemaps_ipc_newbinding.html), there is an 'X' event TaskGraphRunner::CollectCompletedTasks (starting at 1,935.145 ms) with a duration of 0, and an 'I' event DidUpdateVisibleTiles right after it (starting at 1,935.175 ms). Both tiny slices are misplaced with respect to the overall slice hierarchy.