Handling giant traces (e.g. traces with many RPC requests)

zbjornson commented 8 years ago

Parts of our application make tons (1,000s to 10,000s) of RPC requests within a single express action. There's no need to log all of these RPC spans -- I would rather not trace RPC calls in these routes and instead use custom spans to manually aggregate the RPC calls.

Excluding a hook only works on the application-level. That issue aside, it also seems like excluding the mongodb-core hook breaks custom spans (ERROR:@google/cloud-trace: Spans can only be created inside a supported web framework), although excluding the http hook does not break this. (Is that why 'mongodb-core' is not listed as a valid entry in the config file?)
Adding routes to ignoreUrls seems to prevent custom spans from working within the routes.

Any suggestions for how this scenario could be handled?

Seems like an ideal situation for using decorators (or an ES5-compatible equivalent) to be able to provide per-route/action settings.

matthewloring commented 8 years ago

Hmm, this is an interesting use case and unfortunately you're running into a few different complexities.

Our patching of database frameworks serves both to generate a span and to ensure that information about the root span for the current request is propagated across async boundaries correctly by cls. Disabling a database patch removes both of these functions causing custom spans to be unable to identify a root span as their parent in some cases. When a child span such as a custom span or database span cannot identify a root span as its parent, the error message you quoted is displayed and the span is dropped (the trace backend does not currently support broken trace graphs).
Adding a url to the ignoreUrls causes no root span to be generated for that incoming request to keep performance overhead low which has the same effect described above: the custom spans are unable to associate with a root span as their parent and are thus not recorded.

For starters, we could add a silent mode for databases that would maintain their patching for asynchronous context propagation but prevent them from generating spans. This would allow custom spans without also generating mongodb spans. However the setting would be applied globally so you would have to use custom spans throughout your codebase wherever you wish to time or aggregate rpcs.

In the longer term, we could explore extending this to per route actions. For that, we should figure out what actions we would want to support and what the performance impact would be (both with custom per route actions and without).

How does this approach sound?

zbjornson commented 8 years ago

Ah, I see... even if I'm just using express and making http requests, excluding mongodb-core could break the propagation? (That doesn't sound quite right...)

Silent mode sounds like it could be done without much of a perf hit, but because it would be global... I started to write that a more useful near-term solution for me would be the ability to create custom root spans so that I could ignore that route and instrument it manually, but I think the mongodb hook would find my custom root span and attach that(?). Making the custom root span on a different namespace than com.google.cloud.trace (edit or different context) would avoid that problem (edit but adds complications for figuring out what to descend from).

Per-route seems ideal, but you're right that e.g. adding a different middleware to each route as well as a different namespace might be costly. Frankly in this case my per-route setting would probably disable all of the hooks and then I'd instrument manually, which would have the same effect as above, if it works.

matthewloring commented 8 years ago

Unfortunately, there are many modules that break context propagation. You can browse through the issues on cls for some examples. In your example, it just happens that mongodb-core is causing the context loss between express and http and it happens that our patch fixes this. If it were a different module causing the loss we would need to investigate a fix independently.

A lot more discussion on context propagation issues/solutions is taking place on the async-wrap proposal for node core as well as the zones TC39 proposal.

Solving the issue at hand with custom root spans was my initial thought as well but I'm not sure we want to complicate the api with a separate span type that cannot accumulate child spans generated by the hooks. It sounds like a silent mode for the db frameworks will be a good first step and we can explore the per-route approach as a potential follow on.

zbjornson commented 8 years ago

Thanks for explaining that -- I've not used cls before.

Other potential solution I thought of is building aggregation/subsampling into this library. Would have to figure out what the "grouping" key is for each hook (e.g. for mongodb commands it could maybe be "db" [the database and collection]), but maybe if more than some number of spans are added with the same grouping key value, a summary span is added that replaces all of them. (Maybe a few of the small spans are also included as examples.)

(PS -- thanks for the proposal for the silent mode, but please don't prioritize that or aggregation on my behalf. We're okay for the time being with manual tracing.)

matthewloring commented 8 years ago

I can socialize the idea of aggregating in the agent to some other people working on tracing here. Our general approach (performance overhead permitting) has been to collect all information possible avoiding summarization in the agent. Reporting all of the information will let the UI summarize/aggregate nicely (which it does not yet do) while still allowing summaries to be expanded if someone needs more information.

We will make sure to get custom root spans working so that manual tracing can cover more cases and go from there.

zbjornson commented 8 years ago

Yeah, display-side aggregation would be a good solution. As nervous as I am about performance, this lib seems to benchmark pretty well (~single-digit percent difference, but hard to accurately measure given the variability of RPC requests). -- Do you know what the max request body size is for the PATCH route? That (along with the potential for server-wide GC pressure) was my other concern and didn't see it documented.

matthewloring commented 8 years ago

I do not know the exact maximum body size but I have run into an upper bound in testing and the server gives 413 response codes back when this occurs. You could try increasing the trace buffer size and putting load on your app until you start seeing 413s. As for GC pressure, that would depend a lot on the existing allocation patterns in your app. Increasing the buffer size will have some impact on GC but the exact impact will be app specific.

matthewloring commented 7 years ago

I don't believe there is outstanding action to be taken on this issue so I'm closing for now. Feel free to reopen if you have any follow ups.

googleapis / cloud-trace-nodejs

Handling giant traces (e.g. traces with many RPC requests) #297