grafana / agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

Feature: automatic logging of span events, exceptions #653

Closed LouisStAmour closed 3 years ago

LouisStAmour commented 3 years ago

Not sure if this is already on the roadmap or a dupe, but I've been using OpenTelemetry traces basically as a replacement for logging, including the occasional span event and exception recording.
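
For concreteness, this is roughly the instrumentation pattern I mean, sketched with the OpenTelemetry Go API (the tracer name, event, and attributes are just illustrative):

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

func handleRequest(ctx context.Context) error {
	ctx, span := otel.Tracer("example").Start(ctx, "handleRequest")
	defer span.End()

	// A span event stands in for a structured log line.
	span.AddEvent("cache.miss", trace.WithAttributes(
		attribute.String("key", "user:42"),
	))

	if err := doWork(ctx); err != nil {
		// RecordError attaches an "exception" event to the span.
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

func doWork(ctx context.Context) error { return errors.New("boom") }

func main() { _ = handleRequest(context.Background()) }
```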

It would be nice if metrics, events, and exceptions could be logged in real time. Then, after making a tail-sampling decision but before deleting traces, exemplars could be added to metrics for kept traces, and root spans could be logged at that point.
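
To illustrate what I mean by attaching exemplars to metrics for kept traces, here's a rough sketch using Prometheus client_golang's exemplar support (the metric, label name, and helper function are hypothetical; this is just the shape of the idea):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

var latency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "request_duration_seconds",
	Help: "Request latency.",
})

// After a tail-sampling decision keeps a trace, attach its ID as an
// exemplar so the metric links back to the kept trace.
func observeWithTrace(seconds float64, traceID string) {
	if eo, ok := latency.(prometheus.ExemplarObserver); ok {
		eo.ObserveWithExemplar(seconds, prometheus.Labels{"trace_id": traceID})
		return
	}
	latency.Observe(seconds)
}

func main() {
	observeWithTrace(0.25, "0123456789abcdef0123456789abcdef")
	fmt.Println("recorded")
}
```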

The only advantage I can think of to waiting until after tail sampling to log span events and exceptions would be recording whether the trace was kept or not, but it seems to me that having real-time data is better than having the most complete data.

I've thought previously about maintaining two streams of metrics and logs: one real-time and one, more accurate, for archival purposes that can re-order data to better fit Cortex and Loki requirements.

Okay, well, it sounds like I have two feature requests here: one for logging span events and exceptions and a second one for real-time streaming metrics vs historical/tail-sampled ones.

A third feature request is to support other inputs the OTel Collector supports, such as OTLP for metrics and logs, though I know that's long-term. And Loki's concept of a log (text only) is very different from that of Fluentd or OTLP, for example, where structured logging is possible along with predictable/standard resource attributes.

^ Completely off topic, but for Grafana product owners: what I struggle with most in transitioning to Grafana as a developer is that I'm really used to seeing higher-level objects in my UI, such as "group by distinct" for Users, Sessions, Pages/Requests, API tokens, Browser UA strings, and more. I can see how metrics can help, because derived metrics can have high cardinality and automatic aggregation (to see, for example, which function-call spans take the longest), though it's still hard to highlight outliers for further analysis. To that end, especially when including frontend exceptions, there's often a need to group by exception and to ignore repeated occurrences, especially those caused by third-party browser plugins. Obviously some of that logic can live in the client, but for apps that aren't frequently updated, one would want to maintain it server-side.

I won't even get into how I'd need to combine logs and traces to ensure completeness: I can use traces everywhere except to monitor my tracing code, such as Grafana Agent itself, or so it feels like.

LouisStAmour commented 3 years ago

Heh, regarding group by distinct: I just realized I might effectively be asking for "dynamic traces," or the ability to convert any arbitrary never-ending sequence of events (like spans) into parent-child groupings. It's possible a frontend could do this automatically given a limited/recent sample, but it highlights that sometimes you want more groupings than a request-based trace offers, such as traces grouped by session, which would be hard to log given how long a session takes to expire (longer than the span-grouping-by-trace timeout...)

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

rfratto commented 3 years ago

A third feature request is to support other inputs the OTel Collector supports, such as OTLP for metrics and logs, though I know that's long-term. And Loki's concept of a log (text only) is very different from that of Fluentd or OTLP, for example, where structured logging is possible along with predictable/standard resource attributes.

Thanks. We've thought about this a few times, even before we designed the Grafana Agent.

Ultimately we decided to build an agent that's focused on Grafana Labs' opinionated telemetry stack (Prometheus/Loki/Tempo/Grafana). We recognize that there are a ton of agents that support translating between vendors, which is a low-level problem that is difficult to do in a fully compatible and idiomatic way. We didn't really want to reinvent the wheel, so we went for a focused agent instead.

By building a focused agent, we get 100% compatibility with the backends we picked by default. This lets us focus on higher-level problems like user experience, sharding, and the out-of-the-box experience. It's just a different trade-off: we give up backend flexibility in exchange for more compatibility. The plus side is that our stack is all open source, so you're not stuck with a hosted solution.

Honestly, if you really need to transform data between vendors, I'd recommend using one of the agents that tries to solve problems at that level, like Telegraf, Fluentd, Vector, or the OpenTelemetry Collector. They're all good; they just make different trade-offs than we do.

That being said, no decision is permanent and we're always willing to revisit this down the line, but it's still a "not right now" at the time of writing.

(PS: Next time, please create one issue per feature request :) I didn't realize this question was in here until it was pointed out to me.)

mapno commented 3 years ago

Hi @LouisStAmour, sorry I didn't reply earlier, I must have missed the issue.

I opened an issue to explore the possibility of logging events/exceptions with the automatic logging exporter (see https://github.com/grafana/agent/issues/750). Note that the processor can log any attribute that is in a span, on top of the default ones.
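
Roughly, the idea is: for each span, emit a log line with the default fields plus whatever attributes you configure. A simplified sketch of the shape of the output (illustrative Go, not the agent's actual code):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Simplified sketch of the automatic-logging idea: for each finished
// span, emit a logfmt-style line with the default fields plus any
// span attributes the user configured.
type Span struct {
	TraceID, SpanID, Name string
	Duration              time.Duration
	Attributes            map[string]interface{}
}

func logLine(span Span, extraAttrs []string) string {
	parts := []string{
		"traceid=" + span.TraceID,
		"spanid=" + span.SpanID,
		fmt.Sprintf("name=%q", span.Name),
		"duration=" + span.Duration.String(),
	}
	for _, key := range extraAttrs {
		if v, ok := span.Attributes[key]; ok {
			parts = append(parts, fmt.Sprintf("%s=%v", key, v))
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	s := Span{
		TraceID:    "4bf92f3577b34da6",
		SpanID:     "00f067aa0ba902b7",
		Name:       "GET /users",
		Duration:   42 * time.Millisecond,
		Attributes: map[string]interface{}{"http.status_code": 500},
	}
	fmt.Println(logLine(s, []string{"http.status_code"}))
}
```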

As for the second feature request

real-time streaming metrics vs historical/tail-sampled ones

I'm having trouble following this. Do you mean two sets of metrics: one with all traces and one with only non-discarded traces?

LouisStAmour commented 3 years ago

Do you mean two sets of metrics: one with all traces and one with non-discarded traces?

No, rather I'm thinking of scenarios where you have trade-offs in the underlying metrics architecture such that you can't go back and correct a metric after it's logged with only partial data. For example, say 2 out of 3 servers reported normal latency, but the third had extreme latency that should have skewed the average, and its data took an extra minute or two to arrive. With real-time charts, you would report the data you have, as you have it, and prefer current data over historical data. You would wait at most a few seconds, because the longer you wait for data, the less real-time your graph becomes.
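
To put numbers on it, a trivial sketch (the values are made up):

```go
package main

import "fmt"

// Two servers report 100ms on time; a third sample of 2000ms arrives
// minutes late. The real-time average (partial data) and the
// historical average (complete data) disagree, and you can't rewrite
// the earlier point once it's stored.
func avg(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	onTime := []float64{100, 100} // ms, received within the window
	late := 2000.0                // ms, received minutes later

	fmt.Printf("real-time avg:  %.1f ms\n", avg(onTime))               // 100.0
	fmt.Printf("historical avg: %.1f ms\n", avg(append(onTime, late))) // 733.3
}
```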

However, if you wanted a report on historically accurate data later, you would need to process the metrics knowing all the data in advance, as if there were no delays in receiving it. You might allow up to 10 minutes for all the data to arrive in batches via OTel and for all traces to complete before sending the data to be archived via Cortex, since timestamps must always increase. The historically accurate stream could thus be delayed by minutes as the price of its accuracy.

Does this make more sense? The key ask is:

I've thought previously about maintaining two streams of metrics and logs: one real-time and one, more accurate, for archival purposes that can re-order data to better fit Cortex and Loki requirements.

Right now we have out-of-order problems and deal with them by generally using the agent's timestamp as the time a metric or log occurred. I'm suggesting that in a real-time scenario we could discard older data rather than log it, and that when archiving we could take the time to re-order data rather than re-timestamp it.
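
A rough sketch of the two paths I'm imagining (the tolerance and delay values are arbitrary):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type Sample struct {
	TS    time.Time
	Value float64
}

// Real-time path: prefer current data, drop stragglers past a small tolerance.
func keepRealtime(s Sample, now time.Time) bool {
	return now.Sub(s.TS) < 5*time.Second
}

// Archival path: hold samples for a fixed delay so late arrivals can be
// slotted in, then flush in timestamp order so timestamps only increase.
func archivalFlush(buf []Sample, now time.Time, delay time.Duration) (ready, rest []Sample) {
	for _, s := range buf {
		if now.Sub(s.TS) >= delay {
			ready = append(ready, s)
		} else {
			rest = append(rest, s)
		}
	}
	sort.Slice(ready, func(i, j int) bool { return ready[i].TS.Before(ready[j].TS) })
	return ready, rest
}

func main() {
	now := time.Now()
	buf := []Sample{
		{TS: now.Add(-12 * time.Minute), Value: 100},
		{TS: now.Add(-11 * time.Minute), Value: 2000}, // arrived late, out of order
		{TS: now.Add(-1 * time.Minute), Value: 100},
	}
	ready, rest := archivalFlush(buf, now, 10*time.Minute)
	fmt.Println(len(ready), "flushed in order;", len(rest), "still buffered")
}
```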

But it's just another feature suggestion, or another way of looking at how the Agent creates logs and metrics automatically from traces or other sources.

The ideal behaviour would be to correct the metrics as new data arrives, so you could watch both real-time and historical data stream in and affect the graph live, but Cortex's requirement that timestamps only increase, due to how it stores data, doesn't allow for this. So the compromise, to me, is to have two different metrics streams: one real-time, and one delayed yet historically accurate, given batching or processing of metrics in advance. The same applies to logs.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.