open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0

Proposal: Supporting Real User Monitoring Events in OpenTelemetry #169

Open alolita opened 2 years ago

alolita commented 2 years ago

Real User Monitoring in OpenTelemetry Data Model

This is a proposal to add real user monitoring (RUM) as an independent observability tool, or ‘signal’, to the OpenTelemetry specification. Specifically, we propose a data model and semantics which support the collection and export of RUM telemetry.

Motivation

Our goal is to make it easy for application owners to move their real user monitoring (RUM) telemetry across services. We aim to accomplish this by providing application owners with a standardized, platform-agnostic tool set for recording RUM telemetry. Such a tool set would include (1) a common API and (2) SDKs that implement the API and support multiple platforms, including web applications and native mobile applications.

To achieve this goal, we propose a modification of the OpenTelemetry specification to support collecting and exporting RUM telemetry. Specifically, OpenTelemetry currently supports three signals: tracing, metrics and logs. We propose adding a fourth signal, RUM events, which will be used to record telemetry for interactions between end-users and the application being monitored. See the Alternatives section for a discussion of why we propose a new signal over using an existing signal.

Background

What is RUM?

RUM allows customers to monitor user interactions within their applications in real time. For example, RUM can provide application owners with insights into how users navigate their application, how quickly the application loads for users or how many new users tried the application. RUM provides application owners with a way to rapidly address issues and improve the user experience.

Examples of RUM use cases include:

To enable application monitoring, RUM collects telemetry (e.g., button clicks, load times, errors) from applications with user interfaces (e.g., JavaScript in browsers, or native Android or iOS applications) and dispatches this telemetry to a collection service.

RUM Model

RUM is analogous to, but semantically different from, tracing. While tracing records a compute operation, RUM records data relating to the experience of a user performing a task. We refer to the interaction between a user and an application to perform a task as a session. The diagram below shows the structure of a RUM session.

RUM Session Data Model

A session represents the interactions that occur between a user and an application while the user works to accomplish a task. Because an application is UI driven, RUM records telemetry based on which page (or UI) the user is viewing. This (1) allows engineers to correlate events with the UI that generated them, and (2) allows designers to view how users navigate the application. Pages have a set of attributes (an attribute is a key/value pair) and a list of events (an event is a named and timestamped set of attributes).
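As a rough illustration (property names here are illustrative, not part of the proposal), the session → pages → events nesting described above could look like this in JavaScript:

```javascript
// Hypothetical in-memory shape of a RUM session: a session holds attributes
// plus a list of pages; each page holds attributes plus a list of named,
// timestamped events.
const session = {
  sessionId: 'a8cc5ef0-9cd9-11eb-a8b3-0242ac130003',
  attributes: { browser: 'Chrome', operating_system: 'Android' },
  pages: [
    {
      pageId: '/console/home',
      attributes: { path: '/console/home' },
      events: [
        {
          name: 'com.amazon.aws.dom_event',
          timestamp: 1591898400000, // epoch millis on the client
          attributes: { event: 'click', element_id: 'submitButton' }
        }
      ]
    }
  ]
};
```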

Because RUM aims to aggregate data from multiple sessions into metrics, it is unnecessary and impractical to export entire sessions from a client to a collector. Instead, we export events as they occur and aggregate events from multiple sessions in the collector, or later on using batch processing. The diagram below visualizes this relationship.

RUM Event Collection

Internal Details

The OpenTelemetry specification currently defines three signals: tracing, metrics and logs. We propose adding a fourth signal, RUM events, which would provide observability into real user interactions with an application.

RUM Event Context Definition

RUM records and dispatches telemetry, such as button clicks and errors, in near real-time. To support aggregating this data across dimensions, context must be propagated with each event. The context for an event includes session and page attributes. Session and page attributes represent the dimensions by which events will be aggregated.

For example, consider a JavaScript (JS) error event. Context such as (1) page ID and (2) browser type must be propagated with the event to efficiently aggregate metrics such as (1) number of JS errors by page and (2) number of JS errors by browser type.
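Continuing the JS error example, a collector-side aggregation over these dimensions might be sketched as follows (the event shape and field names are assumptions for illustration, not part of the proposal):

```javascript
// Sketch: counting JS error events by page and by browser type, assuming each
// exported event carries its propagated session/page context inline.
function aggregate(events) {
  const byPage = {};
  const byBrowser = {};
  for (const e of events) {
    if (e.type !== 'js_error') continue; // only aggregate error events here
    byPage[e.pageId] = (byPage[e.pageId] || 0) + 1;
    byBrowser[e.browser] = (byBrowser[e.browser] || 0) + 1;
  }
  return { byPage, byBrowser };
}
```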

Events are grouped by (1) session and then (2) page. Session fields include:

| Field Name | Type | Description |
| --- | --- | --- |
| Resource | Resource | Uniquely identifies an application. |
| User ID | string | Identifies the user of an application. This can be either a random ID for unauthenticated users, or the ID of an authenticated user. |
| Session ID | string | Identifies a series of interactions between a user and an application. |
| Attributes | map | Session attributes are extensible. For example, they may include data such as browser, operating system or device. |

Pages represent discrete UIs, or views, within an application. For web applications, pages can be represented by a URL, or more commonly, a subset of the URL such as the path or hash fragment. Native mobile applications will have a different type of ID for their pages. Page fields include:

| Field Name | Type | Description |
| --- | --- | --- |
| Page/View ID | string | Uniquely identifies a discrete user interface within an application. For example, a web application may identify pages by the URL's path or hash fragment. |
| Attributes | map | Page attributes are extensible. For example, they may include data such as the URL of the web page. |

RUM Event Definition

Pages generate zero or more events. Events store and transmit information about an interaction between a user and the application being monitored. Event fields include:

| Field Name | Type | Description |
| --- | --- | --- |
| Timestamp | uint64 | An epoch timestamp in milliseconds, measured on the client system when the event occurred. |
| Event type | string | Uniquely identifies an event schema. The event type contains an event name prefix (e.g., com.amazon.aws) followed by an event name (e.g., dom_event). When the event is sent to a collection service, the event schema instructs the collection service how to validate and deserialize the event details. |
| Details | object | Each event has a unique schema. Event schemas are not fixed -- they may be created, modified and removed, and are therefore outside of the scope of this data model. This field contains a JSON object. This object adheres to a schema which is unique to the event type. |

RUM Event Types

Because there is no fixed set of RUM events (RUM event types may be created, modified or removed), specific events are not part of the RUM data model. Examples of RUM event types may include, but are not limited to:

Example of a RUM event record

We display this example as a JSON object, as JSON is natively supported by JavaScript and web protocols. Alternatively, the SDK may transmit the record as a protobuf.

{
  resource: {
    application_id: '2ecec2d5-431a-41d5-a28c-1448c6284d44'
  },
  user_id: '93c71068-9cd9-11eb-a8b3-0242ac130003',
  session_id: 'a8cc5ef0-9cd9-11eb-a8b3-0242ac130003',
  session_attributes: {
    browser: 'Chrome',
    operating_system: 'Android',
    device_type: 'Mobile'
  },
  page_id: '/console/home',
  page_attributes: {
    host: 'console.amazon.aws.com',
    path: '/console/home',
    hash: '#about'
  },
  event: {
    timestamp: 1591898400000,
    type: 'com.amazon.aws.dom_event',
    details: {
      event: 'click',
      element_id: 'submitButton'
    }
  }
}

What does this data look like on the wire?

Events are human generated and are therefore sparse. We estimate about 1-60 events per minute, per user, depending on the application. The number of events for a single session is small; however, because of the volume of users, the cost of network calls and storage may be high compared to the value of the data, and therefore the number of events may be capped or events may be sampled. For example, events for a session may be capped at a few hundred.
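One way such a cap could be enforced client-side is a simple per-session counter. This is only a sketch; the function name and default cap are illustrative:

```javascript
// Sketch of a client-side guard that caps exported events per session.
// Once the cap is reached, further events are dropped rather than exported.
function makeSessionLimiter(maxEvents = 500) {
  let count = 0;
  return function shouldExport() {
    if (count >= maxEvents) return false; // cap reached: drop the event
    count += 1;
    return true;
  };
}
```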

Alternatives / Discussion

Why create a RUM event signal instead of using the log signal?

Benefits of transmitting RUM telemetry using the log signal include: (1) less work would be required to modify and implement the OpenTelemetry specification, and (2) the complexity of the OpenTelemetry specification would not increase substantially.

We propose creating a new data model rather than using the existing logs signal. Using logs would require soft contracts between (1) the application and the SDK and (2) the SDK and the collector. Such soft contracts, without a standardized and strongly typed API, could fracture SDK implementations. This would affect maintainability and the portability of RUM telemetry.

Some aspects of the RUM signal may also be cross cutting concerns, which is not supported by the log signal. For example, it may be valuable to propagate RUM context (e.g., session ID, page ID, UI events) across API boundaries, so that downstream executions can be associated with the user interactions that triggered them.
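For instance, propagating RUM context across an API boundary could look like the sketch below; the header names are hypothetical and not part of any OpenTelemetry specification:

```javascript
// Sketch: attach RUM context to outgoing request headers so downstream
// executions can be associated with the user interaction that triggered them.
// Header names are illustrative only.
function withRumContext(headers, ctx) {
  return {
    ...headers,
    'x-rum-session-id': ctx.sessionId,
    'x-rum-page-id': ctx.pageId
  };
}
```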


By creating a new signal, we get stronger typing at the expense of adding complexity. For example, we would not create new signal types for databases, pub sub, etc.

We view databases and pub/sub as specific technologies that need to be monitored, while tracing, metrics and logging are monitoring technologies. Our proposition is that (1) real user monitoring, like tracing, metrics and logging, is a monitoring technology and that (2) there are advantages to treating real user monitoring as a first-class monitoring technology within OTel.


Could we use semantic conventions instead of a new signal by packaging RUM data as traces?

The opentelemetry-js and opentelemetry-js-contrib SDKs already capture certain browser activity associated with RUM (i.e., http requests, document load behavior and DOM events) as traces. Conceptually, we view tracing as the process of recording the execution of a program. This fits very well for specific web application execution activities like HTTP requests, load timing and executions that are initiated by DOM events.

However, we view RUM as the process of recording the experience of a person interacting with a program, which is something that traces cannot effectively model. Because RUM is driven by human interactions with the application, we need a system which can capture events over a long period of time and link the events together into a timeline of the user’s experience.

RUM events can model many different types of telemetry, such as: traces, errors, sequences of DOM interactions, web vitals measurements, etc. These events must be associated with a RUM session and a view of the application (i.e., the page the user is viewing). The Splunk SDK (i.e., opentelemetry-js + splunk-sdk-javascript) makes this association by attaching the session ID and page URL to spans as attributes.
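A rough sketch of that kind of association (not the actual Splunk implementation) is a processor that stamps every span at start time. The class below mirrors the onStart/onEnd shape of opentelemetry-js's SpanProcessor interface, but is written as plain JavaScript with illustrative attribute keys so the sketch stays self-contained:

```javascript
// Sketch: a span processor that attaches the current session ID and page URL
// to every span as attributes when the span starts.
class SessionAttributeProcessor {
  constructor(getSession) {
    this.getSession = getSession; // callback returning the current RUM context
  }
  onStart(span) {
    const s = this.getSession();
    span.setAttribute('session.id', s.sessionId); // attribute keys illustrative
    span.setAttribute('page.url', s.pageUrl);
  }
  onEnd(_span) {} // nothing to do at span end in this sketch
  shutdown() { return Promise.resolve(); }
  forceFlush() { return Promise.resolve(); }
}
```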

The long-term problem with using traces for recording RUM sessions is that (1) there is no guarantee that each implementation behaves the same, reducing data portability, (2) many events are not traces, which violates object oriented design principles and reduces the maintainability of the SDK, and (3) it makes it more difficult to define and validate events.

Regarding (1), we would like the ability to change RUM providers with minimal changes to an application’s monitoring instrumentation.

Regarding (2), we would like the ability to define session attributes (e.g., browser, device, platform), page attributes (e.g., page ID, page URL, page interaction level) and event attributes.

Regarding (2), we would also like the ability to install plugins in the SDK which record RUM events. I don’t think using traces or logs prevents this, however I think it reduces maintainability.

Regarding (3), we would like the ability to define schemas for events so that (a) we can type-check events when implementing RUM SDKs, (b) we can verify that incoming event payloads are valid during collection, and (c) we can query the event data after it is stored.
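Point (b) could be sketched as a registry that maps event types to schemas and checks incoming payloads; the registry and schema format here are illustrative, not a proposed standard:

```javascript
// Sketch: per-event-type schema registry used by a collection service to
// validate incoming event payloads before accepting them.
const schemas = {
  'com.amazon.aws.dom_event': { required: ['event', 'element_id'] }
};

function validateEvent(evt) {
  const schema = schemas[evt.type];
  if (!schema) return false; // unknown event type: reject
  // Accept only if every required detail field is present.
  return schema.required.every(k => k in evt.details);
}
```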


Could we use semantic conventions instead of a new signal by packaging RUM data as logs?

The log signal stores unstructured data and also suffers from (1) and (3) above. In addition, it might be beneficial to separate log and RUM event traffic at the source so that the collection service doesn’t need to separate them.


Could we achieve stronger type safety on top of existing log or trace signals, for example, by adding a projection layer on top of these signals?

Potentially -- would a projection layer improve understandability and maintainability compared to adding a new signal?

cc: @qhanam

svrnm commented 2 years ago

This is a great summary, thanks for taking the time to write it down!

I have a few remarks & questions:

Oberon00 commented 2 years ago

I agree that this is the biggest challenge with the current span/trace concept, but I argue that this could be modelled with a span, if either spans can be created without a finish timestamp and be updated later, or there is a way to create 2 events and connect them to each other some time later.

I agree with that. Maybe we could even model spans based on logs, to make the hypothetical "streaming SDK" the default (at least it sounds interesting as a thought experiment).

Could we use semantic conventions instead of a new signal by packaging RUM data as traces?

If traces are problematic (which they are in current implementations/protocols), how about:

Could we use semantic conventions instead of a new signal by packaging RUM data as logs?

There is no reason semantic conventions couldn't also apply to logs, which can carry highly structured data in OTel AFAIK.

All the arguments brought up against using semantic conventions for RUM events are general arguments against semantic conventions and not specific to RUM IMHO. And I do agree that a strongly typed API would be beneficial. The idea of "typed spans" was around from the beginning. A strong use case could help give it traction (even if we start with "typed logs").

weyert commented 2 years ago

How would you integrate this with for example Segment (https://segment.com)?

stefaneberl commented 2 years ago

There is a last point that's always critical with User Monitoring and should be part of the considerations: storing a session and user id leads you directly into privacy conversations: you might be able to send pure technical telemetry (timings, errors) but not associate them with anything that might help to identify the user, which breaks the whole session concept.

I don't think this OTEP should worry too much about privacy concerns (e.g. GDPR). Storing and setting PII, e.g. session_id, should be handled by the web app/mobile app developers.

qhanam commented 2 years ago

Thanks for this great feedback and insight!

I am extremely in favour of re-using traces/spans instead of "RUM Events", so let me argue for it: it adds no complexity at all and is a logical extension of the existing concepts (i.e. for me a span is just 2 connected events): a session is a span, a page load is a span, DOM events are events within that "page span", remote calls are spans, errors are events (or logs), web vitals are spans, a "click" or any other kind of interaction is an "event".

Does this propose to model the entire session as a trace? Would the trace ID in this case be the session ID -- with multiple remote calls having the same trace ID?

This sounds like an elegant solution, since user interactions are inherently connected to compute operations. However, are there technical implications of using a trace to record events over a long period of time?

For example, monitoring systems handle traces and events in different ways:

Our concern is that certain spans from a single trace will need to go to different destinations for processing (in this example, real user monitoring or distributed tracing) depending on their contents: the real user monitoring system aggregates events, while the distributed tracing system builds trace graphs. Since spans would be performing two different functions, we were thinking it might be better to model these types of events separately rather than (1) adding functionality to spans or (2) creating semantic conventions for spans, to handle both use cases.

I agree with that. Maybe we could even model spans based on logs, to make the hypothetical "streaming SDK" the default (at least it sounds interesting as a thought experiment).
...
There is no reason semantic conventions couldn't also apply to logs, which can carry highly structured data in OTel AFAIK.

All the arguments brought up against using semantic conventions for RUM events are general arguments against semantic conventions and not specific to RUM IMHO. And I do agree that a strongly typed API would be beneficial. The idea of "typed spans" was around from the beginning. A strong use case could help give it traction (even if we start with "typed logs").

So in theory everything (i.e., spans, metrics, events) could be streamed through logs? Is the advantage of this that the OTLP handles all data in the same way, and we offload the packaging and unpackaging of data to the SDK and collector respectively?

For real user monitoring events, one of the things we want to do is to associate each event with a schema so that the monitoring system knows how to read each event, and can aggregate or query the events. It would be useful for all SDKs to use the same format for specifying meta data such as this to ensure the data is portable across monitoring providers.

There is a last point that's always critical with User Monitoring and should be part of the considerations: storing a session and user id leads you directly into privacy conversations

Does this depend on whether the user ID and session ID can be linked to an individual? If data needs to be aggregated per-session then we require a session ID. The user ID and session ID can be random; i.e., a UUID v4 generated exclusively for the purpose of client-side application monitoring. It is also up to the application to ensure no PII is logged through client-side monitoring, or that, if PII is logged, it is properly managed.

svrnm commented 2 years ago

I agree with that. Maybe we could even model spans based on logs, to make the hypothetical "streaming SDK" the default (at least it sounds interesting as a thought experiment).

@Oberon00 I love that thought experiment. I've seen multiple cases where logs were a substitute for traces/spans, and events that needed to be stitched together were floating around loosely.

The idea of "typed spans" was around from the beginning. A strong use case could help give it traction (even if we start with "typed logs").

Same here, I have not been involved in those initial discussions, but having the possibility to type signals and validate their schema along their way into a data store would be beneficial outside RUM. Actually, if you look into semantic conventions for traces, there are some types & rules already, e.g.: "http.url MUST NOT contain credentials passed via URL"

How would you integrate this with for example Segment (https://segment.com)?

@weyert What kind of integration would you expect? Like segment integrates with google analytics?

I don't think this OTEP should worry too much about privacy concerns (e.g. GDPR). Storing and setting PII, e.g. session_id, should be handled by the web app/mobile app developers.

@stefaneberl not worry too much but make sure that requirements are not making the life of developers hard, e.g. making a session_id mandatory introduces tracking out of the box and it might be good to have a "stateless" version. Or it's just a question of wording, like in the semantic conventions on user_id: "Given the sensitive nature of this information, SDKs and exporters SHOULD drop these attributes by default and then provide a configuration parameter to turn on retention for use cases where the information is required and would not violate any policies or regulations."

However, are there technical implications of using a trace to record events over a long period of time?

@qhanam Yes, but I argue that those implications are good (see @Oberon00's thought on "streaming"), since RUM is not the only use case for that.

I have to be honest, I might just be dense and not understand the point you are making on handling RUM events and traces in different ways: when I do performance analysis of Real User Monitoring data, I look at the session in a waterfall view or in a graph view where a node is a page and an edge is a transition. When I look at a page, I want to see events (first byte time, DOM building time, AJAX events, resource loading) in a waterfall view or in a graph view showing me dependencies. That's no different to me from distributed tracing. So in both cases, the SRE and the Software Engineer can work with the same kind of data, with similar (or even the same) kinds of visualizations.

(2) creating semantic conventions for spans, to handle both use cases.

There are already semantic conventions for traces/spans: https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace/semantic_conventions

For real user monitoring events, one of the things we want to do is to associate each event with a schema so that the monitoring system knows how to read each event, and can aggregate or query the events. It would be useful for all SDKs to use the same format for specifying meta data such as this to ensure the data is portable across monitoring providers.

How is this different from semantic conventions?

Does this depend on whether the user ID and session ID can be linked to an individual? If data needs to be aggregated per-session then we require a session ID. The user ID and session ID can be random; i.e., UUID V4 generated exclusively for the purpose of client-side application monitoring.

weyert commented 2 years ago

How would you integrate this with for example Segment (https://segment.com)?

@weyert What kind of integration would you expect? Like segment integrates with google analytics?

Well, the feeling that I get from this proposal is that we want to send the same things twice, once as part of OpenTelemetry, and once again as part of the feature flagging and/or product analytics solution. To me, I would expect I can forward RUM events to such systems, maybe via an OTel Collector exporter, but I would expect I can share or reuse the following events:

Other integrations could be PostHog, Google Analytics, Mixpanel, etc. -- the tools a typical application UX or Product Business Analyst uses.

stefaneberl commented 2 years ago

@svrnm I agree that session_id, user_id, etc. should be optional.

mhennoch commented 2 years ago

My and @t2t2's thoughts:

We don’t see the upside of this over the already existing signals (traces, logs, metrics). The lack of a single identifier to link all of the data coming from one session could already be solved by using a Resource (for example, on the APM side, service.instance.id is a similar concept; whether the RUM side should also be service or called something else is a separate question (open-telemetry/opentelemetry-specification#1681)). Resources are supported by tracers and logs, so they can easily be shared across signals, and relevant trace information can be linked to logs.

The current resource spec does have a limitation that the resource used by a provider cannot be swapped out during the provider's lifecycle; imo this should be changed to allow swapping for use cases like RUM (e.g., swapping out the resource with a userId when the user identifies).

The rest of the concerns do seem to be more about semantic conventions and their compliance, as pointed out by @Oberon00; these concerns should rather be handled on the processing side by filtering out data that isn't useful/compliant (which you're gonna have to do on data from RUM anyway, as this data is completely untrustworthy and can be manipulated for lulz or even spam -- search google analytics referrer spam).

There's also the discussion about a streaming SDK; at least the current otel-js SDK doesn't have any limitations against implementing a processor that does something on span start and end separately (the onStart & onEnd methods on the SpanProcessor interface). This seems more an issue that there currently aren't any exporters that support streaming (at least all of the current exporters in opentelemetry-js export on span end), creating an issue for long-running or early-terminated traces (for example, if you'd want to trace tab visibility, which has a start, sometimes an end, but it can be hours, days, months...).
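The start/end streaming idea can be sketched in plain JavaScript; the send callback stands in for a hypothetical streaming exporter, and the span fields are simplified, not real opentelemetry-js span objects:

```javascript
// Sketch: export a partial record when a span starts and a completion record
// when it ends, so long-lived spans are not lost if they never end. Mirrors
// the onStart/onEnd shape of a SpanProcessor, simplified for illustration.
class StreamingSpanProcessor {
  constructor(send) {
    this.send = send; // callback standing in for a streaming exporter
  }
  onStart(span) {
    this.send({ phase: 'start', traceId: span.traceId, spanId: span.spanId,
                startTime: span.startTime });
  }
  onEnd(span) {
    this.send({ phase: 'end', traceId: span.traceId, spanId: span.spanId,
                endTime: span.endTime });
  }
}
```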

Maybe it would be better to build on the OTLP trace format to also support partial traces that can later be combined by the same trace_id & span_id into an OTLP trace?
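Combining such partial records on the backend might be sketched as a merge keyed on trace_id + span_id (field names illustrative, not OTLP's actual wire format):

```javascript
// Sketch: merge partial span records into complete spans by matching
// trace_id + span_id. Later records overwrite/extend earlier ones.
function mergePartials(records) {
  const spans = new Map();
  for (const r of records) {
    const key = `${r.trace_id}/${r.span_id}`;
    spans.set(key, { ...(spans.get(key) || {}), ...r });
  }
  return [...spans.values()];
}
```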

johnbley commented 2 years ago

I argue that representing RUM data within the existing otel tracing data model (and existing otel tracing code) is feasible. My data to support this argument is that we (Splunk) are already doing it.
All are welcome to browse https://github.com/signalfx/splunk-otel-js-web and look at our contributions to opentelemetry-js and opentelemetry-js-contrib.

Yes, there are semantic conventions to iron out (e.g., agree on what attribute represents a "session id" and how opaque the value is). Yes, there are some oddities ("we saw this exception happen but there is often no current span to attach it to... I guess we'll just create a 0-duration "span" to represent that?"). Yes, as we have moved into supporting mobile RUM with the same otel tracing technology, we have discovered more corner cases in modeling.

But! The existing otel architecture works, and works pretty well. We have shipped a product based on it, and didn't have to re-invent things to make it work, including re-using existing protocols and data processors.

I feel that the counterarguments in the section Could we use semantic conventions instead of a new signal by packaging RUM data as traces? are hand-wavy and ignore the reality that a major otel contributor is in fact doing exactly that.

But, to counter-counter-argue with two of the points raised there:

jpkrohling commented 2 years ago

I wanted to express that I also believe that this could be built on top of the tracing signal we have already. I see a session as a collection of traces, and a trace is a collection of events/spans. Perhaps we just need a new primitive in the data model, for the Session data structure?

gergas3 commented 2 years ago

Thanks for putting together this proposal @alolita ! I'm working with @johnbley, on analyzing the RUM span data after ingestion. Putting some thoughts here based on my experience with processing RUM data.

While tracing records a compute operation, RUM records data relating to the experience of a user performing a task.

Spans have worked well for us because most RUM things have a duration. This duration may mean waiting for resources to load but I imagine there can be true computations in the browser that are of interest, so I don't find this very different from the "compute operation" meaning that backend spans capture. I imagine spans would be problematic if you have to wait too long to close them, and instead you'd prefer to report at least the start of something without waiting for the end. As far as I know this is NOT the case currently with RUM things.

To me the other major difference between the proposed event model and the span model is no parent-child relationship. Is that something that is not meaningful for RUM or users do not want to see? Are there things happening in the browser that can be modeled as parent-child?

I agree that there are some RUM things that are not spans. These are metrics in my view. They have a timestamp, a metric name, some attributes and a metric value. E.g. web vitals. Not sure whether they can be linked to any other span "causing" them.

Events seem like the super type for metrics, logs, and spans, if an event is just a timestamp and stuff. Out of that stuff certain attributes are elevated to become a "field". Metrics elevate a numeric value to field-level. Logs elevate the message, spans elevate duration and causality. (Disclaimer: I'm only familiar with the span data model of OTel.)

In that spirit I like the idea of turning some attributes into fields to avoid conventions. Session ID definitely, maybe user ID and the page/view name?

I see a session as a collection of traces, and a trace is a collection of events/spans.

I agree. And maybe even backend spans would benefit from session ID. I miss "user interaction" (e.g. one click) from the grouping: session -> interaction -> trace -> spans. A session is a series of interactions, and each interaction may spawn several backend traces. Not sure if it's of interest for users.

To sum up, if mostly RUM things are the same as a span (duration and causality) then maybe we can propose adding session ID to backend span model. It would allow new kind of analysis, would put backend events in the context of preceding events. RUM could handle user ID and page view as semantic convention. Maybe we could have metrics emitted as some metric object (no duration, no causality, yes numeric value)?

scheler commented 2 years ago

To me the key takeaway from this proposal is that it introduces a new type of context in a dimension orthogonal to a trace, called a session. Typically, this is a timeline. Signals in this context are typically batched and reported periodically, say every minute. This could be called a timeslice, equivalent to a span in a trace, and would serve as a container to report metrics and logs/events. We need a way to represent the duration outside of a trace's span, and the timeslice could provide that. This would avoid the 0-duration spans.

The other part of the proposal is that it introduces "Events" as a high level construct, going against the following in the logs spec.

From OpenTelemetry's perspective Log Records and Events are different names for the same concept.

The distinction being that Log Records have a severity field and Event records have a name or type. If we want to have a common record type, maybe we could drop both these fields from the data model.

In summary, this proposal could formalize the new context type and improve on the subtlety between Logs and Events.

svrnm commented 2 years ago

From what I understand, there are now the following proposed solutions to cover RUM use cases with OpenTelemetry:

(1) Introducing a new Signal: RUM Event

The initial proposal from @alolita and @qhanam suggested a fourth signal "RUM Event" to cover end user tracing use cases.

(2) Extending existing concepts

@johnbley I argue that representing RUM data within the existing otel tracing data model (and existing otel tracing code) is feasible. My data to support this argument is that we (Splunk) are already doing it.

There were some variants on how to extend existing concepts to cover RUM use cases

(2a) Streaming SDK

@Oberon00: "Maybe we could even model spans based on logs, to make the hypothetical "streaming SDK" the default (at least it sounds interesting as a thought experiment)."

(2b) Extending spans to allow end/duration to be set later

@mhennoch: "Maybe it would be better to build on OTLP trace format to also support partial traces that can be later combined by the same trace_id & span_id into a OTLP trace?"

(2c) New Primitives

(2d) New Context Types

Additional Discussions

Additionally, there are 2 side-discussions:

(3) @weyert "the feeling that I get of this proposal is that we want to send the same things twice, once as part of Opentelemetry, and once again as part of the feature flagging and/or product analytics solution."

(4) @svrnm "storing a session and user id leads you directly into privacy conversations" (This discussion is closed as of now.)


I tried to give a condensed summary; I hope I didn't miss anyone's view or misrepresent it.

Here is my current point of view:

As stated before, I favour extending existing concepts, especially 2a or 2b, i.e., a mechanism that allows spans to be partial and thereby finished later. The big advantage I see here is that this is not only useful for user monitoring: every kind of "long-running process" (which includes user sessions) currently suffers from the fact that a parent (or linked) span might never be terminated, either by accident (a crash before Span.end()) or by nature (the end is given by the last child span and we never know if there will be more).

With that, a new event type as in (1) is not required, a "session id" (as in (2c)) is just the trace/span ID of the parent (the "real" session id could still be an attribute), and a "time slice" (2d) could be a linked span that leverages the same mechanism.
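To make the idea concrete, here is a minimal sketch in plain TypeScript (not the actual OTel API; all type and field names are illustrative) of a "partial" span whose end can be deferred, with a session modeled as a partial root span whose span id doubles as the session id:

```typescript
// Hypothetical sketch: a span record whose end time can be set later, so a
// long-running "session" span can stay open across page loads and be
// combined later by trace_id & span_id, as suggested in (2b).
interface PartialSpan {
  traceId: string;
  spanId: string;
  name: string;
  startTime: number;
  endTime?: number; // absent while the span is still "partial"
  attributes: Record<string, string>;
}

// A session is just a partial root span; child spans reference it via
// parentSpanId = session.spanId. The "real" session id can still travel
// as an attribute, as suggested above.
function startSession(traceId: string, spanId: string, realSessionId: string): PartialSpan {
  return {
    traceId,
    spanId,
    name: "session",
    startTime: Date.now(),
    attributes: { "session.id": realSessionId },
  };
}

function isFinished(span: PartialSpan): boolean {
  return span.endTime !== undefined;
}

// Emitting the same trace_id/span_id again, now with an end time, would let a
// backend merge the partial records into one finished span.
function finish(span: PartialSpan, endTime: number = Date.now()): PartialSpan {
  return { ...span, endTime };
}
```

This is only a data-model thought experiment; the open question in the thread is whether the SDKs and OTLP would allow re-emitting such a record.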

martinkuba commented 2 years ago

Thanks to @alolita and @qhanam for putting this proposal together!

A lot has been discussed already; let me add just a few thoughts for further consideration...

Sampling

One topic that has not been discussed yet is sampling. If I understand correctly, the proposal implies that events are always collected and then aggregated in the backend. I think some data should always be collected (e.g. page counts, error counts), and some should be sampled, e.g. a span representing an XHR/fetch call that is part of a trace with backend spans. Backend services should continue to sample traces/spans; otherwise this would significantly affect existing trace collection.

On the other hand, collecting client spans that are sometimes connected to backend spans and sometimes not means that existing tracing ingest systems will potentially receive a large number of single-span traces (and a potentially confusing experience for users).

If sampling should be part of this spec, then we should discuss how it can be accomplished. Sampling in RUM is inherently challenging because the instrumentation (the clients) is distributed.
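One possible way to sample consistently across distributed clients without coordination (a sketch of an assumed design, not anything agreed in this thread) is to derive the keep/drop decision deterministically from the session id, so every span belonging to a session gets the same verdict on every client:

```typescript
// Map a session id deterministically to [0, 1) using FNV-1a (32-bit),
// then compare against the desired sampling ratio. Any client that knows
// the session id reaches the same decision with no coordination.
function hashToUnitInterval(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 0x100000000;
}

// Illustrative name; not an OTel sampler API.
function shouldSampleSession(sessionId: string, ratio: number): boolean {
  return hashToUnitInterval(sessionId) < ratio;
}
```

This head-based approach keeps whole sessions together, but it cannot express "always collect page counts, sample XHR spans", which would need a per-signal policy on top.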

Interactions

Another topic I am wondering about is how to represent interactions. Should interactions be represented as independent events/logs? Or should the instrumentation capture causality, e.g. this click resulted in these network calls and compute operations? If the latter, should interactions have a duration? Our instrumentation (New Relic) currently captures interactions as spans with duration.

I think there is a use case for looking at how often users perform a certain action (page load, click). And there is a separate use case for understanding what happened in a specific session as a result of an interaction.
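As a thought experiment, the interaction-as-span-with-duration model described above could look roughly like this; the names and the rule that the interaction ends when its last caused operation ends are illustrative assumptions, not New Relic's actual implementation:

```typescript
// A caused operation (network call, compute) with its own timing.
interface Op {
  name: string;
  start: number;
  end: number;
}

// Model an interaction as a span whose duration covers the work it caused:
// it starts at the click and ends when the last caused operation finishes.
function interactionSpan(name: string, clickTime: number, causedOps: Op[]) {
  const end = causedOps.length
    ? Math.max(...causedOps.map((o) => o.end))
    : clickTime; // an interaction that caused nothing has zero duration
  return { name, start: clickTime, end, duration: end - clickTime, children: causedOps };
}
```

This supports both use cases above: counting interactions by name, and drilling into what a specific click caused via the children.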

CLS-like metrics

One last thing that comes to mind is support for the CLS metric. The value of this metric changes throughout the lifetime of a page; if it is streamed, the backend will need to update to the latest value (and discard previous values). Perhaps this is not unique to RUM, but I wonder whether there should be built-in support for these kinds of metrics.
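The "keep only the latest value" semantics could be sketched as a last-value aggregation keyed by page; this is an illustration of the requirement, not a proposal for the metrics SDK (all names are hypothetical):

```typescript
// Last-value store: CLS only grows during a page's lifetime, so when values
// are streamed, the backend keeps the newest report per page and discards
// older (or out-of-order) ones.
class LastValueStore {
  private latest = new Map<string, { value: number; timestamp: number }>();

  record(pageId: string, value: number, timestamp: number): void {
    const prev = this.latest.get(pageId);
    if (!prev || timestamp >= prev.timestamp) {
      this.latest.set(pageId, { value, timestamp });
    }
  }

  current(pageId: string): number | undefined {
    return this.latest.get(pageId)?.value;
  }
}
```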

weyert commented 2 years ago

Is there a definition of RUM that specifies what exactly is covered by it? As part of the interactions, I can imagine that something like rrweb would be useful too: https://www.rrweb.io

This raises the question of whether the current OTLP protocol supports binary blobs.

martinkuba commented 2 years ago

What would be the best next steps to move this discussion forward? Would it make sense to create a new SIG?

jkowall commented 2 years ago

I think this is a very interesting proposal, but I do believe it is close enough to existing signals that it could be represented fairly easily with a schema on top of tracing or logging. JavaScript instrumentation is essentially RUM, and we already have instrumentation for browsers within the OTel community. Where it gets more interesting is when we start looking at streaming technologies such as those @weyert mentions, along with video streaming. RUM today has a lot of challenges dealing with long-running streams and measuring stream quality. That seems outside the scope of a basic RUM system, but it also seems like the "new" problem to solve. We've been building RUM on browser instrumentation since 2008 (https://github.com/akamai/boomerang).

mc-chaos commented 2 years ago

Hi, there is a Boomerang plugin that integrates OpenTelemetry in/with beacons: boomerang-opentelemetry-plugin. We have this in production and it works great. At this time it is internal only, but we are in the process of bringing Boomerang and the OTel plugin into the "wild"... Regards, Sascha

jkowall commented 2 years ago

That's sweet @mc-chaos, will check that out!

We are also building a synthetic system based on Selenium which exports traces you can visualize in Jaeger. It's a nice hack showing that traces are perfectly suitable for browser-based data.

mc-chaos commented 2 years ago

Hi @jkowall, that sounds cool. We are thinking about a self-hosted synthetic system based on Robot Framework with the Selenium and/or Playwright framework under the hood, mostly for internal web apps, so no cloud / no SaaS. We have good experiences with instrumenting JMeter for load tests / QA tests. Regards, Sascha

AaronCreighton commented 1 year ago

Hi all,

I like a lot of the thinking I have seen above. Most of it is from a "how does this fit with the current OTel" perspective. I thought I would give my 2c from an alternative perspective, as a potential customer/user. My background is in digital analytics, primarily Google's UA platform but others too.

Digital analytics also has the desire to achieve observability, and some of the vendors get close. Adobe Analytics (if you're willing to pay for it) has very powerful slice-and-dice capabilities. Sitecore is particularly good for optimization and user-journey analysis. Then you have tag management systems that enable fast deployment of tracking for customized events. A key objective (or the key objective) is to capture the state of the user.

With the rise in concerns around privacy and tracking, there is a shift from capturing the data client-side to server-side. In addition, the deployment of custom events has gone from tag-management injections to data-layer-first approaches. Thus, whether or not you use server-side tracking, there is a move to have FE/BE developers create the needed custom events.

It would make a lot of sense to me for the telemetry a FE/BE creates for observability of the product to also be the source for observability of the user. A key benefit: user-behavior tools are used to capture errors in UX, but also in the app. In the app scenario, the current process is often to capture as much about the error as possible and then afterwards try to find out what happened in the app's FE or BE that caused it. It makes much more sense to have the RUM connected to the trace; this makes it much easier to find the cause of the error the user experienced.

Another key benefit is that engineers/developers would only have to manage one agent / data-collection process. The digital analytics products would then just be another sink, or, potentially even better, a sink that does both. There are vendor products that work in this space, such as Raygun, although its RUM is good from an app-dev perspective but not from a user-behavior perspective. Another (maybe) is Microsoft App Insights, which is in development and intends to import FE & BE OpenTelemetry; it is unclear if or how they intend to capture RUM from it.

OpenTelemetry could play a key role in this. The potential challenge I see is that, ideally, the automatically captured RUM would need to meet the requirements of the default tracking in products like GAV4, Adobe, Matomo & Tealium. If it doesn't, then digital analytics teams are going to want to use their own agent that does. I am not saying that this has to happen; the OTel community can forge its own path around RUM, with new sinks, or current ones can develop a RUM focus. However, it would increase adoption if it did.

What does this all mean from Optel implementation?

  1. What needs to be captured so that it has at least feature parity with the default tracking from Matomo or GAV4? Or is the goal to have basic, simplified RUM like Raygun? I would not want to switch (to a new sink & OTel) if it did not have feature parity with at least Matomo.
  2. These tools do have some interesting default metrics, like bounce rate; some capture mouse stops.
  3. Capture user event order? This is key to RUM / behavior analysis and to aggregating users into funnel charts or similar visuals that show user flow. For example, a user visits a page and starts a page span; a common attribute would be the previously visited page. In OTel, one method for capturing this would be a parent-child relationship? That seems to be how it is captured in tools like Sitecore.
  4. What about single-page applications? For GA we would have data-layer events that fire on a "new page". In OTel?

If I were to put this into spans:

User span, with attributes > session spans, with attributes > "page" spans (in parent-child relationship) > user events on page spans, with attributes *> FE or BE trace spans, with attributes

*Order of events on a page: also parent-child, or another method?
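Assuming plain parent-child spans, the hierarchy above might look like the following sketch; all span names and attribute keys are hypothetical, not agreed conventions, and "previous page" is shown here as a referrer attribute rather than parent-child ordering:

```typescript
// Minimal tree of RUM spans: user > session > pages > events.
interface RumSpan {
  name: string;
  parent?: RumSpan;
  attributes: Record<string, string>;
  children: RumSpan[];
}

// Create a child span under a parent and register it in the tree.
function child(parent: RumSpan, name: string, attributes: Record<string, string> = {}): RumSpan {
  const span: RumSpan = { name, parent, attributes, children: [] };
  parent.children.push(span);
  return span;
}

const user: RumSpan = { name: "user", attributes: { "user.id": "u-1" }, children: [] };
const session = child(user, "session", { "session.id": "s-1" });
const home = child(session, "page", { "page.url": "/home" });
// One way to answer "what was the previous page?": a referrer attribute.
const checkout = child(session, "page", { "page.url": "/checkout", "page.referrer": "/home" });
child(checkout, "event", { "event.type": "click", "event.target": "#buy" });
```

For SPAs, a "new page" span could be started on each route change, playing the role of the data-layer page event mentioned in point 4.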

On a side note, has anyone tried the user-interaction instrumentation presented here: https://www.npmjs.com/package/@opentelemetry/instrumentation-user-interaction? Is it useful? Does it give an alternative idea to the list above by @svrnm?

mc-chaos commented 1 year ago

Hi @AaronCreighton, we have an OTel + RUM (Boomerang.js) bridge in production, which creates metrics and OTel traces and bridges the frontend (JSP/HTML or SPA) and the backend (Java). Look at:

The EUM receiver exports via Jaeger gRPC for now, but in the next releases there will be an OTLP exporter. The OTel Collector can receive Jaeger gRPC, though, so there is no problem in the OTLP pipeline.

We would like to expand from a tracing focus to web analytics next year. We started with OTel in the backend and expanded this year to the frontend (correlation of BE + FE traces). Next year we will look more at web analytics.

Btw, we are using the "user-interaction" plugin in the boomerang-opentelemetry-plugin. It is great for frontend error analyses like "Why the heck is my DOM rendering so slow on some of the clients?" (Hint: VDI and SPAs are not the best of friends...)

Regards, Sascha

svrnm commented 1 year ago

@AaronCreighton, @mc-chaos: there's a SIG called "client-side telemetry" which is looking into RUM & OpenTelemetry. If you're not both members there, I think bringing your ideas & questions to that group would help move things forward:

Slack channel: https://cloud-native.slack.com/archives/C0239SYARD2
SIG meeting: every Wednesday at 8:00 AM PT: https://docs.google.com/document/d/16Vsdh-DM72AfMg_FIt9yT9ExEWF4A_vRbQ3jRNBe09w/edit

AkselAllas commented 1 year ago

> Hi @AaronCreighton, we have an OTel + RUM (Boomerang.js) bridge in production, which creates metrics and OTel traces and bridges the frontend (JSP/HTML or SPA) and the backend (Java). Look at:

I looked into the plugin. Sadly, for my use case, the use of Bower by Boomerang and everything being JS-only instead of TypeScript created too many conflicts and hacks. I would assume a lot of other people would have similar problems.

FWinkler79 commented 2 months ago

@jpkrohling

I wanted to express that I also believe this could be built on top of the tracing signal we already have. I see a session as a collection of traces, and a trace as a collection of events/spans. Perhaps we just need a new primitive in the data model for the session data structure?

A Graphical Perspective

I agree with that view and thought I would add a graphical representation to this discussion:

[Image: RUM architecture using traces (draw.io)]

Personally, I feel that adding an orthogonal extra signal type for RUM would be the wrong approach. And looking at the experiences from Splunk I think we have a strong indication that using traces / spans for RUM purposes is feasible.

To me, things like a session, page, user action or interaction are just additional aspects or characteristics (essentially context information) for trace spans. And while a session may group together multiple traces, it is just context information that needs to be attached to all of them (see the picture above). Likewise, the page, the user action etc. are just additional pieces of context used to correlate spans or multiple traces.

I think these context attributes need to be defined as part of the semantic conventions, e.g. as rum.* attributes, and not all of them can or will be mandatory. To me, this is one more good reason not to create an orthogonal model with lots of optional fields, but to really treat each of these context attributes as an optional characteristic of different RUM use cases. This keeps it flexible and easily extensible. Based on the presence of attributes on the trace spans, add-ons on top of a standard tracing backend can then interpret the data and generate the RUM-specific views of it. The underlying infrastructure can be based purely on traces and semantic conventions, though.
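To illustrate that add-on idea (attribute keys like `rum.session.id` are hypothetical here, not approved semantic conventions), a session view can be derived from a plain tracing backend purely by grouping spans on an optional attribute:

```typescript
// A flat span as a tracing backend might hold it: trace id plus attributes.
interface Span {
  traceId: string;
  attributes: Record<string, string>;
}

// Group trace ids by session so a session-centric RUM view can be built
// on top of ordinary traces. Spans without the attribute are simply
// ignored by the RUM view but remain valid traces.
function tracesBySession(spans: Span[]): Map<string, Set<string>> {
  const sessions = new Map<string, Set<string>>();
  for (const span of spans) {
    const sessionId = span.attributes["rum.session.id"];
    if (!sessionId) continue; // the attribute is optional, as argued above
    if (!sessions.has(sessionId)) sessions.set(sessionId, new Set());
    sessions.get(sessionId)!.add(span.traceId);
  }
  return sessions;
}
```

The same grouping works for page or interaction attributes, which is why optional context attributes keep the model extensible without an orthogonal signal.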

Regarding Sampling

One interesting question that was brought up by @martinkuba is sampling. Here, I would be curious what requirements RUM use cases have in general. If a user interaction results in two or more round trips of HTTP calls, what is the expectation for sampling these?

In the picture above, ensuring that all spans of all traces belonging to the same session are sampled would probably amount to a kind of tail-based sampling, where the keep/drop decision has to cover every span of every trace in the same session / page / user interaction / etc.

I struggle with the idea that, out of an end-to-end transaction triggered in the frontend by a user, only parts of the transaction (e.g. only page counts) are sampled while the HTTP calls resulting from it are not. Do we have more details on that?
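The tail-based flavour described above could be sketched roughly as follows; this is only an illustration of the all-or-nothing-per-session semantics (class and method names are made up), not a proposal for the Collector:

```typescript
// Buffer spans per session until a single keep/drop decision is made,
// so an end-to-end transaction is either exported completely or not at all.
class SessionTailBuffer {
  private pending = new Map<string, string[]>();
  private exported: string[] = [];

  // Hold a span (by id) until its session is decided.
  offer(sessionId: string, spanId: string): void {
    if (!this.pending.has(sessionId)) this.pending.set(sessionId, []);
    this.pending.get(sessionId)!.push(spanId);
  }

  // One decision per session: flush everything or drop everything.
  decide(sessionId: string, keep: boolean): void {
    const spans = this.pending.get(sessionId) ?? [];
    this.pending.delete(sessionId);
    if (keep) this.exported.push(...spans);
  }

  exportedSpans(): string[] {
    return [...this.exported];
  }
}
```

The hard part this sketch glosses over is the one the thread already identifies: knowing when a session is complete enough to decide, and how long spans can be buffered.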

I am curious how that resonates with the people here who think that traces and semantic conventions should be able to do the job. Does that make sense?

svrnm commented 2 months ago

@FWinkler79 this OTEP turned out to become a dedicated SIG, you can bring your suggestions, thoughts and questions to them by either attending a SIG call or via CNCF slack:

They meet every Tuesday at 9:00 AM PT; see this Google Doc for meeting notes.

The slack channel is #otel-client-side-telemetry

To get an invite to the meetings, join this Google group: calendar-client-side