hrxi / tracing-loki

A tracing layer for shipping logs to Grafana Loki
Apache License 2.0
50 stars 20 forks source link

Fields of spans and log events that collide with each other are lost #8

Open Veetaha opened 2 years ago

Veetaha commented 2 years ago

Reproduction steps:

Run the code with loki and grafana for checking how the logs are delivered.

#[tokio::main]
async fn main() {
    let (loki, task) = tracing_loki::layer("{loki_url_here}", HashMap::new(), HashMap::new()).unwrap();
    tokio::spawn(task);
    tracing_subscriber::registry().with(loki).init();

    let span = error_span!("main", foo = "bar");

    async {
        // Notice that this log event has `foo` field just like the parent span
        tracing::info!(foo = "baz", bruh = "other", "log event");
    }
    .instrument(span)
    .await;
}

The log event in Grafana will be seen as:

{
  "message": "log event",
  // As you can see, the field value from the log event is lost, and contains only span's value
  "foo": "bar",
  "bruh": "other",
  "_spans": [
    "main"
  ],
  "_target": "app",
  "_module_path": "app",
  "_file": "app/src/main.rs",
  "_line": 10
}

This happens because spans are flattened into a single object. I am pretty sure it is done due to LogQL limitation that it doesn't support arrays.

Potential Solutions

Unconditionally name all fields that come from spans with the following approach:

[{span_name}_[{span_index}_]]{field_name}

span_name and span_index can be omitted if the resulting log event's field_name is entirely unique. span_index is the number index in the list of spans. It can be omitted if there is no other span with the exact same name in the event. I suppose we can have several spans of the same name if the code that creates spans names them the same, or there were recursive function calls.

With this approach, we will optimize for the regular case where it is very unlikely that there are collisions in field names. The downside is that the information from the same span can now be available under different field names dynamically if there are collisions, which makes LogQL queries for the same field name not workable.

Maybe tracing_loki could expose ways to format the resulting log event from spans and log event data in a configurable way so that users could decide how they want to resolve conflicts and how to lay out the log events in Loki at all? It could be exposed in terms of a callback that formats a JSON value as the most flexible way to provide this capability. This will also allow renaming the standard fields _file/_line/_module_path or dropping them if not needed just as users decide.

hrxi commented 2 years ago

Yea, this is sort of by design right now. I basically accepted the overriding and hoped that tracing users would not use duplicate field names. :P As you observed, LogQL has poor (no?) support for arrays.

Your solution would work to also ship the conflicting fields to Loki, but it would create unpredictable field names, making the field names depend on whether there are conflicting field names and also their call depth.

I assume you stumbled upon this problem in some real use case? Can you give an example so I can see what real problem we're solving? :)

Veetaha commented 2 years ago

I didn't face this in a real use case, I just wondered how this is implemented in tracing_loki, and tested this, because I initially wanted to make my app run inside of a span with version = "x.y.z", git_commit = "..." info, but then dropped this idea, because I always know I am running my app on my server, so there is no need in storing that info with each log event.

Though while writing this comment, I wonder if it makes sense to put that info into labels 🤔, WDYT?

Anyhow, an example of a collision would be creating a span with version = "x.y.z" for all logs in the application, that contains the version of the binary that is running. Then, suppose we have an optimistic concurrency version from a database record logged as version field in a log event, or there could be some other semantics applied to another version field in a log event (e.g. S3 object version ID, or a flavor of ice cream a user requests, or ... anything else).

I understand that this is a rare case, but I do think, that it should be at least explicitly called out in the docs, and there should be an escape hatch in form of mapping the log event as the user of this library wants.

As for the application I am working with. It is my pet project https://github.com/Veetaha/veebot-telegram

hrxi commented 2 years ago

that it should be at least explicitly called out in the docs, and there should be an escape hatch in form of mapping the log event as the user of this library wants.

I agree that it should be called out in the docs.

As for your specific example, that is easily worked around by calling these labels veebot-version and veebot-git-commit instead, once it's actually documented.